Web Scraping in Python with BeautifulSoup

Enver Şanlı
5 min read · Dec 22, 2020


Python 3

Hello everyone!

Web development keeps growing, and every day new web apps are built with new technologies and techniques. Today we will talk about one of those techniques: web scraping with Python.

Python is a popular language with a strong community. It offers many techniques for web scraping, backed by solid documentation and libraries. We are going to build a basic project that scrapes data from a website. I suggest using Colab to develop Python projects: it is easy, you do not need to install anything else, and you can find it in your Google Drive.

First, I will show you how to create a Python project in Colab.

  • Go to your Google Drive
  • Click the “New” button
Click the New (Yeni) button on your Drive
  • Then open the “More” option and choose “Google Colaboratory”
  • You have now created your first Colab project for developing in Python.
  • You are ready to develop, so let’s start scraping some data!

We are going to import two libraries: requests and BeautifulSoup. BeautifulSoup is a useful scraping library that lets us pick out HTML tags from HTML content.

  • Import the requests library; we are going to use it to load a website’s HTML content.
Importing requests
  • After that, import BeautifulSoup
BeautifulSoup is ready
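
The two import steps above might look like the following in a Colab cell. This is a minimal sketch: it assumes the `requests` and `beautifulsoup4` packages are available in your environment (if they are not, `pip install requests beautifulsoup4` installs them).

```python
# HTTP client used to download the page's HTML.
import requests

# BeautifulSoup is provided by the bs4 package.
from bs4 import BeautifulSoup
```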

Now we are ready to get a web page.

Our target page will be “http://sancar.org/aziz-sancar/”. We will load that bio page’s HTML into our Python project. Then we will be able to scrape it.

  • Send a request to the website mentioned above.
Requesting the website
  • If you print the “r” variable, you will see <Response [200]> as the result, which means our request succeeded.
  • Let’s parse the HTML tags now with BeautifulSoup
BeautifulSoup in Python
  • Now BeautifulSoup has parsed the HTML tags and our HTML content is ready. We are going to print it to see the HTML content.
HTML content in Python
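
The request and parsing steps above can be sketched as below. The variable `r` and the URL come from the article; the helper names `fetch_page` and `parse_html`, the 10-second timeout, and the choice of Python's built-in `html.parser` are assumptions made for illustration, since the original screenshots are not available here.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://sancar.org/aziz-sancar/"


def fetch_page(url, timeout=10):
    """Download a page and return the requests Response object."""
    r = requests.get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx answers instead of parsing an error page.
    r.raise_for_status()
    return r


def parse_html(html):
    """Turn an HTML string into a searchable BeautifulSoup tree."""
    # "html.parser" is an assumed choice; other parsers such as lxml also work.
    return BeautifulSoup(html, "html.parser")


def main():
    r = fetch_page(URL)
    print(r)  # a successful request prints <Response [200]>
    soup = parse_html(r.text)
    print(soup.prettify())  # indented view of the parsed HTML
```

Calling `main()` in a notebook cell performs the actual request, so it needs network access; the two helpers can also be used separately.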

As you can see above, we printed the HTML content produced by BeautifulSoup. The screenshots show divs, paragraphs, and list tags. We can reach all of them by tag name, id, class, or any other attribute. Let’s see how to do that.
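
To make those lookup options concrete, here is a small sketch on a made-up HTML snippet. The markup, ids, class names, and `data-kind` attribute are hypothetical and not taken from sancar.org. It shows a search by tag name, by id, by class, and by an arbitrary attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
sample_html = """
<div id="content">
  <p class="intro">Hello</p>
  <ul>
    <li data-kind="first">First item</li>
    <li data-kind="second">Second item</li>
  </ul>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# By tag name: every <li> tag in the document.
by_tag = soup.find_all("li")

# By id: the single element whose id is "content".
by_id = soup.find(id="content")

# By class: class_ has a trailing underscore because class is a Python keyword.
by_class = soup.find_all("p", class_="intro")

# By any other attribute: pass a dict through attrs.
by_attr = soup.find_all(attrs={"data-kind": "second"})

print(len(by_tag))
print(by_id.name)
print(by_class[0].text)
print(by_attr[0].text)
```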

  • First, we are going to grab the website’s title by its class name. Before that, let’s see what the website looks like!
sancar.org
  • We need to grab the website title “AZIZ SANCAR / BIO”. Now, let’s look at the title tag and its class name.
  • Right-click on the page and view the website’s HTML source
HTML title
  • We found the title: it sits in an <h1> tag with the class “entry-title”. We are going to use a BeautifulSoup method called “find_all” to reach the h1 tag. It searches the whole HTML content for h1 tags with that class name, and we store the result in the “title” variable.
  • You can see our result below.
h1 tag result
  • Now we need the plain text of the title, Aziz Sancar, without the tag around it. We are going to use the “text” attribute for that.
  • The find_all method returns a list, so we need to use an index to get the text of a single element. As you can see below, we were able to print the entry title's text this way.
text method
  • Now we can grab the p tags in the HTML content, this time without a class name. We will use find_all and a for loop to print all of their content without the HTML tags. We will also use an index counter, the “i” variable.
p tag’s content
  • We did it. We now have the text of every p tag in the HTML content, with the tags stripped. You have scraped that data from a real website, and you can handle it however you want: save it to your database, do text mining, or anything else.
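
The title and paragraph steps above can be sketched as follows. To keep the example runnable without network access, it parses a small stand-in HTML string instead of the live page. Only the `entry-title` class, the `title` variable, and the counter `i` come from the article; the paragraph text is placeholder content and `html.parser` is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page (placeholder content, not the real bio).
sample_html = """
<html><body>
  <h1 class="entry-title">Aziz Sancar / Bio</h1>
  <p>First <b>placeholder</b> paragraph.</p>
  <p>Second placeholder paragraph.</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like result, even when only one tag matches.
title = soup.find_all("h1", class_="entry-title")
print(title)

# Index into the result, then read .text to drop the surrounding tag.
title_text = title[0].text
print(title_text)

# Collect the text of every <p> tag, using i as the index counter.
paragraph_texts = []
i = 0
for p in soup.find_all("p"):
    # .text also strips nested tags such as <b>.
    paragraph_texts.append(p.text)
    print(i, p.text)
    i += 1
```

Looping with `enumerate(soup.find_all("p"))` would be a more idiomatic way to get the counter; the manual `i` is kept here to match the description above.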

As you can see from our basic project, BeautifulSoup is really useful and easy to use. You can develop more powerful scripts or projects with it. It is a really great technique for collecting data.

I will share all of the project’s sources below; you can practice with them or improve on them.

Thank you for everything!

Enjoy Your Code!

Enver ŞANLI — Web Developer, Social Thinker, and Farmer

Resources :
