Web Scraping in Python with BeautifulSoup
Hello everyone!
Web development keeps growing every day, and new web apps are built with new technologies and techniques all the time. Today we will talk about one of those techniques, using Python.
Python is a popular language with a strong community. It offers many techniques for web scraping, along with solid documentation and libraries for the job. We are going to build a basic project that scrapes data from a website. I suggest Colab (Google Colaboratory) for developing Python projects easily; you do not need to install anything else. You can find it in your Google Drive.
First, I will show you how to create a Python project in Colab.
- Go to your Google Drive.
- Click the “New” button.
- Then find the “More” option and choose “Google Colaboratory”.
- You have now created your first Colab project for developing Python.
- You are ready to develop, so let’s start scraping some data!
We are going to import two libraries: requests and BeautifulSoup. BeautifulSoup is a useful scraping library that lets us pick out HTML tags from HTML content.
- First we import the requests library, which we will use to load a website’s HTML content.
- After that, we import BeautifulSoup.
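The two imports described above might look like the cell below. This is a minimal sketch that assumes the `requests` and `beautifulsoup4` packages are available in your environment; the version printout is only an illustrative sanity check and is not part of the original project.

```python
# Load the HTTP client and the HTML parser used in the rest of the article.
import requests
import bs4
from bs4 import BeautifulSoup

# Optional sanity check: confirm both libraries imported correctly.
print("requests version:", requests.__version__)
print("bs4 version:", bs4.__version__)
```

If either import fails outside Colab, installing the packages with `pip install requests beautifulsoup4` should be enough.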
Now we are ready to get a web page.
Our HTML content will come from “http://sancar.org/aziz-sancar/”. We will pull that bio page into our Python project as HTML. Then we will be able to scrape it.
- Send a request to the website mentioned above.
- If you print the “r” variable, you will see <Response [200]> as the result, which means our request succeeded.
- Now let’s parse the HTML tags with BeautifulSoup.
- We now have our HTML tags from BeautifulSoup and our HTML content is ready. We are going to print it to see the HTML content.
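The request and parsing steps above can be sketched roughly as follows. The helper name `fetch_soup`, the `timeout` value, and the choice of the built-in `"html.parser"` are assumptions made for this example; the original project may have called `requests.get` and `BeautifulSoup` directly in separate cells.

```python
import requests
from bs4 import BeautifulSoup

URL = "http://sancar.org/aziz-sancar/"  # the page named in the article


def fetch_soup(url, timeout=10):
    """Download a page and return the response plus its parsed HTML.

    Hypothetical helper for illustration, not a library function.
    """
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()  # raise an exception on 4xx/5xx responses
    soup = BeautifulSoup(r.content, "html.parser")
    return r, soup


# In a Colab cell you would call it like this (needs network access):
# r, soup = fetch_soup(URL)
# print(r)                # <Response [200]> when the request succeeds
# print(soup.prettify())  # the parsed HTML, indented for easier reading
```

The calls are left commented out so the cell also runs without network access; uncomment them to fetch the real page.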
As you can see above, we printed the HTML content that BeautifulSoup created. The output contains div, paragraph, and list tags. We can reach any of them by tag name, id, class, or any other attribute. Let’s see how we do that.
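To make the four lookup styles concrete, here is a small sketch that runs on an inline HTML string instead of the real page. The markup, ids, and attribute names in `sample_html` are made up for illustration and are not taken from sancar.org.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not copied from the real page.
sample_html = """
<div id="content" class="post">
  <h1 class="entry-title">Sample Title</h1>
  <p>First paragraph.</p>
  <a href="/about" data-role="nav">About</a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

by_tag = soup.find("p")                               # first <p> tag
by_id = soup.find(id="content")                       # tag with id="content"
by_class = soup.find("h1", class_="entry-title")      # <h1> with that class
by_attr = soup.find("a", attrs={"data-role": "nav"})  # match on any attribute

print(by_tag.get_text())
print(by_id.name)
print(by_class.get_text())
print(by_attr["href"])
```

Note the trailing underscore in `class_`: `class` is a reserved word in Python, so BeautifulSoup uses `class_` as the keyword argument.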
- First, we are going to catch the website’s title by its class name. Before that, let’s see how the website looks.
- We need to catch the page title “AZIZ SANCAR / BIO”. Now let’s find the title tag and its class name.
- Right-click on the page and view the website’s HTML source.
- We found the title’s class, “entry-title”, on an <h1> tag. We are going to use a BeautifulSoup method called “find_all” to reach that h1 tag. It searches the whole HTML content for h1 tags with that class name, and we store the result in the “title” variable.
- The result is a list that holds the matching h1 tag.
- Now we need the plain text of the title, such as “Aziz Sancar”, without the tag. We are going to use the “text” attribute for that.
- The find_all method returns a list, so we need an index to get the text of a single element. With that index, we were able to print the entry title's text.
- We can also catch the p tags in the HTML content, this time without a class name. We will use find_all and a for loop to print all of their content without HTML tags. We will also use an index counter, the “i” variable.
- We did it. We now have the text of every p tag, cleaned of HTML. You pulled all of that data from a real website, and you can manage it however you want: save it to your database, do text mining, or anything else.
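The title and paragraph steps above can be sketched as follows. To keep the example runnable offline, it parses a short stand-in string; `sample_html` and its text are hypothetical, and with the real page you would use the soup built from the downloaded response instead. Using `enumerate` for the “i” counter is one possible way to write the loop, not necessarily the original one.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
sample_html = """
<h1 class="entry-title">Aziz Sancar / Bio</h1>
<p>First paragraph of the <b>bio</b>.</p>
<p>Second paragraph of the bio.</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list of every <h1> carrying the "entry-title" class.
title = soup.find_all("h1", class_="entry-title")
print(title)

# Index into the list, then read .text to drop the surrounding tag.
title_text = title[0].text
print(title_text)

# Collect the text of every <p> tag, with "i" as the index counter.
paragraph_texts = []
for i, p in enumerate(soup.find_all("p")):
    paragraph_texts.append(p.text)
    print(i, p.text)
```

Because `.text` also flattens nested tags such as `<b>`, the collected strings are plain text ready to be stored or processed further.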
As you saw in our basic project, BeautifulSoup is really useful and easy to use. You can develop more powerful scripts or projects with it. It is a really great technique for collecting data.
I will share all of the project’s sources below; you can practice on them or improve them further.
Thank you for everything!
Enjoy your code!
Enver ŞANLI — Web Developer, Social Thinker, and Farmer
Resources: