Web Scraping using Python! Create your own Dataset

Machine Learning requires a lot of data, and it is not always easy to get the data you want. Have you ever wondered how Kaggle and other such websites provide us with huge datasets? Often, the answer is web scraping. So, let us see how we can extract data from the web.
Let’s assume we are building a model that requires movie information such as the title, summary, and rating of a number of movies. When it comes to movies, IMDb has one of the largest databases. Let us dig into it.

What exactly do we do to scrape a webpage?

There’s a pattern in everything. We need to observe the HTML code of the web page and find a pattern that lets us extract the relevant data. Let’s go step by step. We will do everything in Python and scrape the data from the following URL:
https://www.imdb.com/search/title?release_date=2019&sort=user_rating,desc&ref_=adv_nxt
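
Before scraping, it helps to see what this URL is made of. The sketch below uses only Python's standard library to split the query string into its parameters and rebuild the URL with one of them changed. The value 2018 is purely illustrative, and whether IMDb accepts other values for these parameters is an assumption based on their names.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

url = "https://www.imdb.com/search/title?release_date=2019&sort=user_rating,desc&ref_=adv_nxt"

# Split the URL into scheme, host, path, query and fragment
parts = urlsplit(url)

# Turn the query string into a dict of parameter -> value
params = dict(parse_qsl(parts.query))
print(params)

# Change one parameter (2018 is just an illustrative value) and rebuild the URL.
# safe="," keeps the comma in "user_rating,desc" unescaped.
params_2018 = {**params, "release_date": "2018"}
url_2018 = urlunsplit(parts._replace(query=urlencode(params_2018, safe=",")))
print(url_2018)
```

Building the URL from a dictionary like this makes it easier to loop over several years or sort orders later without editing strings by hand.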

1. Install dependencies

# To download the webpage
pip install requests
# To scrape data from the downloaded webpage
pip install beautifulsoup4
# Parser used by BeautifulSoup in the code below
pip install lxml

2. Download the webpage

“Requests” is a popular HTTP library for Python. We will use it to download the webpage at the given URL.

import requests
url = "https://www.imdb.com/search/title?release_date=2019&sort=user_rating,desc&ref_=adv_nxt"
# get() sends an HTTP GET request and returns a Response object
response = requests.get(url)
# Raise an error if the server answered with a 4xx/5xx status code
response.raise_for_status()
# Get the HTML of the page as text from the response object
response_text = response.text
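
A live request can hang, fail, or be refused, and some sites reject requests that carry the default python-requests User-Agent. The helper below is a minimal sketch of a more defensive download. The name `download_page`, the User-Agent string, and the `get` parameter (there only so the function can be tried without network access) are my own choices, not part of Requests.

```python
import requests

# Hypothetical header value; adjust it to identify your own scraper
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; my-scraper/0.1)"}

def download_page(url, timeout=10, get=requests.get):
    """Return the HTML text of `url`, raising an exception if the request failed."""
    # timeout stops the call from waiting forever on an unresponsive server
    response = get(url, headers=HEADERS, timeout=timeout)
    # Turn 4xx/5xx answers into an exception instead of silently parsing an error page
    response.raise_for_status()
    return response.text
```

With this in place, `response_text = download_page(url)` would replace the two download lines above.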

3. Inspecting elements and finding the pattern

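The HTML we have downloaded is the page's source, the same markup you can explore by right-clicking the page and choosing Inspect in the browser. (The Inspect panel shows the page after any JavaScript has run, so it can differ slightly from the raw download.) Let’s right-click on the rating and see how we can extract it.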

When we look closely, we see that the element with the class “ratings-bar” contains the rating of the movie. If we inspect other movies, we find that every movie on the page uses the same class name for its rating. This is the pattern we need to extract all the ratings from the page. Similarly, we can extract the summary, title, genre, etc.

Besides classes, you can also select a specific part of the HTML code using ids, tag names, and so on.

Let’s jump into the code!

BeautifulSoup allows us to extract data (more precisely, parse data) from HTML using class names, ids, tags, etc. Isn’t it Beautiful? :-D

from bs4 import BeautifulSoup
# Create a BeautifulSoup object
# response_text -> The downloaded webpage
# 'lxml' -> The parser used to process the HTML (needs the lxml package)
soup = BeautifulSoup(response_text, 'lxml')
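
If lxml is not installed, BeautifulSoup raises `FeatureNotFound` when asked for that parser. The sketch below falls back to Python's built-in `html.parser` in that case; `make_soup` and the sample page are my own illustrative names and data.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when available, otherwise with the built-in html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")

# Small hand-written page, used only to check that parsing works
sample_html = "<html><head><title>Sample page</title></head><body><p>Hello</p></body></html>"
soup_sample = make_soup(sample_html)

# Tags can be reached as attributes of the soup object
print(soup_sample.title.get_text())
print(soup_sample.p.get_text())
```

The two parsers can treat badly broken HTML differently, so it is worth sticking to one of them for a given project.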

To select content from the page we use CSS selectors. CSS selectors allow us to select elements by class, id, tag name, and more. The prefix for a class is "." and the prefix for an ID is "#": to select by class, we put a "." in front of the class name, and to select by ID, we put a "#" in front of the ID.

# As we saw, the rating's class name was "ratings-bar",
# so we prefix "." since it's a class
rating_class_selector = ".ratings-bar"
# Extract all the elements with the ratings class
rating_list = soup.select(rating_class_selector)
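
To see class, id, and tag selectors side by side, here is a small sketch on a hand-written page. The ids and class names in it are invented for illustration and do not come from IMDb.

```python
from bs4 import BeautifulSoup

# Tiny hand-written page; the ids and classes are invented for illustration
html = """
<div id="main">
  <h3 class="title">First</h3>
  <p class="summary">A short summary.</p>
  <h3 class="title">Second</h3>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

# "." selects by class: every element with class="title"
titles = [h.get_text() for h in demo.select(".title")]
# "#" selects by id; select_one() returns the first match instead of a list
main = demo.select_one("#main")
# A bare name selects by tag
paragraphs = demo.select("p")
# Selectors can be combined: <h3 class="title"> elements inside #main
nested = demo.select("#main h3.title")

print(titles)
```

`select()` always returns a list, even for a single match, while `select_one()` returns one element or `None`.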

“rating_list” is a list of objects, one for each <div> element with the class name “ratings-bar”. We need to get the text from within each div element.

Here’s what a single rating object looks like:

<div class="ratings-bar">
<div class="inline-block ratings-imdb-rating" data-value="10" name="ir">
<span class="global-sprite rating-star imdb-rating"></span>
<strong>10.0</strong>
</div>
...
</div>

We need to get the rating value from the <strong> tag. We can find a tag using the find(‘tagName’) method and get its text using getText() (an alias of get_text()).

# This list will store all the ratings
ratings = []
# Iterate through all the rating objects
for rating_object in rating_list:
    # Find the <strong> tag; it may be missing if a movie has no rating yet
    strong_tag = rating_object.find('strong')
    if strong_tag is None:
        continue
    # Get the text and append the rating to the list
    ratings.append(strong_tag.getText())
print(ratings)
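
The ratings collected this way are strings. For a model you will usually want numbers, so here is a small sketch that converts one rating to a float, using the rating markup shown earlier (without its elided part). It also reads the same value from the `data-value` attribute; that the attribute always mirrors the visible rating is an assumption based on this one sample.

```python
from bs4 import BeautifulSoup

# The rating markup shown above, without the elided part
html = """
<div class="ratings-bar">
<div class="inline-block ratings-imdb-rating" data-value="10" name="ir">
<span class="global-sprite rating-star imdb-rating"></span>
<strong>10.0</strong>
</div>
</div>
"""
bar = BeautifulSoup(html, "html.parser").select_one(".ratings-bar")

# Text of the <strong> tag, stripped of whitespace and converted to a number
rating_value = float(bar.find("strong").get_text(strip=True))

# Attributes of a tag are read like dictionary keys
inner = bar.select_one(".ratings-imdb-rating")
attr_value = float(inner["data-value"])

print(rating_value, attr_value)
```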

And we are done. Similarly, you can extract the title, summary, and genre using the above method with the appropriate class names and tag names.
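
When you collect several fields, it is safer to loop over each movie's container and read every field inside it, so that a movie with a missing field does not shift the other lists out of step. The sketch below shows the idea on hand-written HTML. Only "ratings-bar" comes from the page we inspected; "movie", "movie-title", and "movie-summary" are hypothetical class names that you would replace with the ones you find on the real page.

```python
from bs4 import BeautifulSoup

# Hand-written sample. Only "ratings-bar" comes from the page inspected above;
# "movie", "movie-title" and "movie-summary" are hypothetical class names.
html = """
<div class="movie">
  <h3 class="movie-title"><a href="#">Movie A</a></h3>
  <div class="ratings-bar"><strong>9.1</strong></div>
  <p class="movie-summary">Summary of movie A.</p>
</div>
<div class="movie">
  <h3 class="movie-title"><a href="#">Movie B</a></h3>
  <p class="movie-summary">Summary of movie B.</p>
</div>
"""
page = BeautifulSoup(html, "html.parser")

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

# One dictionary per movie keeps title, rating and summary together
movies = []
for item in page.select(".movie"):
    movies.append({
        "title": text_or_none(item, ".movie-title a"),
        "rating": text_or_none(item, ".ratings-bar strong"),
        "summary": text_or_none(item, ".movie-summary"),
    })

print(movies)
```

The second movie has no rating, so its "rating" is `None` rather than silently borrowing the next movie's value.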

You can store the data in a CSV or Excel file and use it for your Machine Learning model.
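
As a minimal sketch of the CSV option, the standard library's csv module is enough. The rows, the `write_movies` helper, and the file name `movies.csv` are illustrative; in practice the rows would come from your scraping loop.

```python
import csv
import io

# Illustrative rows; in practice these come from the scraping loop
rows = [
    {"title": "Movie A", "rating": "9.1"},
    {"title": "Movie B", "rating": "8.7"},
]

def write_movies(rows, file_obj):
    """Write the rows as CSV, with a header line, to an open text file."""
    writer = csv.DictWriter(file_obj, fieldnames=["title", "rating"])
    writer.writeheader()
    writer.writerows(rows)

# Here we write to an in-memory buffer so the result can be printed directly
buffer = io.StringIO()
write_movies(rows, buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open("movies.csv", "w", newline="", encoding="utf-8")` and pass the file object to `write_movies`; `newline=""` lets the csv module control the line endings.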

Full Code present on my Github: