Scraping websites can be finicky. You're at the whim of the content creator's markup decisions. Some sites break each entry into individual `p` tags, while others have their entire movie list formatted within a single `span` element. This is a problem.

To get started, install Selenium:

```
pip install selenium
```

I highly recommend taking advantage of a virtualenv and creating an isolated Python environment.
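For example, a minimal setup might look like this (the environment name `venv` is arbitrary):

```
python -m venv venv
source venv/bin/activate  # on Windows: venv\Scripts\activate
pip install selenium
```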
```python
import json
import time

from selenium import webdriver

driver = webdriver.Chrome('chromedriver')  # 1
driver.get('https://www.elacervo.com/directores')  # 2

# 3
for i in range(4):
    time.sleep(5)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# 4
directors = driver.find_elements_by_css_selector(
    "a[href*='https://www.elacervo.com/post/']"
)

# 5
unique_directors = []
for link in directors:
    if link.get_attribute("href") not in unique_directors:
        unique_directors.append(link.get_attribute("href"))

# 6
names = []
for link in unique_directors:
    slug = link.split('/')[-1]
    name = slug.replace('-', ' ').title()
    names.append({"name": name})

# 7
with open('directors.json', 'w') as outfile:
    json.dump(names, outfile)

# 8
driver.quit()
```
1. `driver = webdriver.Chrome('chromedriver')` - This is where we are telling Selenium to spawn a new Google Chrome instance. The value, `chromedriver`, that we are passing to the `.Chrome()` method is the location of the chromedriver file we downloaded in the previous step.
2. `driver.get('https://www.elacervo.com/directores')` - Here we are telling our newly made Selenium driver to navigate to the URL https://www.elacervo.com/directores.
3. The directors page loads more content as you scroll, so this loop scrolls to the bottom of the page four times, sleeping five seconds between scrolls to give new content time to load.
4. This grabs every `a` tag with an `href` that contains https://www.elacervo.com/post/. This is using the logic `href*=`, where `*` acts as a wildcard so the match can appear anywhere within the attribute.
5. This reads each link's `href` source. It's then placing the URL into a `unique_directors` list. Some of the directors on this page have their link twice, so I'm removing any duplicate URLs.
6. Each director's URL looks like https://www.elacervo.com/post/martin-scorsese. The logic here is taking everything after the last `/` character, replacing `-`'s with spaces, and then capitalizing the first letter of each word within their names.
7. Here I'm using `json.dump` to write the gathered director names into a `json` file for quicker use later on. Reading from a `json` file is much quicker than spawning a browser to click around and extract data.
8. `driver.quit()` - This closes the Selenium Chrome instance.
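As an aside, the calls above use the older Selenium 3 style API. If you're on Selenium 4 or newer, the positional driver path and the `find_elements_by_*` helpers have been removed; a rough sketch of the equivalents (assuming a recent Selenium that can resolve chromedriver on its own):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4.6+ locates chromedriver automatically
driver.get('https://www.elacervo.com/directores')

# same attribute-substring selector as step 4 above
directors = driver.find_elements(
    By.CSS_SELECTOR, "a[href*='https://www.elacervo.com/post/']"
)
```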
With the director names saved, the next script uses the IMDbPY package (`pip install imdbpy`) to pull each director's filmography:

```python
import json

from imdb import IMDb

file = open('directors.json')
directors = json.load(file)

movies = []
ia = IMDb()

for person in directors:
    try:
        director = ia.search_person(person['name'])[0]
        try:
            films = ia.get_person_filmography(director.personID)['data']['filmography']['director']
            for film in films:
                if film['kind'] == 'movie':
                    try:
                        if film['year']:
                            movies.append(film)
                    except KeyError:
                        continue
        except AttributeError:
            continue
    except IndexError:
        continue

with open('movies.json', 'w') as outfile:
    json.dump([{"title": movie['title'], 'year': movie['year']} for movie in movies], outfile)
```
First, we open the `directors.json` file we created in the Selenium section. Then, using Python's JSON decoder, we can load data from the file into a usable JSON format.

`.search_person(person['name'])` returns a list of people IMDb has within their database. It appears the first result in the returned list is the most popular, which is the reasoning behind the `[0]`. For this project, I'm making the assumption that this is the director I want to work with.

`Movie` objects' properties can be seen documented here. For this project, I'm just interested in movies, so I apply a conditional to check, appending the accepted films to a `movies` list.

Each film has the property `year` if the movie has been released; otherwise, it has the property `status`. I only want movies that are watchable now, and filter the data accordingly:

```python
movies = []

if film['kind'] == 'movie':
    try:
        if film['year']:
            movies.append(film)
    except KeyError:
        continue

with open('movies.json', 'w') as outfile:
    json.dump([{"title": movie['title'], 'year': movie['year']} for movie in movies], outfile)
```
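If you want to sanity-check the first-result assumption before running the whole loop, it's quick to inspect a search in a REPL; a small sketch (the name here is just an example):

```python
from imdb import IMDb

ia = IMDb()
results = ia.search_person('Martin Scorsese')

# the first entry is typically the best-known match
for person in results[:3]:
    print(person.personID, person['name'])
```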
Finally, picking a movie is just a matter of loading the `movies.json` file and grabbing a random entry:

```python
import random
import json

file = open('movies.json')
data = json.load(file)

print(random.choice(data))
```
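Each run prints a single dictionary from the list, e.g. something like `{'title': 'Goodfellas', 'year': 1990}` (illustrative; the actual pool depends on which directors the scrape picked up).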