CSS selectors are one of a web scraper's best friends. This tutorial explains what they are, their pros and cons, and why they matter from a web scraping perspective, with Python examples to get you going.

This blog post is not a complete CSS selectors reference, but a mini guided tour of frequently used types of selectors and how to work with them.
This post covers CSS selectors when doing web scraping, and which tools might be handy to use in addition to the Python beautifulsoup and lxml libraries. CSS selectors in different languages, frameworks, and packages are not much different.

pip install lxml
pip install beautifulsoup4
A basic knowledge of the bs4 library, or whatever HTML parser package/framework you're using, is assumed. Install the libraries above if you want to try CSS selectors on your own. If you don't want to do that, installing these libraries is not required.

CSS selectors are patterns used to select (match) the element(s) you want to extract from an HTML page.

SelectorGadget lets you find CSS selector(s) by clicking on the desired element in your browser, and returns the matching CSS selector(s). SelectorGadget is an open-source tool that makes CSS selector generation and discovery on complicated sites a breeze.
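SelectorGadget can be used for web page scraping with tools such as Nokogiri and BeautifulSoup, to generate jQuery selectors for dynamic sites, and for selenium or phantomjs testing.

CSS selectors can match HTML elements by their:
type, for example <input>
.class
#id
[attribute]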
Syntax: element_name
soup.select('a') # returns all <a> elements
soup.select('span') # returns all <span> elements
soup.select('input') # returns all <input> elements
soup.select('script') # returns all <script> elements
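To try type selectors without hitting a real site, here is a minimal sketch that runs them against a small HTML string. The markup is made up for illustration, and "html.parser" is used only so the snippet runs without lxml installed; "lxml" should behave the same way here.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<div>
  <a href="/first">First</a>
  <span>Some text</span>
  <a href="/second">Second</a>
  <input type="text" name="q">
</div>
"""

# "html.parser" ships with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

links = soup.select("a")        # every <a> element
spans = soup.select("span")     # every <span> element
inputs = soup.select("input")   # every <input> element
scripts = soup.select("script") # empty list: no <script> in this HTML

print([a["href"] for a in links])
print(spans[0].text)
print(inputs[0]["name"])
```

Note that select() always returns a list, even when nothing matches, so it is safe to loop over the result.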
Syntax: .class_name
A class selector starts with a dot (.), a bit like calling a method on a class: PressF().when_playing_cod().
soup.select('.mt-5') # returns all elements with the class mt-5
soup.select('.crayons-avatar__image') # returns all elements with the class crayons-avatar__image
soup.select('.w3-btn') # returns all elements with the class w3-btn
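A small sketch of class selectors on made-up HTML (the class names are borrowed from the examples above, the markup itself is an assumption). It also shows that chaining two classes without a space only matches elements that carry both.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<button class="w3-btn mt-5">Save</button>
<button class="w3-btn">Cancel</button>
<p class="mt-5">Footer note</p>
"""
soup = BeautifulSoup(html, "html.parser")

# .class_name matches any element whose class attribute contains that class
buttons = [el.text for el in soup.select(".w3-btn")]
spaced = [el.name for el in soup.select(".mt-5")]

# Chained classes (no space in between) require both classes on one element
both = [el.text for el in soup.select(".w3-btn.mt-5")]

print(buttons, spaced, both)
```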
Syntax: #id_value
An ID selector matches an element based on the value of its id attribute. In order for the element to be selected, its id attribute must match exactly the value given in the selector.
soup.select('#eob_16') # returns all elements with the id eob_16
soup.select('#notifications-link') # returns all elements with the id notifications-link
soup.select('#value_hover') # returns all elements with the id value_hover
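To illustrate the "must match exactly" part, here is a minimal sketch on made-up HTML with two similar ids. It uses select_one(), which returns the first match or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML with two similar ids, only for illustration
html = """
<nav>
  <a id="notifications-link" href="/notifications">Notifications</a>
  <a id="notifications-link-2" href="/other">Other</a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

# Only the element whose id is exactly "notifications-link" is matched
exact = soup.select("#notifications-link")

# select_one() returns the first matching element, or None
link = soup.select_one("#notifications-link")
missing = soup.select_one("#notifications")  # partial id: no match

print(len(exact), link["href"], missing)
```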
Syntax: [attribute=attribute_value] or [attribute]
Attribute selectors are written in square brackets [] instead of a dot (.) as for a class, or a hash (or octothorpe) symbol (#) as for an ID.
soup.select('[jscontroller="K6HGfd"]') # returns all elements whose jscontroller attribute equals "K6HGfd"
soup.select('[data-ved="2ascASqwfaspoi_SA8"]') # returns all elements whose data-ved attribute equals the given value
# elements that have a data-id attribute, whatever its value
soup.select('[data-id]')
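Here is a short sketch of both attribute forms, [attribute] and [attribute=value], on made-up HTML. The data-id and data-kind attributes are assumptions chosen for the example.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<ul>
  <li data-id="101" data-kind="song">Song A</li>
  <li data-id="102" data-kind="album">Album B</li>
  <li data-kind="song">Song C (no data-id)</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# [attribute]: the attribute only has to be present
with_id = [li.text for li in soup.select("[data-id]")]

# [attribute=value]: the attribute value has to match exactly
songs = [li.text for li in soup.select('[data-kind="song"]')]

# A type selector can be combined with an attribute selector
album = soup.select_one('li[data-id="102"]').text

print(with_id, songs, album)
```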
Syntax: element, element, element, ...
A list of CSS selectors is great (in my opinion) for handling different HTML layouts, because if any one of the selectors is present, it will grab all elements matched by that selector.
# returns all elements matched by either of these selectors
soup.select('#kp-wp-tab-Albums .PZPZlf, .keP9hb')
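The idea of covering several layouts with one selector list can be sketched like this. Both HTML layouts, the selector, and the titles() helper are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Two made-up layouts that present the same piece of data differently
layout_a = '<div id="albums"><span class="title">Album 1</span></div>'
layout_b = '<div class="card-title">Album 1</div>'

# One selector list covering both layouts
SELECTOR = "#albums .title, .card-title"


def titles(html, selector=SELECTOR):
    """Return the text of every element matched by any selector in the list."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(selector)]


print(titles(layout_a))
print(titles(layout_b))
```

The same call works for either layout, so the scraping code does not need an if/else branch per layout.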
Syntax: selector1 selector2
The descendant combinator is typically represented by a single space (" ") character and combines two selectors such that elements matched by the second selector are selected if they have an ancestor (parent, parent's parent, parent's parent's parent, etc.) element matching the first selector.
soup.select('.NQyKp .REySof') # dives inside .selector -> dives again into another .selector and grabs it
soup.select('div cite.iUh30') # dives inside div -> dives inside cite.selector and grabs it
soup.select('span#Xy21 a.XZx2') # dives inside span#id -> dives inside a.selector and grabs it
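A minimal sketch of the descendant combinator on made-up HTML. For contrast it also uses the child combinator (>), which is not covered in this post and only matches direct children.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one <cite> nested two levels inside a <div>, one outside
html = """
<div class="result">
  <section>
    <cite class="url">example.com</cite>
  </section>
</div>
<cite class="url">outside.example</cite>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): any depth below a <div>
nested = [c.text for c in soup.select("div cite.url")]

# Without the ancestor part, both <cite> elements match
all_cites = [c.text for c in soup.select("cite.url")]

# Child combinator (>): the <cite> is a grandchild here, so nothing matches
direct = soup.select("div > cite.url")

print(nested, all_cites, len(direct))
```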
Other useful CSS selectors:
:nth-child(n): selects every element that is the n-th child of its parent.
:nth-of-type(n): selects every element that is the n-th element of its type within its parent.
a:has(img): selects every <a> element that contains an <img> element.

More CSS selectors can be found in W3C Selectors Level 4, the W3Schools CSS Selectors Reference, and the MDN documentation.

To test a selector, type the CSS selector(s) in the SelectorGadget window and see which elements are being selected. In the DevTools Console, the same check looks like this:
$$(".DKV0Md")
This is a shortcut for the document.querySelectorAll(".selector") method (according to the Chrome Developers website):
document.querySelectorAll(".DKV0Md")
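The same kind of quick check can be done in Python before writing a scraper. The sketch below tries the :nth-child(n), :nth-of-type(n), and :has() selectors against made-up HTML and collects what each one matches. It assumes a reasonably recent beautifulsoup4, where select() supports these pseudo-classes.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustration
html = """
<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>
<div>
  <h2>Heading</h2>
  <p>first paragraph</p>
  <p>second paragraph</p>
</div>
<a href="/with-image"><img src="pic.png"></a>
<a href="/text-only">text</a>
"""
soup = BeautifulSoup(html, "html.parser")

# <li> that is the 2nd child of its parent
second_li = [li.text for li in soup.select("li:nth-child(2)")]

# <p> that is the 2nd child of its parent (the <h2> is the 1st child)
second_child_p = [p.text for p in soup.select("p:nth-child(2)")]

# <p> that is the 2nd <p> of its parent (other tag names are not counted)
second_p = [p.text for p in soup.select("p:nth-of-type(2)")]

# <a> elements that contain an <img>
image_links = [a["href"] for a in soup.select("a:has(img)")]

print(second_li, second_child_p, second_p, image_links)
```

The difference between the two middle selectors is the part that usually trips people up: :nth-child counts all sibling elements, :nth-of-type counts only siblings with the same tag name.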
A more robust approach (compared to class-based CSS selectors) would be to use selectors such as attribute selectors (mentioned above), since they are likely to change less frequently. See the attribute selector examples in the screenshot (HTML from Google Organic results).

Websites can change CSS selectors for every change that is made to a certain style component, which means that relying exclusively on them is not a good idea. But again, it depends on how often they really change.

When they do change, you have to update the CSS selector(s) to keep the code running properly. This seems like not a big deal, which is true, but it might be annoying if selectors change frequently.

Extracting the title, link, displayed link, and snippet from Google organic results using a CSS container selector:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}
html = requests.get("https://www.google.com/search?q=minecraft", headers=headers)
soup = BeautifulSoup(html.text, "lxml")
for result in soup.select(".tF2Cxc"):
    title = result.select_one(".DKV0Md").text
    link = result.select_one(".yuRUbf a")["href"]
    displayed_link = result.select_one(".lEBKkf span").text
    snippet = result.select_one(".lEBKkf span").text
    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")
# part of the output
'''
Log in | Minecraft
https://minecraft.net/login
https://minecraft.net › login
Still have a Mojang account? Log in here: Email. Password. Forgot your password? Login. Mojang © 2009-2021. "Minecraft" is a trademark of Mojang AB.
What is Minecraft? | Minecraft
https://www.minecraft.net/en-us/about-minecraft
https://www.minecraft.net › en-us › about-minecraft
Prepare for an adventure of limitless possibilities as you build, mine, battle mobs, and explore the ever-changing Minecraft landscape.
'''
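Since selectors can change, a call like result.select_one(".some-class").text raises an AttributeError as soon as the selector stops matching, because select_one() returns None. One way to soften that, sketched below on made-up HTML, is a small helper that returns None instead of crashing. The text_or_none() helper and the class names are hypothetical and not part of bs4.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match inside parent, or None."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# Made-up HTML: the second result has no snippet element
html = """
<div class="result">
  <h3 class="title">First result</h3>
  <span class="snippet">Has a snippet.</span>
</div>
<div class="result">
  <h3 class="title">Second result</h3>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for result in soup.select(".result"):
    rows.append({
        "title": text_or_none(result, ".title"),
        "snippet": text_or_none(result, ".snippet"),  # None when missing
    })

print(rows)
```

A None in the output is then a visible hint that a selector needs updating, rather than a traceback halfway through the loop.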
Test the .post-card-title CSS selector in the DevTools Console:
$$(".post-card-title")
(7) [h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title]
0: h2.post-card-title
1: h2.post-card-title
2: h2.post-card-title
3: h2.post-card-title
4: h2.post-card-title
5: h2.post-card-title
6: h2.post-card-title
length: 7
[[Prototype]]: Array(0)
import requests, lxml
from bs4 import BeautifulSoup
html = requests.get("https://serpapi.com/blog/")
soup = BeautifulSoup(html.text, "lxml")
for title in soup.select(".post-card-title"):
    print(title.text)
'''
Scrape Google Carousel Results with Python
SerpApi’s YouTube Search API
DuckDuckGo Search API for SerpApi
Extract all search engines ad results at once using Python
Scrape Multiple Google Answer Box Layouts with Python
SerpApi’s Baidu Search API
How to reduce the chance of being blocked while web scraping search engines
'''
Test the CSS selector with either SelectorGadget or DevTools Console, then extract post titles and links from dev.to:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}
html = requests.get("https://dev.to/", headers=headers)
soup = BeautifulSoup(html.text, "lxml")
for result in soup.select(".crayons-story__title"):
    title = result.text.strip()
    link = f'https://dev.to{result.a["href"].strip()}'
    print(title, link, sep="\n")
# part of the output:
'''
How to Create and Publish a React Component Library
https://dev.to/alexeagleson/how-to-create-and-publish-a-react-component-library-2oe
A One Piece of CSS Art!
https://dev.to/afif/a-one-piece-of-css-art-225l
Windster - Tailwind CSS admin dashboard interface [MIT License]
https://dev.to/themesberg/windster-tailwind-css-admin-dashboard-interface-mit-license-3lb6
'''
CSS selectors are pretty easy and straightforward to understand. It's just a matter of practice and trial and error (programming in a nutshell 💻).