To get started, install Playwright and download the browser binaries:

pip install playwright
playwright install
The response parameter contains the status, URL, and the content itself. And that's what we'll be using instead of directly scraping content from the HTML using CSS selectors.

page.on("response", lambda response: print(
    "<<", response.status, response.url))
from playwright.sync_api import sync_playwright

url = "https://www.auction.com/residential/ca/"

with sync_playwright() as p:
    browser = p.firefox.launch()
    page = browser.new_page()
    page.on("response", lambda response: print(
        "<<", response.status, response.url))
    page.goto(url, wait_until="networkidle", timeout=90000)
    print(page.content())
    page.context.close()
    browser.close()
auction.com will load an HTML skeleton without the content we are after (house prices or auction dates). It will then load several resources such as images, CSS, fonts, and JavaScript. If we wanted to save some bandwidth, we could filter out some of those; a sketch for that follows the log below. For now, we're going to focus on the interesting part: the search API call, which we can single out with

if ("v1/search/assets?" in response.url)
<< 407 https://www.auction.com/residential/ca/
<< 200 https://www.auction.com/residential/ca/
<< 200 https://cdn.auction.com/residential/page-assets/styles.d5079a39f6.prod.css
<< 200 https://cdn.auction.com/residential/page-assets/framework.b3b944740c.prod.js
<< 200 https://cdn.cookielaw.org/scripttemplates/otSDKStub.js
<< 200 https://static.hotjar.com/c/hotjar-45084.js?sv=5
<< 200 https://adc-tenbox-prod.imgix.net/resi/propertyImages/no_image_available.v1.jpg
<< 200 https://cdn.mlhdocs.com/rcp_files/auctions/E-19200/photos/thumbnails/2985798-1-G_bigThumb.jpg
# ...
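
As mentioned above, we could save some bandwidth by aborting requests for heavy resources before they go out. A minimal sketch using page.route; the blocked resource types here are our own choice:

# Sketch: abort requests for heavy resource types to save bandwidth.
def block_heavy_resources(route):
    if route.request.resource_type in ("image", "font", "media"):
        route.abort()
    else:
        route.continue_()

page.route("**/*", block_heavy_resources)  # register before page.goto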
It is not the ideal solution, but we noticed that sometimes the script stops altogether before loading the content. To avoid those cases, we change the waiting method to the wait_for_selector function and wait for an element that only appears once the results render, "h4[data-elm-id]".

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if ("v1/search/assets?" in response.url):
            print(response.json()['result']['assets']['asset'])

    # ...
    page.on("response", handle_response)
    # really long timeout since it gets stuck sometimes
    page.goto(url, timeout=120000)
    page.wait_for_selector("h4[data-elm-id]", timeout=120000)
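
As an alternative, not in the original flow, Playwright's expect_response helper blocks until a response matching a predicate arrives, which removes the separate selector wait; a sketch:

# Sketch: block until the search API responds, then parse it directly.
with page.expect_response(
        lambda r: "v1/search/assets?" in r.url, timeout=120000) as resp_info:
    page.goto(url, timeout=120000)
assets = resp_info.value.json()['result']['assets']['asset']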
[
    {
        "item_id": "E192003",
        "global_property_id": 2981226,
        "property_id": 5444765,
        "property_address": "13841 COBBLESTONE CT",
        "property_city": "FONTANA",
        "property_county": "San Bernardino",
        "property_state": "CA",
        "property_zip": "92335",
        "property_type": "SFR",
        "seller_code": "FSH",
        "beds": 4,
        "baths": 3,
        "sqft": 1704,
        "lot_size": 0.2,
        "latitude": 34.10391,
        "longitude": -117.50212,
        ...
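
Rather than printing each batch, we would normally accumulate the parsed items for later processing. A minimal sketch, assuming the same JSON shape as above:

# Sketch: store the parsed assets instead of printing them.
assets = []

def handle_response(response):
    if "v1/search/assets?" in response.url:
        assets.extend(response.json()['result']['assets']['asset'])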
The next target is Twitter, where the tweet content arrives in a call to an endpoint named TweetDetail. In cases like this one, the easiest path is to check the XHR calls in the network tab in DevTools and look for some content in each request. It is an excellent example because Twitter can make 20 to 30 JSON or XHR requests per page view (a discovery sketch for narrowing those down follows the script).

import json
from playwright.sync_api import sync_playwright

url = "https://twitter.com/playwrightweb/status/1396888644019884033"

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if ("/TweetDetail?" in response.url):
            print(json.dumps(response.json()))

    browser = p.firefox.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()
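
To locate the right endpoint among those 20 to 30 calls, a first pass that just logs every JSON-looking response helps; a sketch, where the content-type check is our own heuristic:

# Discovery sketch: print the URL of every JSON response so we can spot
# which of the many calls actually carries the tweet data.
def log_json_responses(response):
    if "application/json" in response.headers.get("content-type", ""):
        print(response.url)

page.on("response", log_json_responses)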
The same approach works for nseindia.com, where the live equity table is populated from an equity-stockIndices endpoint.

from playwright.sync_api import sync_playwright

url = "https://www.nseindia.com/market-data/live-equity-market"

with sync_playwright() as p:
    def handle_response(response):
        # the endpoint we are interested in
        if ("equity-stockIndices?" in response.url):
            # print each row's symbol and last traded price
            for item in response.json()['data']:
                print(item['symbol'], item['lastPrice'])

    browser = p.firefox.launch()
    page = browser.new_page()
    page.on("response", handle_response)
    page.goto(url, wait_until="networkidle")
    page.context.close()
    browser.close()
# Output:
# NIFTY 50 18125.4
# ICICIBANK 846.75
# AXISBANK 845
# ...