pip install playwright
playwright install
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.zenrows.com/")
    print(page.title())
    # Web Scraping API & Data Extraction - ZenRows
    page.context.close()
    browser.close()

page.on("request", lambda request: print(
    ">>", request.method, request.url,
    request.resource_type))
page.on("response", lambda response: print(
    "<<", response.status, response.url))
page.goto("https://www.zenrows.com/")
# >> GET https://www.zenrows.com/ document
# << 200 https://www.zenrows.com/
# >> GET https://cdn.zenrows.com/images_dash/logo-instagram.svg image
# << 200 https://cdn.zenrows.com/images_dash/logo-instagram.svg

page also exposes a route method that will execute a handler for each request matching a route or pattern. Let's say that we don't want SVGs to load. A pattern like "**/*.svg" will match requests ending with that extension. As for the handler, we need no logic for the moment, only to abort the request. For that, we'll use a lambda and the route param's abort method.

page.route("**/*.svg", lambda route: route.abort())
page.goto("https://www.zenrows.com/")

A glob pattern such as "**/*.{png,jpg,jpeg}" should work for several extensions at once, but we found otherwise. Anyway, it's doable with the next blocking strategy: a regular expression.

import re
# ...
page.route(re.compile(r"\.(jpg|png|svg)$"),
    lambda route: route.abort())
page.goto("https://www.zenrows.com/")
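
Before wiring the expression above into page.route, we can try it on plain strings to see which URLs it catches. The sketch below assumes the pattern is searched against the full request URL. The example.com URLs, the is_blocked helper, and the IMAGE_PATTERN_WITH_QUERY variant are illustrative additions, not part of the original snippet.

```python
import re

# Pattern from the snippet above: the URL must END with one of the extensions.
IMAGE_PATTERN = re.compile(r"\.(jpg|png|svg)$")

# Hypothetical variant: also tolerate a trailing query string such as "?v=2".
IMAGE_PATTERN_WITH_QUERY = re.compile(r"\.(jpg|png|svg)(\?.*)?$")

SAMPLE_URLS = [
    "https://cdn.zenrows.com/images_dash/logo-instagram.svg",
    "https://example.com/static/photo.png?v=2",  # made-up URL
    "https://example.com/static/app.js",         # made-up URL
]


def is_blocked(pattern, url):
    """Return True when the pattern is found somewhere in the URL."""
    return pattern.search(url) is not None


for url in SAMPLE_URLS:
    print(url,
          is_blocked(IMAGE_PATTERN, url),
          is_blocked(IMAGE_PATTERN_WITH_QUERY, url))
```

Because of the trailing $, the original pattern lets an image through when its URL carries a query string. If the target site versions its assets that way, a variant like the second one may be worth considering.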

The route param exposed in the lambda function above includes the original request and its resource type. And one of those types is image, perfect! The whole list of resource types is available in the Playwright documentation.

This time, match every request ("**/*") and add conditional logic to the lambda function. If the request is for an image, abort it as before. Otherwise, continue with it as usual.

page.route("**/*", lambda route: route.abort()
    if route.request.resource_type == "image"
    else route.continue_()
)
page.goto("https://www.zenrows.com/")
excluded_resource_types = ["stylesheet", "script", "image", "font"]
def block_aggressively(route):
    if route.request.resource_type in excluded_resource_types:
        route.abort()
    else:
        route.continue_()
# ...
page.route("**/*", block_aggressively)
page.goto("https://www.zenrows.com/")
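
Since the handler only touches route.request.resource_type, route.abort, and route.continue_, its decisions can be checked without launching a browser. Here is a minimal sketch of that idea. FakeRequest, FakeRoute, and decision_for are hypothetical stand-ins written for this example; they are not Playwright classes.

```python
from dataclasses import dataclass, field

excluded_resource_types = ["stylesheet", "script", "image", "font"]


def block_aggressively(route):
    # Same handler as above: abort excluded types, let the rest through.
    if route.request.resource_type in excluded_resource_types:
        route.abort()
    else:
        route.continue_()


@dataclass
class FakeRequest:
    """Hypothetical stand-in exposing only the attribute the handler reads."""
    resource_type: str


@dataclass
class FakeRoute:
    """Hypothetical stand-in that records which method the handler called."""
    request: FakeRequest
    calls: list = field(default_factory=list)

    def abort(self):
        self.calls.append("abort")

    def continue_(self):
        self.calls.append("continue")


def decision_for(resource_type):
    """Run the handler on a fake route and report what it decided."""
    route = FakeRoute(FakeRequest(resource_type))
    block_aggressively(route)
    return route.calls[0]


for resource_type in ["document", "image", "script", "xhr"]:
    print(resource_type, "->", decision_for(resource_type))
```

A check like this makes it cheap to adjust excluded_resource_types and confirm that the initial document and XHR calls still go through.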

In route.request, the original URL, the headers, and other request details are available.

Going one step further, we can block every resource that is not of the document type. That will effectively prevent anything but the initial HTML from being loaded.

def block_aggressively(route):
    if route.request.resource_type != "document":
        route.abort()
    else:
        route.continue_()

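To record a HAR file of the session, pass the record_har_path option to the new_page method. As easy as that. Playwright writes the file when the browser context closes.

page = browser.new_page(record_har_path="playwright_test.har")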
page.goto("https://www.zenrows.com/")
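
The recorded HAR file is plain JSON, so a few lines are enough to compare a blocked run with a regular one. The sketch below assumes the standard HAR layout (log.entries, each with a time and a response.content.mimeType). The helper names summarize_har and summarize_har_file, and the sample data, are made up for illustration.

```python
import json
from collections import Counter


def summarize_har(har):
    """Count entries per response MIME type and sum their reported times."""
    entries = har["log"]["entries"]
    by_mime = Counter(
        entry.get("response", {}).get("content", {}).get("mimeType", "unknown")
        for entry in entries
    )
    # Sum of per-request durations; requests overlap, so this is not
    # the wall-clock load time.
    total_ms = sum(entry.get("time", 0) for entry in entries)
    return by_mime, total_ms


def summarize_har_file(path):
    """Load a HAR file from disk (e.g. "playwright_test.har") and summarize it."""
    with open(path, encoding="utf-8") as har_file:
        return summarize_har(json.load(har_file))


# Tiny made-up HAR fragment so the example runs without a recording.
sample_har = {"log": {"entries": [
    {"time": 120.5,
     "request": {"url": "https://www.zenrows.com/"},
     "response": {"status": 200, "content": {"mimeType": "text/html"}}},
    {"time": 30.0,
     "request": {"url": "https://cdn.zenrows.com/images_dash/logo-instagram.svg"},
     "response": {"status": 200, "content": {"mimeType": "image/svg+xml"}}},
]}}

by_mime, total_ms = summarize_har(sample_har)
print(dict(by_mime), total_ms)
```

With a real recording, summarize_har_file("playwright_test.har") should show fewer entries and fewer MIME types once blocking is enabled.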

To compare load times, look at the difference between navigationStart and loadEventEnd. When blocking, it should be under half a second (e.g., 346ms); for regular navigation, above a second or even two (e.g., 1363ms).

page.goto("https://www.zenrows.com/")
print(page.evaluate("JSON.stringify(window.performance)"))
# {"timing":{"connectStart":1632902378272,"navigationStart":1632902378244, ...
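
The difference itself is a simple subtraction over the JSON string printed above. A minimal sketch follows, where load_time_ms is a hypothetical helper. In the sample, navigationStart comes from the output above, while loadEventEnd is a made-up value chosen to reproduce the 346ms figure.

```python
import json


def load_time_ms(performance_json):
    """Return loadEventEnd - navigationStart in milliseconds, or None."""
    timing = json.loads(performance_json)["timing"]
    # loadEventEnd stays at 0 until the load event has completed.
    if not timing.get("loadEventEnd"):
        return None
    return timing["loadEventEnd"] - timing["navigationStart"]


# navigationStart is taken from the output above; loadEventEnd is illustrative.
sample = ('{"timing": {"navigationStart": 1632902378244, '
          '"loadEventEnd": 1632902378590}}')
print(load_time_ms(sample), "ms")
```

In a real run, the argument would be the string returned by page.evaluate("JSON.stringify(window.performance)").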
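
# Open a Chrome DevTools Protocol (CDP) session; this works on Chromium only
client = page.context.new_cdp_session(page)
client.send("Performance.enable")
page.goto("https://www.zenrows.com/")
# Returns counters such as timestamps, durations, and heap sizes
print(client.send("Performance.getMetrics"))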
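
Performance.getMetrics answers with a list of name/value pairs under a "metrics" key, which is awkward to read when printed raw. A small sketch to flatten it into a dictionary is shown below. metrics_to_dict is a hypothetical helper, and the sample values are made up; only the metric names are real CDP ones.

```python
def metrics_to_dict(result):
    """Flatten {"metrics": [{"name": ..., "value": ...}]} into {name: value}."""
    return {metric["name"]: metric["value"] for metric in result["metrics"]}


# Made-up values shaped like a Performance.getMetrics response.
sample_result = {"metrics": [
    {"name": "Documents", "value": 3},
    {"name": "ScriptDuration", "value": 0.012},
    {"name": "JSHeapUsedSize", "value": 1500000},
]}

metrics = metrics_to_dict(sample_result)
print(metrics["ScriptDuration"], metrics["JSHeapUsedSize"])
```

Comparing values such as ScriptDuration between a blocked and a regular navigation gives another view of what the blocking saves.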