I went with a t2.medium instance for this project. This comes with 2 vCPUs and 4GB of RAM, so it's powerful enough to run Puppeteer, as well as what we're going to add later. An Ubuntu 18.04 instance is a good starting point.

Puppeteer's bundled Chromium depends on a number of shared libraries that a stock Ubuntu server doesn't ship with, so install those first:

sudo apt-get install -y ca-certificates fonts-liberation libappindicator3-1 libasound2 libatk-bridge2.0-0 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgbm1 libgcc1 libglib2.0-0 libgtk-3-0 libnspr4 libnss3 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcomposite1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 lsb-release wget xdg-utils
Next, install Node.js 14 via the NodeSource setup script:

curl -sL https://deb.nodesource.com/setup_14.x -o nodesource_setup.sh
sudo bash nodesource_setup.sh
sudo apt-get install -y nodejs
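It's worth a quick sanity check before moving on; the version printed should start with v14:

node -v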
Now create a package.json for the agent. The axios dependency is there for communicating with the app server:

{
  "name": "agent-function",
  "version": "0.0.1",
  "dependencies": {
    "axios": "^0.19.2",
    "puppeteer": "10.0.0",
    "puppeteer-extra": "3.1.8",
    "puppeteer-extra-plugin-stealth": "2.7.8"
  }
}
With that in place, install the dependencies:

npm install
Once the install finishes, you should have all the necessary pieces in place. Let's use a very simple Node script to verify that Puppeteer is installed and working.

const puppeteer = require("puppeteer-extra");
async function crawl() {
  console.log("It worked!!!");
}

puppeteer
  .launch({
    headless: true,
    executablePath:
      "./node_modules/puppeteer/.local-chromium/linux-884014/chrome-linux/chrome",
    ignoreHTTPSErrors: true,
    args: [
      "--start-fullscreen",
      "--no-sandbox",
      "--disable-setuid-sandbox"
    ]
  })
  .then(crawl)
  .catch(error => {
    console.error(error);
    process.exit();
  });
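One note: requiring puppeteer-extra alone doesn't activate any plugins; you have to register them with puppeteer.use(). To turn on the stealth plugin we added to package.json, you'd put something like this near the top of the script:

const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

// Register the stealth plugin so its evasions apply to every browser we launch.
puppeteer.use(StealthPlugin());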
When you execute this script, you should see It worked!!! print to the console.

Next, let's make crawl do some real work: perform a Google search and take a screenshot of the results.

async function crawl(browser) {
  const page = await browser.newPage();
  await page.goto("https://www.google.com/?hl=en");

  // Find an input with the name 'q' and type the search query into it, while
  // pausing 100ms between keystrokes.
  const inputHandle = await page.waitForXPath("//input[@name = 'q']");
  await inputHandle.type("puppeteer", { delay: 100 });
  await page.keyboard.press("Enter");
  await page.waitForNavigation();

  await page.screenshot({ path: "./screenshot.png" });
  await browser.close();
}
This version of crawl navigates to Google (passing hl=en to request the English version), enters the search query, and presses enter. The waitForNavigation method pauses the script until the browser emits the load event (i.e. the page and all of its resources, such as CSS and images, have loaded). This is important, because we'd like to wait until the results are visible before we take the screenshot. You should see a screenshot.png after running the script.

Next, let's send the browser's traffic through a proxy by passing Chrome a proxy-server value in its launch arguments:

puppeteer
  .launch({
    headless: true,
    executablePath:
      "./node_modules/puppeteer/.local-chromium/linux-884014/chrome-linux/chrome",
    ignoreHTTPSErrors: true,
    args: [
      `--proxy-server=${proxyUrl}`, // Specifying a proxy URL.
      "--start-fullscreen",
      "--no-sandbox",
      "--disable-setuid-sandbox"
    ]
  })
Here, proxyUrl might be something like http://gate.dc.smartproxy.com:20000. Most proxy configurations will require a username and password, unless you're using IP white-listing as an authentication method. You'll need to authenticate with that username/password combination before making any requests:

async function crawl(browser) {
  const page = await browser.newPage();
  await page.authenticate({ username, password });
  await page.goto("https://www.google.com/?hl=en");
}
Now that the agent can browse through a proxy, we can collect the actual ranking data. Within crawl, we page through the results, gathering every search result link as we go:

let rankData = [];
let pages = 5; // Assumed for this snippet: the number of result pages to crawl.
while (pages) {
  // Find the search result links -- they are children of div elements
  // that have a class of 'g', while the links themselves must also
  // have an H3 tag as a child.
  const results = await page.$x("//div[@class = 'g']//a[h3]");

  // Extract the links from the tags using a call to 'evaluate', which
  // will execute the function in the context of the browser (i.e. not
  // within the current Node process).
  const links = await page.evaluate(
    (...results) => results.map(link => link.href),
    ...results
  );

  // Locate the 'Next' pagination link, if there is one.
  const [next] = await page.$x(
    "//div[@role = 'navigation']//a[descendant::span[contains(text(), 'Next')]]"
  );

  rankData = rankData.concat(links);
  if (!next) {
    break;
  }

  await next.click();
  await page.waitForNavigation();
  pages--;
}
When the loop finishes, we report the results back to the app server:

axios
  .post(`http://172.17.0.1/api/keywords/${keywordID}/callback/`, {
    secret_key: secretKey,
    proxy_id: proxyID,
    results: rankData,
    blocked: blocked,
    error: ""
  })
  .then(() => {
    console.log("Successfully returned ranking data.");
  });
Don't worry too much about the blocked or error variables here; we'll get into error handling in a moment. The most important thing here is the rankData variable, which refers to the list containing all of the search result links.
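For context, here's a minimal sketch of what the receiving side of this callback might look like. This is purely illustrative -- the route shape, the secret check against an environment variable, and the use of Express itself are assumptions, not the app's actual implementation:

const express = require("express");
const app = express();
app.use(express.json());

// Hypothetical handler for the agent's callback payload.
app.post("/api/keywords/:keywordID/callback/", (req, res) => {
  const { secret_key, proxy_id, results, blocked, error } = req.body;
  // Reject callbacks that don't carry the shared secret (assumed scheme).
  if (secret_key !== process.env.SECRET_KEY) {
    return res.status(403).end();
  }
  // Persist the ranking data and the proxy's blocked status here.
  console.log(req.params.keywordID, proxy_id, results.length, blocked, error);
  res.status(204).end();
});

app.listen(80); // Matches the callback URL above (assumed deployment detail).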
You may also have noticed the blocked value. Let's take a look at how we determine whether the scraper has been blocked.

let blocked = false;
try {
  const [captcha] = await page.$x("//form[@id = 'captcha-form']");
  if (captcha) {
    console.log("Agent encountered a CAPTCHA");
    blocked = true;
  }
} catch (e) {
  // If the XPath lookup itself fails, assume we weren't blocked.
}
This code checks for the presence of a form with the ID captcha-form and sets the blocked value to true if so. As we'll see later, if a proxy IP is reported as blocked too many times, the app will no longer use that IP address.
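As a rough sketch of that app-side bookkeeping (the function name, threshold, and in-memory store below are all assumptions for illustration, not the app's real code):

// Hypothetical bookkeeping: retire a proxy after repeated blocks.
const BLOCK_THRESHOLD = 3; // Assumed cutoff; tune for your proxy pool.
const proxies = new Map(); // proxy_id -> { url, blockedCount, active }

function recordCallback(proxyID, blocked) {
  const proxy = proxies.get(proxyID);
  if (!proxy) return;
  if (blocked) {
    proxy.blockedCount += 1;
    // Stop assigning this IP to agents once it trips the threshold.
    if (proxy.blockedCount >= BLOCK_THRESHOLD) {
      proxy.active = false;
    }
  } else {
    // A successful crawl suggests the IP is healthy again (assumed policy).
    proxy.blockedCount = 0;
  }
}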