Introduction to Puppeteer: Automating Data Collection

Puppeteer is a powerful Node.js library for automation in Chromium-based browsers — let's take a closer look at how it works and how to set it up for web scraping.

Article content
  1. Headless browsers and automation
  2. What is Puppeteer?
  3. What is Puppeteer used for?
  4. Running Puppeteer in the cloud
  5. Using Puppeteer to capture a web page snapshot
  6. Using Puppeteer to get a list of latest Hacker News articles
  7. Using Puppeteer with proxies

Web scraping has been on the rise over the years — with the data science industry booming and producing countless new business opportunities, more and more companies (and individuals, too) are trying web scraping.

Utilities like Puppeteer are an essential component of many web scraping pipelines. In this article, we’re taking a closer look at this software to better understand web scraping automation and how powerful Puppeteer really is in this field.

Headless browsers and automation

In our previous web scraping tutorial that covered cURL, we tried running different commands to send requests and retrieve data manually, i.e. typing each command again and again.

Although this approach is great for getting the feel of new software when you’re just starting out, it’s not sustainable in the long run: Data collection professionals are expected to parse thousands of web pages, so typing each command manually doesn’t scale. This is where web scraping automation comes to the rescue: We can write small programs called scripts and let our machine do the routine tasks for us.

What is Puppeteer?

In our overview of Python web crawlers, we mentioned that Puppeteer was a powerful Node.js library for automation in Chromium-based browsers — and that it integrated with web crawlers like Pyspider, too.

Here’s a more detailed explanation, courtesy of Puppeteer’s Readme page on GitHub:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

The definition above is rich with technical terms, so let’s unpack them to better understand what Puppeteer actually does:

Node library: Node.js is software that makes it possible to run JavaScript code outside the browser. This is a crucial feature for implementing web scraping functionality as it allows crawler bots to run independently from our browser instance.

API: An application programming interface is a set of functions that one program exposes so that other programs can control it. While human users typically need a mouse/keyboard combo to use the browser, bots send commands to Chrome through Puppeteer’s API (which talks to the browser over the DevTools Protocol) and can do much more.

Headless: As outlined in our headless browsers overview, a headless browser is a web browser that lacks a GUI (graphical user interface). While human users rely on GUI elements (buttons, scrollbars, windows, and so on) to navigate the web, web crawling bots can do so in a text environment, simply following the commands that we provide them. Removing the GUI is therefore a great optimization trick: Rendering these elements causes unnecessary overhead, as bots have no use for them.

What is Puppeteer used for?

As the project’s manual page states, Puppeteer excels at:

  • Turning web pages into PDF files and taking screenshots,
  • Testing user interfaces,
  • Automating form submission,
  • Creating powerful testing environments to check for performance issues,
  • Testing Chrome extensions, and
  • Crawling web pages.

Naturally, when it comes to data collection, we’re most interested in the last point.

Running Puppeteer in the cloud

If you only want to play around with Puppeteer without too many commitments (installing Node.js, for instance), here’s some good news: The Puppeteer team created a dedicated web page where you can run Puppeteer in the cloud. Upon opening the cloud version, you’ll be presented with a code snippet — try pressing the RUN IT button to see what happens.

Using Puppeteer to capture a web page snapshot

Puppeteer is a Node.js library, so we’ll need to install Node.js first. It’s available for all major operating systems from the official Node.js download page.

Although not a prerequisite per se, some JavaScript knowledge will be a plus: Both Puppeteer and a large portion of the web are built with JavaScript, so understanding its syntax will allow you to fine-tune the code snippets you see in this article.

To install Puppeteer onto our machine, we’ll need to run the following command in our terminal app: npm i puppeteer --save. It tells npm (a package manager for Node.js software) to download Puppeteer and add it to the project’s dependencies.

Create a file titled pup-screenshot.js and paste the following snippet into it. This neat little script demonstrates Puppeteer’s ability to load a web page and take a snapshot of it.

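const puppeteer = require('puppeteer');

// The target URL is read from the first command-line argument.
const url = process.argv[2];
if (!url) {
    throw new Error('Please provide a URL as the first argument');
}

async function run() {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto(url);
        // Save the visible part of the page to screenshot.png in the current directory.
        await page.screenshot({ path: 'screenshot.png' });
    } finally {
        // Always close the browser, even if navigation or the screenshot fails.
        await browser.close();
    }
}

run().catch((err) => {
    console.error(err);
    process.exitCode = 1;
});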

In this instance, we’ll receive a screenshot of example.com's main page if we run the following command: node pup-screenshot.js https://example.com.

Using Puppeteer to get a list of latest Hacker News articles

This code snippet retrieves the titles and links of the stories currently listed on the Hacker News front page:

const puppeteer = require('puppeteer');

async function run() {
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        await page.goto('https://news.ycombinator.com/');
        // The callback runs inside the page, so it can use the DOM directly.
        return await page.evaluate(() => {
            // Story links are matched by CSS selector. Hacker News has changed its
            // markup over time (older pages used 'a.storylink'), so check the
            // selector against the live page if the result comes back empty.
            const items = document.querySelectorAll('.titleline > a');
            return Array.from(items, (item) => ({
                url: item.getAttribute('href'),
                text: item.innerText,
            }));
        });
    } finally {
        // Always close the browser, even if scraping fails.
        await browser.close();
    }
}

run().then(console.log).catch(console.error);
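
The href values collected this way are not always absolute: links to Hacker News' own discussion pages tend to be relative (for example item?id=...). The sketch below shows one possible way to clean up the scraped list before storing it. The normalizeStories helper is hypothetical, and the sample data is invented for illustration.

```javascript
// Sketch: post-process scraped { url, text } records.
// normalizeStories is a hypothetical helper; the sample data is made up.
const HN_BASE = 'https://news.ycombinator.com/';

function normalizeStories(stories, base = HN_BASE) {
  const seen = new Set();
  const result = [];
  for (const { url, text } of stories) {
    // Resolve relative links against the site root.
    const absolute = new URL(url, base).href;
    if (seen.has(absolute)) continue; // skip duplicate links
    seen.add(absolute);
    result.push({ url: absolute, text: text.trim() });
  }
  return result;
}

// Invented sample input in the shape returned by the scraping script.
const sample = [
  { url: 'item?id=1', text: ' Ask HN: An example question ' },
  { url: 'https://example.com/', text: 'An example story' },
];
console.log(JSON.stringify(normalizeStories(sample), null, 2));
```

In the scraping script, the same helper could be chained in before printing: run().then(normalizeStories).then(console.log).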

Using Puppeteer with proxies

As we’ve laid out in this article, data collection relies heavily on automation. This is a blessing and a curse: Although automation allows web scraping professionals to scale their projects almost infinitely, it also introduces problems because websites are generally wary of automatic actions. They protect themselves with anti-scraping systems and block suspicious IP addresses or force them to complete CAPTCHAs.

This problem can be addressed by using residential proxies: They mask and rotate your Puppeteer instance’s IP address, making it much easier to collect data. Here’s a code snippet for enabling a proxy connection in Puppeteer:

'use strict';

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Launch chromium using a proxy server on port 9876.
    // More on proxying:
    //    https://www.chromium.org/developers/design-documents/network-settings
    args: [
      '--proxy-server=127.0.0.1:9876',
      // Use proxy for localhost URLs
      '--proxy-bypass-list=<-loopback>',
    ],
  });
  const page = await browser.newPage();
  await page.goto('https://google.com');
  await browser.close();
})();

Here, the line '--proxy-server=127.0.0.1:9876' defines your proxy server’s address and port. Replace it with the address and port of your own proxy.

