

Web scraping has evolved far beyond fetching static HTML from server-side rendered pages. Today’s websites are often dynamic, JavaScript-heavy, and built with frameworks like React, Angular, or Vue. That’s where Playwright shines – offering capabilities that make it particularly well-suited for scraping modern web applications. Let’s learn how to use it!
What Is Playwright?
Playwright is a powerful open-source automation library developed by Microsoft that enables developers to programmatically control web browsers. Built on Node.js, Playwright allows scripts to interact with modern web applications in much the same way a real user would – clicking buttons, filling forms, and navigating through pages.
Originally designed for end-to-end testing, Playwright has quickly gained popularity in the web scraping community thanks to its robust browser automation capabilities and built-in support for dynamic content. Unlike traditional scraping tools that work best with static HTML, Playwright excels at handling JavaScript-heavy websites where content loads dynamically or requires user interaction to appear.
Why Use Playwright for Web Scraping?
JavaScript rendering: Playwright launches real browser instances, meaning it can fully render pages, execute JavaScript, and access content that only appears after client-side processing. This gives it a major edge over traditional HTTP-based scrapers that can’t see dynamic content.
Automation-focused API: With its intuitive and powerful API, Playwright allows you to:
- Wait for elements to appear using built-in waiting mechanisms
- Interact with page elements (clicks, inputs, scrolls)
- Capture screenshots, PDFs, and other media
- Extract structured data from rendered pages using selectors
This programmatic control mimics user behavior, helping you bypass common client-side obstacles to scraping.
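For example, a brief sketch might combine several of these capabilities on a single page (the #load-more selector is purely illustrative):

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // Built-in waiting: pause until the element exists in the DOM
  await page.waitForSelector('#load-more'); // illustrative selector
  await page.click('#load-more'); // user-like interaction

  // Capture media and extract data from the fully rendered page
  await page.screenshot({ path: 'page.png' });
  const headings = await page.$$eval('h1', els => els.map(el => el.textContent));
  console.log(headings);

  await browser.close();
})();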
Multi-browser and headless support: Playwright supports three major browser engines – Chromium, Firefox, and WebKit – allowing you to target a wide range of environments. You can run these in headless mode for speed or headed mode when debugging.
Robust anti-bot evasion: Because it controls real browsers, Playwright can better mimic human interaction than pure HTTP scraping tools. When paired with techniques like user-agent rotation and stealth plugins, it becomes much harder for websites to detect and block scraping activity.
Cross-platform and scalable: Whether you're running scripts locally, deploying them in a CI/CD pipeline, or scaling scraping jobs across servers, Playwright works consistently across operating systems. It's also compatible with Docker, making it easier to build and scale containerized scraping workflows.
Setting Up Playwright
Before you can start scraping with Playwright, you'll need to get the library installed and set up in your environment. The process is straightforward and works across Windows, macOS, and Linux systems.
1. Installation
Playwright is available via npm and supports all major browsers. To get started, open your terminal and run:
npm install playwright
This installs the Playwright library and downloads the necessary browser binaries (Chromium, Firefox, and WebKit) during installation. If you want to install only a specific browser, you can use:
npm install playwright-chromium
2. Basic Usage Example
Here's a minimal example that launches a Chromium browser, navigates to a website, and logs the page title. Let’s see how easy it is to automate browsing tasks using Playwright:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch(); // Use { headless: false } to see the browser
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const title = await page.title();
  console.log(`Page title: ${title}`);
  await browser.close();
})();
3. Running in Headless or Headed Mode
By default, Playwright runs in headless mode (no UI). If you want to observe browser behavior during development, set:
const browser = await chromium.launch({ headless: false });
This opens a visible browser window, making it easier to debug selectors and scraping logic.
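Playwright's slowMo launch option pairs well with headed mode for debugging:

// slowMo inserts a pause (in milliseconds) between each operation,
// so you can watch the script act step by step
const browser = await chromium.launch({ headless: false, slowMo: 250 });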
4. Using Playwright with Docker
For scalable, consistent deployments – especially in production – you can run Playwright in a Docker container. The Playwright team offers an official Docker image that includes all dependencies and browsers pre-installed:
docker pull mcr.microsoft.com/playwright
You can then run your scraping script in a containerized environment, simplifying infrastructure setup and improving reliability across deployments.
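For instance, assuming your script and package.json sit in the current directory (scrape.js is a placeholder for your own entry point), a one-off run might look like this; the Playwright team recommends --ipc=host when running Chromium in Docker:

docker run -it --rm --ipc=host -v "$(pwd)":/app -w /app \
  mcr.microsoft.com/playwright /bin/bash -c "npm install && node scrape.js"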
Integrating Proxies in Playwright
When scraping at scale or targeting content that varies by location, proxies are essential. They allow you to route requests through different IP addresses and regions, avoid rate limits, and reduce the risk of being blocked. Playwright supports proxy configuration natively, making it easy to route browser traffic through your proxy provider of choice.
How to Configure a Proxy in Playwright
When launching a browser instance, Playwright allows you to pass a proxy option in the launch configuration. Here's a basic example using a proxy server:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    proxy: {
      server: 'http://your.proxy.server:port',
      username: 'yourUsername', // optional
      password: 'yourPassword'  // optional
    }
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
})();
Choosing the Right Proxy Provider
A reliable proxy network is critical for sustained scraping. Look for a large and diverse IP pool, multiple geolocation options, high uptime and speed, rotation support, and usage transparency. Infatica’s proxy network includes both residential and datacenter proxies – making it easy to scale and geotarget your scraping operations while avoiding common blocks.
Handling Anti-Scraping Measures
Modern websites employ a wide range of defenses to detect and block automated scraping. While Playwright gives you the power to mimic real user behavior, combining it with the right anti-detection strategies is key to building a resilient scraper.
Headless Detection and Mitigation
Many sites try to detect whether a browser is running in headless mode. To avoid this, launch Playwright in non-headless mode during critical interactions:
const browser = await chromium.launch({ headless: false });
Additionally, randomize browser viewport sizes and user agents, and use tools like playwright-extra with stealth plugins to minimize fingerprinting.
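A minimal sketch using playwright-extra with the stealth plugin (both installed separately via npm install playwright-extra puppeteer-extra-plugin-stealth) might look like this:

const { chromium } = require('playwright-extra');
const StealthPlugin = require('puppeteer-extra-plugin-stealth');

// Register the stealth plugin, which patches common fingerprinting
// signals such as navigator.webdriver
chromium.use(StealthPlugin());

(async () => {
  const browser = await chromium.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();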
User-Agent and Header Spoofing
Default headers often signal automation. Customize them to resemble real browser traffic:
await page.setExtraHTTPHeaders({
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
  'Accept-Language': 'en-US,en;q=0.9'
});
Rotate these headers periodically to simulate diverse user traffic.
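One straightforward approach is to pick a user agent at random per session; the strings below are illustrative, and setting userAgent at context creation is Playwright's supported way to override it:

// Illustrative pool – keep these current and realistic in practice
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
];

const randomUA = userAgents[Math.floor(Math.random() * userAgents.length)];
// A user agent set at context creation applies to every page in it
const context = await browser.newContext({ userAgent: randomUA });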
CAPTCHAs and Bot Challenges
CAPTCHAs can halt scraping if not addressed. Your options include:
- Use services like 2Captcha or Anti-Captcha to solve challenges.
- Detect CAPTCHA elements and implement fallback logic (skip, retry, log) – see the sketch after this list.
- Reduce CAPTCHA triggers with good proxy hygiene and behavior simulation.
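As a rough sketch of the second option – the selector is hypothetical and will vary per site:

// Hypothetical check: many sites embed reCAPTCHA in an iframe
const captchaFrame = await page.$('iframe[src*="recaptcha"]');
if (captchaFrame) {
  // Fallback logic: log the URL and retry later with a fresh proxy/session
  console.warn(`CAPTCHA encountered at ${page.url()} – skipping`);
} else {
  // ...proceed with normal extraction
}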
Session & Cookie Management
Some sites track sessions and cookies to identify bots. Playwright lets you manage sessions easily:
const context = await browser.newContext();
await context.addCookies([{ name: 'session_id', value: 'abc', domain: '.example.com' }]);
Preserve cookies across sessions to appear more consistent and human-like.
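Playwright's storageState API makes this straightforward – persist the state at the end of one run and restore it in the next:

// Save cookies and localStorage to disk at the end of a session
await context.storageState({ path: 'state.json' });

// ...later, create a new context that restores the saved state
const restoredContext = await browser.newContext({ storageState: 'state.json' });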
Proxy Rotation
One of the most effective evasion tactics is rotating IPs using a proxy pool. This:
- Prevents bans tied to specific IPs
- Distributes scraping load
- Makes traffic appear organic
Infatica’s rotating proxy network integrates seamlessly with Playwright and offers location-specific IPs – helping you maintain access and stay under detection thresholds.
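A simple rotation pattern is to launch each browser (or batch of work) with the next proxy from a pool; the endpoints below are placeholders for your provider's:

const { chromium } = require('playwright');

// Placeholder endpoints – substitute your provider's proxy addresses
const proxies = [
  'http://proxy1.example.com:8000',
  'http://proxy2.example.com:8000'
];

(async () => {
  for (const server of proxies) {
    const browser = await chromium.launch({ proxy: { server } });
    const page = await browser.newPage();
    await page.goto('https://example.com');
    console.log(`${server} -> ${await page.title()}`);
    await browser.close();
  }
})();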
Sample Use Case: Scraping a Dynamic Website
To see Playwright in action, let’s walk through a real-world example: scraping product data from a JavaScript-heavy e-commerce website. Many modern sites load product listings dynamically using client-side rendering, making tools like Playwright ideal for the job.
1. Target: A Dynamic Product Page
Let’s say you want to scrape a list of product names and prices from a search results page that loads data via JavaScript (e.g., electronics or clothing items). Here’s how you might do that with Playwright:
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://example-ecommerce.com/search?q=smartphone');

  // Wait for products to load
  await page.waitForSelector('.product-item');

  // Extract data
  const products = await page.$$eval('.product-item', items => {
    return items.map(item => ({
      name: item.querySelector('.product-title')?.innerText.trim(),
      price: item.querySelector('.product-price')?.innerText.trim()
    }));
  });

  console.log(products);
  await browser.close();
})();
2. Handling Pagination
Many product listings span multiple pages. Playwright allows you to click through pages or dynamically generate URLs:
await page.click('button.next-page'); // Or:
await page.goto(`https://example.com/?page=${pageNumber}`);
Looping through pages and aggregating the data as you go lets you collect the full result set.
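Here's a rough sketch of such a loop, reusing the selectors from the example above (the next-page button selector is illustrative, and real sites usually need an extra wait after each click for new content to replace the old):

const allProducts = [];
while (true) {
  await page.waitForSelector('.product-item');
  const products = await page.$$eval('.product-item', items =>
    items.map(item => ({
      name: item.querySelector('.product-title')?.innerText.trim(),
      price: item.querySelector('.product-price')?.innerText.trim()
    }))
  );
  allProducts.push(...products);

  // Stop when the "next" button is gone (selector is illustrative)
  const nextButton = await page.$('button.next-page');
  if (!nextButton) break;
  await nextButton.click();
}
console.log(`${allProducts.length} products collected`);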
3. Error Handling and Retries
Always wrap your scraping logic in try/catch blocks to handle timeouts or missing elements gracefully:
try {
  await page.waitForSelector('.product-item', { timeout: 5000 });
} catch (err) {
  console.warn('Product list did not load in time.');
}
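For transient failures, a small generic retry helper (not a Playwright API – just plain JavaScript) often suffices:

// Retry an async operation up to `attempts` times, pausing between tries
async function withRetries(fn, attempts = 3, delayMs = 2000) {
  for (let i = 1; i <= attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      console.warn(`Attempt ${i} failed: ${err.message}`);
      if (i === attempts) throw err;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Usage: retry a flaky navigation before giving up
await withRetries(() => page.goto('https://example.com', { timeout: 10000 }));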
Scaling and Best Practices
Getting a Playwright scraper working is just the beginning. The real challenge lies in scaling it – making sure it runs efficiently, handles thousands of requests, recovers from errors, and doesn’t get blocked. In this chapter, we’ll walk through some of the key lessons and strategies that help turn a working scraper into a production-grade data pipeline.
Parallelizing Your Scraping
One of Playwright’s great strengths is its ability to run multiple browser contexts or pages at once. Rather than waiting for one task to finish before starting the next, you can launch several sessions in parallel. This significantly speeds up data collection, especially when dealing with large datasets or multiple URLs.
To manage this, you can spin up multiple browser contexts or even run multiple worker processes. Promise-based concurrency, queue systems like Redis with BullMQ, or simple thread pools can all help distribute the load effectively:
const context = await browser.newContext();
// Pages in the same context share cookies and storage; create separate
// contexts via browser.newContext() when you need isolated sessions
const page1 = await context.newPage();
const page2 = await context.newPage();
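Building on that, a sketch of promise-based concurrency might give each URL its own isolated context (assuming a launched browser is already in scope; the URLs are placeholders):

const urls = [
  'https://example.com/page/1',
  'https://example.com/page/2',
  'https://example.com/page/3'
];

// One isolated context per URL; all tasks run concurrently
const results = await Promise.all(urls.map(async (url) => {
  const context = await browser.newContext();
  const page = await context.newPage();
  await page.goto(url);
  const title = await page.title(); // replace with your extraction logic
  await context.close();
  return { url, title };
}));
console.log(results);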
Avoiding Detection with Throttling and Random Delays
As your scraping scales up, so does the risk of detection. Sites often flag bots that hit them too fast or too frequently. That’s why it’s important to slow things down just enough – introducing random delays between requests, rotating user agents, and mimicking real browsing behavior. These subtle changes make your traffic blend in and reduce the chances of being blocked.
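A random delay helper can be as simple as this (the selector in the usage lines is illustrative):

// Sleep for a random interval between min and max milliseconds
const randomDelay = (min = 1000, max = 5000) =>
  new Promise(resolve => setTimeout(resolve, min + Math.random() * (max - min)));

await page.goto('https://example.com');
await randomDelay(); // pause as a human would while reading
await page.click('a.some-link'); // illustrative selector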
Even better, route your requests through proxies to distribute the traffic. A rotating pool of residential or datacenter IPs – such as those offered by Infatica – helps keep your scrapers flying under the radar while expanding your reach across different regions.
Planning for Errors and Outages
No matter how carefully you build your scraper, something will eventually go wrong. A selector might change. A page might time out. A proxy might fail. Instead of trying to avoid all possible issues upfront, build for resiliency.
Catch errors when they occur, retry failed requests, and log everything – successes, failures, and unexpected behavior. This logging is what makes debugging and improving your scraper possible over time.
Keeping an Eye on Performance
As scraping becomes more integral to your operations, monitoring becomes essential. You’ll want to track success rates, identify spikes in failures, and be alerted to bans or major changes in site structure. Whether you build simple dashboards or integrate with tools like Prometheus and Grafana, visibility into how your scrapers are performing helps you stay ahead of problems.
Playwright vs. Other Web Scraping Tools
With so many web scraping tools available today, choosing the right one depends on your use case, technical preferences, and the complexity of the websites you're targeting. Playwright stands out for its modern design, automation power, and robustness against anti-bot systems – but how does it compare to other popular options?
| Feature | Playwright | Puppeteer | Selenium | Scrapy |
|---|---|---|---|---|
| Headless Browser Support | ✓ Chromium, Firefox, WebKit | ✓ Chromium only (official) | ✓ Chrome, Firefox, others | ✗ (requires integration) |
| Cross-Browser Testing | ✓ Native | ✗ | ✓ via WebDriver | ✗ |
| JavaScript Rendering | ✓ Excellent | ✓ Good | ✓ Moderate | ✗ |
| Stealth & Anti-Bot Evasion | ✓ Strong (with plugins) | ✓ Moderate | ✗ Limited | ✗ Requires middleware |
| Performance at Scale | ✓ Efficient with parallelism | ✓ Lightweight | ✗ Slower, heavy processes | ✓ Optimized for speed |
| Language Support | JavaScript, Python, .NET, Java | JavaScript only | Multiple (Python, Java, etc.) | Python only |
| Ease of Use | ✓ Modern, consistent API | ✓ Familiar syntax | ✗ Verbose setup | ✓ Pythonic, declarative |



Comparison Summary
If your target sites are JavaScript-heavy, Playwright offers arguably the most powerful and modern solution, with broad browser support and anti-bot resistance. Puppeteer is simpler and lighter for Chromium-only tasks. Selenium remains useful for traditional testing and broader language support, but it lags in scraping efficiency. Meanwhile, Scrapy is a great option when speed and structured crawling are the priority – and when JavaScript rendering isn't required.