Scraper API Functionality

1. Basic Request

What it does: Retrieves the HTML of a page at a specified URL. This is the simplest mode and works for sites that don't require JavaScript to display their content.

cURL Example:

curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.whatismybrowser.com/detect/what-is-my-user-agent/"
}'

What happens: The service makes a request to whatismybrowser.com and returns its HTML. The site's response will show that the request came from our service's default User-Agent (e.g., Chrome on Linux).
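For callers working from Python rather than the shell, the same request can be sketched with the standard library alone. The host, the x-api-key header, and the url field are taken from the cURL example; the helper names build_request and fetch_html, the 60-second timeout, and the assumption that the response body is UTF-8 text are illustrative only.

```python
import json
import urllib.request

BASE_URL = "http://example.com:8000"  # placeholder host from the cURL example
API_KEY = "api_key"                   # placeholder key from the cURL example


def build_request(target_url, endpoint="/", **options):
    """Build (but do not send) a POST request to the scraper service."""
    payload = {"url": target_url, **options}
    return urllib.request.Request(
        BASE_URL + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": API_KEY},
        method="POST",
    )


def fetch_html(target_url, endpoint="/", timeout=60, **options):
    """Send the request and return the body as text (assumed to be UTF-8)."""
    request = build_request(target_url, endpoint, **options)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_html("https://www.whatismybrowser.com/detect/what-is-my-user-agent/") mirrors the cURL command above. Extra JSON fields can be passed as keyword arguments, which keeps the helper independent of any particular feature.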

2. JavaScript Rendering (Headless Browser)

What it does: It loads a page in a full virtual browser, executes all JavaScript, and then returns the final HTML. This uses the /render endpoint.

cURL Example:

curl -X POST 'http://example.com:8000/render' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.whatismybrowser.com/detect/is-javascript-enabled/"
}'

What happens: The service loads the page in a browser where JavaScript confirms it is active. The final HTML will state "Yes," proving that the JS code was successfully executed. This is essential for scraping modern websites (SPAs).

3. Geotargeting

What it does: Allows you to make requests through proxy servers located in a specific country. The target site will "see" the request as if it came from a local user.

cURL Example (request from France):

curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://ipinfo.io/json",
"country": "FR"
}'

What happens: The request to ipinfo.io (an IP detection service) is routed through a French proxy. The JSON response from ipinfo.io will correctly show "country": "FR".
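A quick way to confirm the routing is to parse the returned body and compare the country code. The sketch below assumes the service passes the ipinfo.io JSON through unchanged; the function name reported_country and the sample body are made up for illustration and are not real service output.

```python
import json


def reported_country(ipinfo_body):
    """Return the country code found in an ipinfo.io JSON body, or None."""
    try:
        data = json.loads(ipinfo_body)
    except json.JSONDecodeError:
        return None  # body was not JSON (e.g. an HTML error page)
    return data.get("country") if isinstance(data, dict) else None


# Made-up sample body using a documentation-range IP address.
sample_body = '{"ip": "203.0.113.7", "country": "FR"}'
print(reported_country(sample_body) == "FR")  # prints True
```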

4. Device Emulation

What it does: Allows you to change the User-Agent to impersonate a mobile device or other types of clients.

cURL Example (mobile device):

curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.whatismybrowser.com/detect/what-is-my-user-agent/",
"device": "mobile"
}'

What happens: The service will send the request with a User-Agent typical for a mobile phone. The target site will return its mobile-optimized version, if available.

5. Sticky Sessions

What it does: Ensures that the IP address, User-Agent, and cookies remain constant across a series of requests linked by a single session_id. This emulates a single user browsing a website.

cURL Example (in 2 steps):

Step 1: Visit a page to receive a cookie.

curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://httpbin.org/cookies/set?mycookie=123",
"session_id": "user-session-abc-123"
}'

Step 2: Go to another page with the same session to check the cookie.

curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://httpbin.org/cookies",
"session_id": "user-session-abc-123"
}'

What happens: The first request sets a cookie mycookie=123 within the user-session-abc-123 session. The second request to a different page using the same session_id automatically includes the stored cookie. The response from httpbin.org will show that it received mycookie=123.
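When the two steps are scripted, the only thing that ties them together is the shared session_id value. The sketch below builds the two request bodies from the example; the helper names and the generated ID format are hypothetical, and it assumes the service accepts any caller-chosen string as a session_id.

```python
import json
import uuid


def new_session_id(prefix="user-session"):
    """Generate a unique session_id; the naming scheme is arbitrary."""
    return f"{prefix}-{uuid.uuid4().hex[:12]}"


def session_payloads(urls, session_id):
    """Return one JSON request body per URL, all tied to the same session_id."""
    return [json.dumps({"url": url, "session_id": session_id}) for url in urls]


steps = session_payloads(
    [
        "https://httpbin.org/cookies/set?mycookie=123",  # step 1: set the cookie
        "https://httpbin.org/cookies",                   # step 2: read it back
    ],
    session_id="user-session-abc-123",
)
for body in steps:
    print(body)
```

Each printed body can be sent with -d in the cURL commands above. Using new_session_id() instead of a fixed string keeps unrelated crawls from sharing cookies by accident.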

6. Custom HTTP Method (e.g., POST)

What it does: Allows sending data to a server using different HTTP methods, like POST, and including a request body.

cURL Example:

curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://httpbin.org/post",
"method": "POST",
"body": "{\"login\":\"user\",\"pass\":\"123\"}",
"headers": {"Content-Type": "application/json"}
}'

What happens: The service sends a POST request (not GET) to httpbin.org/post with the specified JSON body. This is used for submitting forms, logging in, or interacting with other sites' APIs.
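Note that body is a JSON document carried as a string inside another JSON document, which is why the cURL example escapes the inner quotes by hand. A minimal sketch of producing that payload programmatically follows; the function name post_payload is hypothetical, while the field names come from the example above.

```python
import json


def post_payload(target_url, form, content_type="application/json"):
    """Build the scraper request body for forwarding a JSON POST."""
    return {
        "url": target_url,
        "method": "POST",
        # The target's body is itself JSON, carried as a string field.
        "body": json.dumps(form),
        "headers": {"Content-Type": content_type},
    }


payload = post_payload("https://httpbin.org/post", {"login": "user", "pass": "123"})
request_body = json.dumps(payload)  # the second dumps escapes the inner quotes
print(request_body)
```

Serializing twice avoids hand-written backslashes, which are easy to get wrong once the forwarded body contains quotes or newlines of its own.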

7. Automatic /auto Mode

What it does: An intelligent endpoint that automatically determines the best strategy to retrieve data from a website. It combines the necessary tools like JS rendering and proxy management to maximize success rates, especially on sites with unknown or dynamic protections.

cURL Example:

curl -X POST 'http://example.com:8000/auto' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.crunchbase.com/organization/openai"
}'

What happens: The caller does not have to choose between the basic and /render endpoints or configure proxies. The /auto endpoint selects the strategy itself and returns the final HTML.

8. Page Screenshotting

What it does: In addition to retrieving HTML, the service can capture graphical screenshots of web pages. This is useful for visually verifying content, archiving the appearance of pages, or analyzing elements that are difficult to parse.

cURL Example:

curl -X POST 'http://example.com:8000/render' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.canadagoose.com/de/de/pr/vapor-jacke-1535UCD.html",
"screenshot": true
}' \
-o response.json

What happens: The service loads the page in a virtual browser, takes a screenshot, and returns the image as a Base64-encoded string within the JSON response. This Base64 string can then be easily decoded and saved as an image file (e.g., .png).
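Decoding the saved response.json could look like the sketch below. The JSON key that holds the image is not specified here, so the name "screenshot" is an assumption that should be checked against a real response; the function names are likewise illustrative.

```python
import base64
import json
from pathlib import Path


def decode_screenshot(response_text, field="screenshot"):
    """Return the raw image bytes stored as Base64 under `field`.

    The field name is an assumption; inspect a real response to confirm it.
    """
    encoded = json.loads(response_text)[field]
    # Tolerate a data-URI prefix such as "data:image/png;base64," if present.
    if encoded.startswith("data:") and "," in encoded:
        encoded = encoded.split(",", 1)[1]
    return base64.b64decode(encoded)


def save_screenshot(response_path, image_path, field="screenshot"):
    """Read the saved JSON response and write the decoded image to disk."""
    text = Path(response_path).read_text(encoding="utf-8")
    Path(image_path).write_bytes(decode_screenshot(text, field))
```

After the cURL command above has written response.json, save_screenshot("response.json", "page.png") would produce the image file, provided the assumed field name matches.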

9. Infrastructure & Monitoring

Our Setup:

  • All services are dockerized, ensuring consistent environments and easy deployment.
  • Logging is integrated with Grafana: all service logs are shipped there in real time for observability.

Current Work:

  • We are actively improving our log labels for more precise filtering and analysis.
  • We are in the process of creating comprehensive dashboards to monitor key performance indicators (KPIs) and operational metrics.

10. Antiban Updates

This is an overview of our current capabilities in bypassing modern anti-bot systems.

Supported Systems: We have developed solutions to handle several major anti-bot vendors. Currently, we support:

  • Cloudflare: We have a robust solver for various Cloudflare challenges. This has been tested and confirmed working on sites like Crunchbase and FastPeopleSearch.
  • Akamai: Our system can bypass Akamai's bot detection measures.
  • PerimeterX: We offer partial support, successfully handling challenges that do not require solving a visual CAPTCHA.

Session Reuse for Efficiency: The sticky session functionality (session_id) is a key part of our antiban strategy. By reusing a previously successful session, including its solved challenges and cookies, users can make subsequent requests significantly faster and more reliably. This removes the need to launch a full browser for every request, saving resources and increasing stability.

Tests of sticky sessions on Crunchbase (Cloudflare) show that a session can be reused at least 10 times after the initial render.
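One way to use this in a crawl is to render once and then send a fixed number of lightweight requests on the same session before rendering again. The sketch below only plans the calls. It assumes that follow-up requests go to the basic endpoint, that a repeated /render call with the same session_id refreshes the session, and it treats the tested figure of 10 as a conservative budget rather than a guaranteed limit. The function name plan_requests is hypothetical.

```python
def plan_requests(urls, session_id, reuse_budget=10):
    """Return (endpoint, payload) pairs: one /render call, then up to
    reuse_budget basic calls on the same session, repeating as needed."""
    plan = []
    for index, url in enumerate(urls):
        # Every cycle of 1 + reuse_budget requests starts with a fresh render.
        endpoint = "/render" if index % (reuse_budget + 1) == 0 else "/"
        plan.append((endpoint, {"url": url, "session_id": session_id}))
    return plan


urls = [f"https://www.crunchbase.com/organization/example-{i}" for i in range(12)]
for endpoint, payload in plan_requests(urls, "user-session-abc-123"):
    print(endpoint, payload["url"])
```

With 12 URLs and the default budget, the first and the twelfth request are planned for /render and the ten in between for the basic endpoint.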

Issues:

  • Kasada: The bypass for this vendor was previously functional. Following a recent update on Kasada's side, it currently does not work.