Scraper API Functionality
1. Basic Request
What it does: Retrieves the HTML of the page at a specified URL. This is the simplest mode, and it works for sites that do not need JavaScript to display their content.
cURL Example:
curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.whatismybrowser.com/detect/what-is-my-user-agent/"
}'
What happens: The service makes a request to whatismybrowser.com and returns its HTML. The returned page shows that the request was sent with our service's default User-Agent (e.g., Chrome on Linux).
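The same call can be made from a script. The following is a minimal Python sketch that uses only the standard library. The helper names build_scrape_request and fetch are illustrative and not part of the service, the host and key are the placeholders from the cURL example, and the sketch assumes the response body can be read as text.

```python
import json
import urllib.request

API_BASE = "http://example.com:8000"  # placeholder host from the cURL examples
API_KEY = "api_key"                   # placeholder key from the cURL examples


def build_scrape_request(target_url, endpoint="/", **options):
    """Build (but do not send) a POST request equivalent to the cURL example.

    Extra keyword arguments are passed through as JSON fields of the payload.
    """
    payload = {"url": target_url, **options}
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": API_KEY},
        method="POST",
    )


def fetch(request, timeout=60):
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


req = build_scrape_request(
    "https://www.whatismybrowser.com/detect/what-is-my-user-agent/"
)
# html = fetch(req)  # uncomment to send the request to a live service
```

Building the request separately from sending it makes the payload easy to inspect or log before any traffic is generated.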
2. JavaScript Rendering (Headless Browser)
What it does: Loads a page in a full virtual browser, executes all JavaScript, and then returns the final HTML. This uses the /render endpoint.
cURL Example:
curl -X POST 'http://example.com:8000/render' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.whatismybrowser.com/detect/is-javascript-enabled/"
}'
What happens: The service loads the page in a browser, where the page's own JavaScript check runs. The final HTML will state "Yes," confirming that the JS code was executed. This is essential for scraping JavaScript-heavy websites such as single-page applications (SPAs).
3. Geotargeting
What it does: Allows you to make requests through proxy servers located in a specific country. The target site will "see" the request as if it came from a local user.
cURL Example (request from France):
curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://ipinfo.io/json",
"country": "FR"
}'
What happens: The request to ipinfo.io (an IP detection service) is routed through a French proxy. The JSON response from ipinfo.io should show "country": "FR".
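A quick way to confirm that geotargeting took effect is to compare the requested country with the one ipinfo.io reports. The sketch below assumes the service returns the ipinfo.io JSON body unchanged and that country codes are two-letter uppercase strings, as in the example. The helper names and the sample response are made up for illustration.

```python
import json


def build_geo_payload(target_url, country):
    """Return the JSON request body for a geotargeted request."""
    # Assumption: the service expects an uppercase two-letter code like "FR".
    return json.dumps({"url": target_url, "country": country.upper()})


def exit_country(response_text):
    """Extract the country code from an ipinfo.io-style JSON body."""
    return json.loads(response_text).get("country")


payload = build_geo_payload("https://ipinfo.io/json", "fr")

# Made-up sample of an ipinfo.io-style body; not real service output.
sample_response = '{"ip": "203.0.113.7", "country": "FR"}'

# True when the reported country matches the one that was requested.
matches = exit_country(sample_response) == json.loads(payload)["country"]
```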
4. Device Emulation
What it does: Allows you to change the User-Agent to impersonate a mobile device or other types of clients.
cURL Example (mobile device):
curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.whatismybrowser.com/detect/what-is-my-user-agent/",
"device": "mobile"
}'
What happens: The service will send the request with a User-Agent typical for a mobile phone. The target site will return its mobile-optimized version, if available.
5. Sticky Sessions
What it does: Ensures that the IP address, User-Agent, and cookies remain constant across a series of requests linked by a single session_id. This emulates a single user browsing a website.
cURL Example (in 2 steps):
Step 1: Visit a page to receive a cookie.
curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://httpbin.org/cookies/set?mycookie=123",
"session_id": "user-session-abc-123"
}'
Step 2: Go to another page with the same session to check the cookie.
curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://httpbin.org/cookies",
"session_id": "user-session-abc-123"
}'
What happens: The first request sets a cookie mycookie=123 within the user-session-abc-123 session. The second request to a different page using the same session_id automatically includes the stored cookie. The response from httpbin.org will show that it received mycookie=123.
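When scripting multi-step flows, it helps to generate one session_id per simulated user and reuse it for every request in that flow. The sketch below is one way to do this. It assumes the service accepts any unique string as session_id (the example uses "user-session-abc-123"), and the helper names are illustrative.

```python
import json
import uuid


def new_session_id(prefix="user-session"):
    """Return a unique session identifier for one simulated user."""
    return f"{prefix}-{uuid.uuid4().hex}"


def session_payloads(urls, session_id):
    """Build one JSON request body per URL, all tied to the same session."""
    return [json.dumps({"url": url, "session_id": session_id}) for url in urls]


session_id = new_session_id()
step1, step2 = session_payloads(
    [
        "https://httpbin.org/cookies/set?mycookie=123",  # sets the cookie
        "https://httpbin.org/cookies",                   # reads it back
    ],
    session_id,
)
```

Each string in the result can be sent as the body of a POST to the service, in order, exactly as in the two cURL steps.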
6. Custom HTTP Method (e.g., POST)
What it does: Allows sending data to a server using different HTTP methods, like POST, and including a request body.
cURL Example:
curl -X POST 'http://example.com:8000/' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://httpbin.org/post",
"method": "POST",
"body": "{\"login\":\"user\",\"pass\":\"123\"}",
"headers": {"Content-Type": "application/json"}
}'
What happens: The service sends a POST request (not GET) to httpbin.org/post with the specified JSON body. This is used for submitting forms, logging in, or interacting with other sites' APIs.
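Note that "body" is a JSON document encoded as a string inside the outer JSON payload, which is why the cURL example needs backslash-escaped quotes. In a script, serializing twice avoids escaping by hand. This is a minimal sketch that reuses the field names from the example above. The variable names are illustrative.

```python
import json

# Data to deliver to the target site.
form = {"login": "user", "pass": "123"}

payload = {
    "url": "https://httpbin.org/post",
    "method": "POST",
    # Inner serialization: the target's request body, carried as a string.
    "body": json.dumps(form),
    "headers": {"Content-Type": "application/json"},
}

# Outer serialization: the body of the request sent to the scraper service.
request_body = json.dumps(payload)
```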
7. Automatic /auto Mode
What it does: An intelligent endpoint that automatically determines the best strategy to retrieve data from a website. It combines the necessary tools like JS rendering and proxy management to maximize success rates, especially on sites with unknown or dynamic protections.
cURL Example:
curl -X POST 'http://example.com:8000/auto' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.crunchbase.com/organization/openai"
}'
What happens: The user doesn't need to figure out how to get the data. The /auto endpoint handles the complex process automatically and returns the final HTML.
8. Page Screenshotting
What it does: In addition to retrieving HTML, the service can capture graphical screenshots of web pages. This is useful for visually verifying content, archiving the appearance of pages, or analyzing elements that are difficult to parse.
cURL Example:
curl -X POST 'http://example.com:8000/render' \
-H 'Content-Type: application/json' \
-H 'x-api-key: api_key' \
-d '{
"url": "https://www.canadagoose.com/de/de/pr/vapor-jacke-1535UCD.html",
"screenshot": true
}' \
-o response.json
What happens: The service loads the page in a virtual browser, takes a screenshot, and returns the image as a Base64-encoded string within the JSON response. This Base64 string can then be decoded and saved as an image file (e.g., .png).
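Decoding the saved response can be sketched as follows. The name of the JSON field that holds the image is not stated here, so "screenshot" is an assumption. Check the actual response and pass the correct name via the field parameter. The helper save_screenshot is illustrative.

```python
import base64
import json
from pathlib import Path


def save_screenshot(response_path, image_path, field="screenshot"):
    """Decode the Base64 image from a saved JSON response and write it to disk.

    Returns the number of bytes written. The default field name is an
    assumption, not a documented part of the service's response.
    """
    data = json.loads(Path(response_path).read_text(encoding="utf-8"))
    encoded = data[field]
    # Tolerate an optional data-URI prefix such as "data:image/png;base64,".
    if encoded.startswith("data:") and "," in encoded:
        encoded = encoded.split(",", 1)[1]
    image_bytes = base64.b64decode(encoded)
    Path(image_path).write_bytes(image_bytes)
    return len(image_bytes)
```

With the cURL command above, which writes response.json, a call would look like save_screenshot("response.json", "page.png").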
9. Infrastructure & Monitoring
Our Setup:
- All services are dockerized, ensuring consistent environments and easy deployment.
- We have integrated logging with Grafana, where all service logs are sent in real time for observability.
Current Work:
- We are actively improving our log labels for more precise filtering and analysis.
- We are in the process of creating comprehensive dashboards to monitor key performance indicators (KPIs) and operational metrics.
10. Antiban Updates
This is an overview of our current capabilities in bypassing modern anti-bot systems.
Supported Systems: We have developed solutions to handle several major anti-bot vendors. Currently, we support:
- Cloudflare: We have a robust solver for various Cloudflare challenges. This has been tested and confirmed working on sites like Crunchbase and FastPeopleSearch.
- Akamai: Our system can bypass Akamai's bot detection measures.
- PerimeterX: We offer partial support, successfully handling challenges that do not require solving a visual CAPTCHA.
Session Reuse for Efficiency: The sticky session functionality (session_id) is a key part of our antiban strategy. By reusing a previously successful session (with its solved challenges and cookies), users can make subsequent requests significantly faster and more reliably. This reduces the need to launch a full browser for every request, saving resources and increasing stability.
Tests of sticky sessions on Crunchbase (Cloudflare) show that a session can be reused at least 10 times after the initial render.
Issues:
- Kasada: The bypass for this vendor previously worked. However, due to a recent update on their end, it is currently not working.