Introduction to cURL: A Light and Powerful Web Scraping Tool

cURL is an essential building block of the web scraping pipeline. In this article, we're diving deep into how this utility works — and how to combine it with proxies to improve data collection.

Over the last few years, we’ve been exploring the ins and outs of data collection: how web scraping differs from web crawling, how to collect data responsibly, how to use residential proxies, and so on. Although web scraping often looks like magic (we input a command and get structured data from any place on the web in return!), it’s really just a number of building blocks (internet protocols, web crawlers, utilities, and more) that work together to make data collection possible.

One of these building blocks is cURL, a command line tool for transferring data. In this article, we’re taking a closer look at this utility and exploring how it works, its pros and cons, and how to use it to gather data effectively.

What is cURL?

Here’s a quick definition: cURL is an open-source command-line utility for transferring data using URL syntax. This definition packs in a few key terms that can help us understand cURL even better, so let’s explore them in greater detail:

Open-source: cURL isn’t a proprietary program that you have to pay for — instead, it's a free project maintained by the programming community. As Everything curl, the most extensive cURL guide, describes it:

A funny detail about Open Source projects is that they are called "projects", as if they were somehow limited in time or ever can get done. The cURL "project" is a number of loosely coupled individual volunteers working on writing software together with a common mission: to do reliable data transfers with Internet protocols, and give the code to anyone to use for free.

This means that, after reading this article, you can visit cURL’s page on GitHub and contribute to the project by adding new features or fixing some pesky bugs.

Command-line: cURL doesn’t have a GUI (graphical user interface); it only has a CLI (command-line interface). This means that you can’t operate it through graphical elements like buttons or drop-down menus. Instead, you run cURL inside a terminal, an application that accepts typed commands.

Making the transition from GUIs to CLIs can be disorienting, but it can also increase your productivity: CLI commands can be saved in scripts, repeated, and combined with other tools, so tasks that take many clicks in a GUI can be automated.

Utility: cURL isn’t a full-blown desktop application; it’s a lightweight utility. Thanks to its small size, operating system vendors ship cURL with their products, including Windows 10 (version 1803 or later), macOS, and most Linux distros.

Transferring data: cURL is only designed for one thing — transferring data — but it does this one thing exceptionally well.

URL: cURL’s full name, client URL, hints at how it works: you tell it which resource to transfer by passing a URL, and the URL’s scheme (such as http:// or ftp://) tells cURL which protocol to use. Let’s see how this works in practice.

cURL usage crash course

Since cURL is a command-line utility, you’ll need a terminal application to run it: some good options are PowerShell on Windows and Terminal on macOS. Open your terminal app and type `curl`. (In older Windows PowerShell versions, `curl` is an alias for the Invoke-WebRequest cmdlet, so type `curl.exe` there instead.) If the utility is properly installed, you’ll get the default message:

curl: try 'curl --help' or 'curl --manual' for more information.

Sending requests

At its most basic level, cURL only needs a URL to access its data. Here’s an example:

curl www.website.com

Running this command prints the response body, typically the HTML of website.com’s home page, to the terminal.
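In a scraping workflow you usually want the response in a file rather than on screen. The following is a minimal sketch of one way to do that; the helper name `fetch_page` and the output file name are assumptions made for illustration, while the flags themselves are standard cURL options.

```shell
# Hypothetical helper: download URL ($1) into FILE ($2).
#   -f   fail on HTTP errors (4xx/5xx) instead of saving the error page
#   -sS  hide the progress meter but still print error messages
#   -L   follow redirects
#   -o   write the body to a file instead of the terminal
fetch_page() {
  curl -fsSL -o "$2" "$1"
}

# Usage (www.website.com is the placeholder host used in this article):
# fetch_page http://www.website.com page.html
```

Wrapping the call in a function keeps the flag set in one place, so every request in a script behaves the same way.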

cURL’s behavior is controlled by options (also called flags) that you add to the command, and several options can be combined in a single call. This is why the general form of a cURL command looks like this:

curl [options] [URL]

Here’s a real-world example:

curl --ftp-ssl ftp://website.com/cat.txt

In the command above, the --ftp-ssl option (named --ssl in current cURL versions) asks cURL to try to secure the FTP connection with SSL/TLS while downloading a file titled cat.txt.

cURL supports a myriad of data transfer protocols: DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET, and TFTP. The protocol is selected by the scheme at the start of the URL. If you omit the scheme, cURL generally assumes HTTP, so to use a different protocol, specify it in the URL:

curl ftp://website.com
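Not every cURL build includes every protocol, because some depend on optional libraries chosen at compile time. As a rough sketch, the hypothetical helper below (`curl_supports` is a name invented here) checks the "Protocols:" line that `curl --version` prints to see whether a given protocol is available on your machine.

```shell
# Hypothetical helper: succeed if the installed curl build supports protocol $1.
# `curl --version` prints a "Protocols:" line listing what was compiled in;
# we split that line into one word per line and look for an exact match.
curl_supports() {
  curl --version | grep '^Protocols:' | tr ' ' '\n' | grep -qix "$1"
}

# Usage:
# curl_supports sftp && echo "SFTP is available" || echo "SFTP is not built in"
```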

To send data to a server, for example when submitting a form, we use the HTTP POST method. The -d (--data) option supplies the request body and makes cURL send a POST request with the data encoded as a standard web form (application/x-www-form-urlencoded). Here’s how to submit the user name James and the password 12345:

curl -d "user=James&pass=12345&id=test_id&ding=submit" http://www.website.com/getthis/post.cgi
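Plain -d sends the string exactly as typed, so a value containing a space, "&", or "=" would corrupt the form body. One possible way around this is the --data-urlencode option, sketched below. The helper names `run` and `post_login` and the `DRY_RUN` switch are assumptions made for this example, not part of cURL.

```shell
# Set DRY_RUN=1 to print the curl command instead of sending the request.
# (The printed line is a preview only; it drops the shell quoting.)
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: POST a login form to URL ($1) as USER ($2) with PASS ($3).
# --data-urlencode URL-encodes each value before it goes into the form body.
post_login() {
  run curl -sS \
    --data-urlencode "user=$2" \
    --data-urlencode "pass=$3" \
    "$1"
}

# Preview the request for the example above without contacting any server:
DRY_RUN=1 post_login http://www.website.com/getthis/post.cgi James 12345
```

The dry-run switch is handy while building a scraper: you can inspect the exact options before any request leaves your machine.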

Using cURL with proxies for better data gathering

Although you won’t have any problems with small-scale web scraping projects, increasing the number of requests may trigger anti-bot systems that some websites employ to protect themselves. This may result in IP bans, cutting off access to the given website completely. This is where proxies come to the rescue.

A proxy server acts as the middleman between the user and the website. This offers a number of advantages, including better privacy and anonymity. It also makes web scraping easier because it masks your IP address (or that of your web scraping bots) and helps avoid IP bans.

The man curl command opens cURL’s manual, a helpful guide to its options and their usage details. Here’s what it has to say about using proxies:

-x, --proxy <[protocol://][user:password@]proxyhost[:port]>
Use the specified HTTP proxy. If the port number is not specified, it is assumed at port 1080.

Thankfully, we can easily use cURL together with proxies: we only need to add the proxy options and their values, while the rest of the command stays the same:

curl --proxy proxy:port -U "username:password" https://website.com
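When a single proxy is not enough, requests can be spread across several of them. The sketch below shows one simple round-robin approach. The proxy host names, the helper names `pick_proxy` and `fetch_via_pool`, and the `PROXY_USER`/`PROXY_PASS` variables are all placeholders assumed for illustration.

```shell
# Hypothetical proxy pool; replace with your provider's host:port pairs.
PROXIES="proxy1.example.com:8080 proxy2.example.com:8080 proxy3.example.com:8080"

# Print the proxy to use for request number $1 (0-based), round-robin.
# $PROXIES is left unquoted on purpose so the shell splits it into words.
pick_proxy() {
  count=$(printf '%s\n' $PROXIES | wc -l)
  index=$(( $1 % $count + 1 ))
  printf '%s\n' $PROXIES | sed -n "${index}p"
}

# Fetch URL ($2) through the proxy chosen for request number $1.
# PROXY_USER and PROXY_PASS are assumed to be set in the environment.
fetch_via_pool() {
  curl -sS --proxy "$(pick_proxy "$1")" \
    --proxy-user "$PROXY_USER:$PROXY_PASS" "$2"
}

# Usage: spread five requests across the pool.
# for n in 0 1 2 3 4; do fetch_via_pool "$n" https://website.com; done
```

Round-robin is only one possible policy; picking a proxy at random or retiring proxies that start returning errors would fit into `pick_proxy` just as well.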

In addition, you can set a specific user agent when performing requests with cURL and proxies. As explained in another article in our blog, changing user agents can be beneficial because it makes the requests seem more natural (i.e., as if a real user is sending them). Use the -A (--user-agent) option for this. The --proxy-header option would only add the header to the request cURL sends to the proxy itself, not to the one the website receives:

curl -A "Mozilla/5.0" -x proxy:port https://example.com/
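Note that -A (--user-agent) sets the User-Agent header that the destination site receives, whereas --proxy-header only adds headers to the request sent to the proxy itself. Once a command carries a proxy, credentials, and a user agent, it can be easier to keep them in a cURL config file and load it with --config. The sketch below assumes placeholder values throughout; only the option names are real.

```shell
# Write a curl config file (placeholder values) holding the proxy settings
# and the User-Agent: one long option per line, without the leading dashes.
config=$(mktemp)
cat > "$config" <<'EOF'
# Sent to the destination site with every request
user-agent = "Mozilla/5.0"
# Placeholder proxy and credentials
proxy = "proxy.example.com:8080"
proxy-user = "username:password"
EOF

# Usage: every option in the file applies as if typed on the command line.
# curl --config "$config" https://example.com/
```

Keeping credentials in a file also means they do not show up in your shell history.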


Denis Kryukov

Denis Kryukov is using his data journalism skills to document how liberal arts and technology intertwine and change our society
