Web scraping legality is a complicated topic: With the rise of digital services in the last 20 years, more and more governments are wondering if data extraction should be regulated – and what measures should be introduced. In this article, we’re answering the “Is web scraping illegal?” question and explaining the nuances of legality of scraping websites.
Disclaimer: This article is an analysis of legal practice related to web scraping and web crawling; it's not legal advice. We encourage you to contact legal professionals who specialize in web scraping to review each scraping or crawling project and the kind of data it deals with.
Web Scraping Mythbusting
To discuss the legality of data scraping, we can start with the most common misconceptions that many web scraper users share. Every now and then, we hear that “web scraping is hacking” or, alternatively, that all data on the web is free for the taking and republishing. The reality is more complex, so let’s tackle the biggest myths.
Web scraping is illegal
As of 2022, there are no explicit laws or rules for web scraping that operate on a blanket policy level. At its core, parsing web data is perfectly legal: instead of accessing it via regular human-friendly interfaces (e.g. from a web browser on a mobile device), you’re automating this process via scrapers. Still, the devil is in the details: how you conduct web scraping is important – and it’s the basis for most ethical and legal arguments for/against data collection.
Web scrapers operate in a gray area of law
Not really. Although there is indeed no “data collection constitution” that governs every web scraping case, there are still:
- Region-specific laws like the General Data Protection Regulation (GDPR) in the European Union, and the California Consumer Privacy Act (CCPA) and California Privacy Rights Act (CPRA) in California, US.
- Website-specific regulations like terms of service, privacy policies, etc.
Legitimate web scraping companies respect these regulations because they’re building their business model to be sustainable – and operating in a gray area isn’t the best way of doing it.
Web scraping is hacking
Not at all. Although both hacking and web scraping ultimately aim to obtain data, the vital difference between them is the data’s status: web scraping operates on publicly available data, while hacking typically involves exploiting a system’s vulnerabilities to reach data that isn’t public, which can constitute computer fraud.
Both hackers and web scraping service providers use the same tools, bots, but their intentions are different: malicious bots are typically used for attacks such as DDoS, whereas web scraping bots simply collect publicly available data and, when operated properly, break no laws in doing so.
Web scrapers are stealing data
Web scrapers parse publicly available data. Generally, this doesn’t constitute stealing under regulations like GDPR or CCPA. Moreover, defining stealing is much easier with material goods: a stolen apple is a net loss for the store because the apple is gone forever. Scraped data, on the other hand, is a copy: the original stays where it was, so the data owner hasn’t necessarily suffered damage.
Still, some platforms claim that accessing their data via automation is stealing – we’ll take a closer look at their arguments later on.
How to Gather Data Ethically
To reiterate the disclaimer from the top of the article: this is not legal advice; rather, these are general guidelines that can help you set up an ethical data collection pipeline. The key to a sustainable web scraping service is a good relationship with website owners: in the long run, an ethical scraper will remain operational while its less careful counterparts get blocked.
ToS and robots.txt: The Terms of Service page can provide a great overview of web scraping dos and don'ts for the given website. The robots.txt file plays a similar role.
Crawl rate: Choose a proper crawl rate (i.e., the number of requests per minute). Different websites have different crawl rate preferences, and some state them in robots.txt, for example through the nonstandard Crawl-delay directive. If the desired crawl rate isn’t specified there, a good practice is to send one request every 10-15 seconds.
User agent string: Set a descriptive User-Agent string on your crawler so that it identifies itself and shows that it isn’t a “bad bot”. Additionally, this string can link to a page that explains why you are collecting the given data set.
Communication: Sometimes, you may be unsure if your scraping is in line with the website’s ToS. In this case, it’s better to communicate your intentions to the website owner: Even though you can still bypass their restrictions via proxies, collaboration is better than confrontation.
Official API: Last but not least, you can use the website’s official API (if one exists). Although it may be more limiting in terms of requests per minute, it may offer better functionality and end up being a much easier way of collecting the website’s data.
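The crawl rate and user agent guidelines above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production crawler: the bot name, the info URL, and the 10-second fallback are assumptions chosen for the example, and Crawl-delay is a nonstandard robots.txt directive that not every site provides.

```python
import time
import urllib.robotparser

# Hypothetical bot identity; replace with your own name and info page.
USER_AGENT = "ExampleResearchBot/1.0 (+https://example.com/bot-info)"
DEFAULT_DELAY_SECONDS = 10.0  # assumed fallback when robots.txt gives no delay


def crawl_delay_from_robots(robots_txt: str, user_agent: str = USER_AGENT) -> float:
    """Return the Crawl-delay that applies to this agent, or the fallback."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


class Throttle:
    """Keeps consecutive requests at least `delay` seconds apart."""

    def __init__(self, delay: float, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return the pause."""
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.delay - (self._clock() - self._last))
            if pause:
                self._sleep(pause)
        self._last = self._clock()
        return pause


def request_headers() -> dict:
    """Headers that identify the crawler to the website owner."""
    return {"User-Agent": USER_AGENT}
```

In use, a crawler would call `Throttle.wait()` before every request and send `request_headers()` with it, so the site owner can see who is crawling and at what pace.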
What About Scraping Personal Data?
With the rise of platforms like Facebook, LinkedIn, and Twitter, more and more people started to understand the importance of their personal data and the legal issues around web crawling. As companies profited from user data and malefactors used it to commit identity fraud, lawmakers in the European Union and California introduced regulations to address these problems.
As their names suggest, the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) govern the collection and use of personal data. Generally, these laws apply if the people whose data you collect are located in the given region or if your company operates there. Although these regulations may appear strictly region-specific, in practice many non-EU companies follow GDPR anyway, because it is widely known and comprehensive, just to be safe. As of 2022, GDPR is increasingly treated as a de facto global baseline for data collection.
What is personal data anyway?
Scraping personal data is ill-advised without understanding what this data actually includes. Let’s see how GDPR interprets it:
Personal data is any information which is related to an identified or identifiable natural person.
Although this definition is somewhat vague, it does highlight that personal data can include a myriad of identifiers: name, surname, location, address, telephone, credit card number, personnel number, account data, number plate, customer number, and more.
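Before storing scraped records, it can help to flag fields that look like they contain such identifiers. The sketch below is only an illustration: the two regular expressions are rough assumptions that will miss many cases (names and addresses, for instance) and produce false positives, the sample record is made up, and passing this check does not make a pipeline GDPR-compliant.

```python
import re

# Rough, illustrative patterns; real identifiers come in many more shapes.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def flag_personal_data(record: dict) -> dict:
    """Map each field name to the identifier types its value seems to contain."""
    flags = {}
    for field, value in record.items():
        hits = sorted(
            name for name, pattern in PATTERNS.items() if pattern.search(str(value))
        )
        if hits:
            flags[field] = hits
    return flags


# Made-up record for demonstration purposes.
sample = {
    "title": "Senior engineer",
    "contact": "jane.doe@example.com",
    "note": "call +1 (555) 010-9999",
}
flagged = flag_personal_data(sample)
```

Flagged fields can then be dropped, pseudonymized, or sent for legal review, depending on what your lawyer recommends.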
Publicly available personal data
It’s tempting to think that public data, being public, is ultimately free from any regulation, and that GDPR or CCPA are only concerned with private data. In reality, this is where the two laws have different viewpoints:
Under CCPA, there’s the “publicly available” data subtype. Web scraping companies argue that LinkedIn profiles, for instance, consist of publicly available data. Platform owners, however, have taken this issue to court: in hiQ Labs v. LinkedIn, LinkedIn claimed that hiQ Labs, a small data analytics company, had no right to scrape its data. The US Court of Appeals for the Ninth Circuit sided with hiQ Labs:
It is likely that when a computer network generally permits public access to its data, a user’s accessing that publicly available data will not constitute access without authorization under the CFAA (Computer Fraud and Abuse Act).
Under GDPR, publicly available personal data is still personal, so the regulation’s protection measures apply to every subtype of personal data.
How to scrape personal data
As with any data collection endeavor, you should start by analyzing the legal side: Where is my company located? Where are the users located? Which jurisdiction applies in this case? The best practice is contacting a lawyer to create a foolproof web scraping template, taking care of every detail to avoid possible problems.
Another good idea is taking a closer look at GDPR, which is evolving into a de facto global standard, with more and more non-European companies adopting it. Even if your business isn’t in the EU and your target users aren’t in the EU, you can still benefit from understanding your users’ expectations regarding their personal data.
Copyrighted Content Scraping
Most web content creates value for its consumers, so it’s only natural that it’s subject to copyright: image files, video files, audio files, and text are probably the most common web scraping targets – and they’re typically all copyrighted content, save for some datasets whose license explicitly allows web scraping. Sure, plain facts are exempt from copyright – but scraping facts isn’t the most exciting thing to do.
Copyright, by definition, prevents you from copying protected content without the author’s permission or another legal basis. Here’s the problem: web scraping is copying content, and seeking individual permission for each piece of content might simply take too much time. The alternative is justifying your data collection via a legal basis, so let’s see what the regulations have to say.
Data collection in the European Union
In the EU, text and data mining of copyrighted content is covered by the Directive on Copyright in the Digital Single Market (also known as the DSM directive), which defines text and data mining as…
Any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes, but is not limited to, patterns, trends and correlations;
… and provides room for legal web scraping! The caveat, of course, is as follows: you need to generate value via web scraping. This means that parsing a set of e-books to use natural language processing to analyze them is OK – but doing the same to simply re-publish them isn’t allowed.
Fair use in the United States
In the United States, the fair use doctrine plays a comparable role in making scraping legally defensible. Similarly to its European counterpart, this doctrine favors uses that generate new value and offer a new and meaningful way of interacting with the content, rather than uses that merely substitute for the original.
Do not publish original works
Both the American and European rules highlight the need to respect the author’s business model: republishing original works doesn’t create anything new and only confuses potential users. By doing this, a scraping service turns into a piracy service, and any company will be quick to send a cease-and-desist letter if it notices that its content is being stolen.
The DSM directive has also made it easier for web scraping service providers to tell apart website owners who are willing to allow data collection from those who aren’t. If the owner doesn’t want to allow scraping of their website, they can say so in a machine-readable format (the robots.txt file being a prime example of a machine-readable counterpart to the ToS page). With this change, web scraping companies have a clearer signal to rely on: their scraping tool can detect that the target website wants to restrict crawling and skip it, which reduces the risk of legal repercussions.
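A scraper can read this kind of machine-readable opt-out before it fetches anything. The sketch below uses Python’s standard `urllib.robotparser` on robots.txt content that has already been downloaded; the bot name and the example URLs are assumptions made up for illustration, and robots.txt is only one possible way for a site to express a reservation.

```python
import urllib.robotparser

USER_AGENT = "ExampleResearchBot"  # hypothetical crawler name


def allowed_urls(robots_txt: str, urls: list, user_agent: str = USER_AGENT) -> list:
    """Return only the URLs that the robots.txt rules allow this agent to fetch."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Made-up rules and URLs for demonstration purposes.
rules = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "https://example.com/articles/1",
    "https://example.com/private/report",
]
permitted = allowed_urls(rules, candidates)
```

Here only the article URL remains in `permitted`, because the rules disallow everything under `/private/` for all agents.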
CFAA and criminal liability in the US
A US-specific data collection problem is caused by the Computer Fraud and Abuse Act (CFAA), which some website owners use to answer the “Is it legal to scrape data from websites?” question with a resounding “No.” Their main argument relies on CFAA’s phrase “without authorization” (similar to trespassing in the real world), which isn’t clearly defined and whose meaning has been contested in various legal battles.
April 18, 2022 was an important day for the legality of web scraping: in hiQ Labs v. LinkedIn, the US Court of Appeals for the Ninth Circuit, relying on the Supreme Court’s 2021 ruling in Van Buren v. United States, treated accessing data without authorization as “obtaining information from computer networks or databases to which their computer access does not extend.” This decision confirms that scraping publicly available data is unlikely to violate the CFAA: the court explained that public websites, by design, do not impose limitations on access, so CFAA norms cannot be applied here.
So, Is Data Scraping Legal?
Despite the recent judicial developments in this field, scraping data from websites still comes with a few caveats. For all its popularity, GDPR remains a region-specific regulation: it does not bind companies in the Americas, Asia, Africa, and Oceania unless they operate in the EU or handle the data of people located there. Moreover, non-tech website owners may have a biased opinion regarding the legality of data scraping.
Nevertheless, more and more courts are ruling in favor of web scraping and, more importantly, website owners are accepting this change as well.