It’s often difficult to tell whether we’re breaking the law when we crawl or scrape the web, because there are still no concrete rules for the Internet. Although governments are introducing regulations on user privacy and intellectual property rights, the virtual world is developing far too fast. By the time governments realized they needed to do something about it, it was too late.
That’s why you may have countless questions about the legality of your online activity — and the worst thing is that it’s impossible to find all the answers because there isn't a single source of truth on this topic. So, what do we do about web scraping and web crawling? Is it legal to gather data online? Let’s figure it out.
Disclaimer: This article is an analysis of legal practice related to web scraping and web crawling — it's not legal advice. We encourage you to contact law professionals to review each web scraping/crawling project on a case-by-case basis.
The difference between scraping and crawling
Before we dive into this topic, let's clarify the difference between these two terms. The two activities are similar, but they differ in some important ways.
Web crawling is the process of visiting target web pages and gathering information from them. Crawling is an integral part of web scraping, but search engines also use it as a standalone process. They send special bots, called crawlers, to websites to determine what a given page is about. Based on this data, search engines decide how to rank the website and when to show it to users searching for something.
Web scraping is a more complex process. First, it performs crawling, that is, the gathering of data. The second step is the analysis and formatting of the gathered information. When web scraping is finished, you are left with just the relevant data you were looking for.
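To make the two steps concrete, here is a minimal sketch in Python. It makes several assumptions: the page content is a hypothetical inline string rather than a real download, and the `product`, `name`, and `price` class names are invented for illustration. Collecting links is the crawling side (finding what to visit next), while turning the markup into structured records is the scraping side.

```python
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded page;
# a real crawler would fetch this over HTTP.
PAGE = """
<html><body>
  <h1>Store</h1>
  <div class="product"><span class="name">Kettle</span><span class="price">$24.99</span></div>
  <div class="product"><span class="name">Toaster</span><span class="price">$31.50</span></div>
  <a href="/page/2">Next</a>
</body></html>
"""


class ProductParser(HTMLParser):
    """Collects links (crawling) and product records (scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []      # URLs a crawler could visit next
        self.products = []   # cleaned, structured records
        self._field = None   # which field the next text chunk belongs to
        self._current = {}   # record being assembled

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "span" and attrs.get("class") in ("name", "price"):
            self._field = attrs["class"]

    def handle_data(self, data):
        if self._field is None:
            return
        self._current[self._field] = data.strip()
        self._field = None
        if "name" in self._current and "price" in self._current:
            # Formatting step: convert the price text into a number.
            self.products.append({
                "name": self._current["name"],
                "price": float(self._current["price"].lstrip("$")),
            })
            self._current = {}


parser = ProductParser()
parser.feed(PAGE)
links = parser.links
products = parser.products
```

After this runs, `links` holds the URLs found on the page and `products` holds one dictionary per item, with the price already converted to a number.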
Is it (il)legal to gather data?
Both web crawling and web scraping involve data gathering, with web scraping delivering more precise results. In most cases, these processes serve legitimate purposes such as business intelligence or scientific research. Sometimes, however, bad actors use web scraping with malicious intent, and that’s something we need to address right away. If you’re using web scraping to gather data that will later help you do something illegal, then your activity is illegal by default.
With that set aside, let’s see if your web scraping activity might break any laws even if you’re using it with legal intentions.
You’re breaking the law if your scraper logs into a person's account
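Data behind a login is not public. To log in, you usually have to accept the website’s terms of service, and those terms often forbid automated data gathering. In this case, even if you’re going to use the gathered information only for personal needs, you are still violating the rules set out by the website owner, because they have forbidden you to gather data from their site.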
You’re breaking the law if you’re gathering copyrighted data
We might think that if information is available online to a broad audience, it can be used however we want, but that’s a dangerous oversimplification. Legally, content published by someone belongs to that person. So, if your aunt posted a long-read on her Facebook page, the content belongs solely to her, regardless of the topic of the post. And if someone copies and pastes this text somewhere else without crediting your aunt, she has the right to sue that person for stealing her intellectual property.
This principle applies even to the most insignificant Facebook post. Can you imagine the consequences of gathering data from, say, The New York Times, or any other website whose owner takes their intellectual property seriously? That’s why you always need to make sure you’re not violating anyone’s ownership rights when scraping data.
You might be breaking some local laws
Here's an example: LinkedIn sued users who anonymously gathered data from the platform. The company gave several reasons for the lawsuit, and one of them was violation of the California Penal Code (LinkedIn is based in California). While the chances that you will get sued for breaking some local law are low, it’s still a risk you should consider.
Web scraping has gained a negative reputation because many people gather data from the internet without thinking. Most websites protect themselves from data gathering partly because so many users disregard all the terms and warnings and scrape despite the restrictions.
The arguments of web scraping supporters are logically strong. They might say that they weren’t using this technique for illegal purposes, or that they’re using the gathered data for personal needs. They might say that they’re not guilty at all because the script was gathering the data, not them personally. All of these claims may be true and could lead a court to find the person innocent.
However, because regulations are not consistent across jurisdictions, all these arguments sit in a grey area, meaning that a good lawyer could argue either that the accused is innocent or that they are guilty. So it’s better to avoid giving anyone grounds for charges in the first place.
How to stay safe during web scraping?
Ideally, you want to ask the website owner for written consent to use their site as a data source, although that’s sometimes difficult to obtain. So here is what you can do:
- If there is an API for data collection, use it
- Respect the terms of service
- Respect robots.txt
- Make sure that the data is not copyrighted
- Don’t publish gathered information without the permission of its owner
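As one illustration of the robots.txt rule above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `example-bot` user agent, and the `example.com` URLs are all assumptions made up for this example; a real scraper would load the file from the target site (for instance with `set_url()` and `read()`) before fetching anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own
# at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_rules(robots_txt):
    """Parse robots.txt text into a queryable rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


AGENT = "example-bot"  # hypothetical user agent name
rules = build_rules(ROBOTS_TXT)

# Ask before fetching: is this URL allowed for our user agent?
allowed_public = rules.can_fetch(AGENT, "https://example.com/articles/1")
allowed_private = rules.can_fetch(AGENT, "https://example.com/private/data")

# Seconds the site asks bots to wait between requests (None if unset).
delay = rules.crawl_delay(AGENT)
```

A scraper built along these lines would skip any URL for which `can_fetch` returns `False` and pause for `delay` seconds between requests. Note that this covers only robots.txt; the terms of service and copyright still need to be checked separately.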
With these simple rules, you can reduce the risk of being charged with violating someone’s privacy or intellectual property rights. Remember that it’s important to respect the rights of other parties online. We often forget that because there is no face-to-face communication during data gathering. We encourage you to follow our advice and avoid web scraping abuse.