Honeypots are a very interesting tool: They were initially created to protect servers from hacker attacks, but as bots and data gathering became more and more widely used, honeypots gained another purpose — to protect websites from web scraping. However, as we all know, every issue finds its solution. And just like the web scraping problem was addressed by honeypots, we as data gatherers found a solution to honeypots.
In this article, we will tell you all you need to know about this tool and how to avoid honeypots while gathering data.
What is a honeypot?
A honeypot is a decoy created to look like a compromised system that will seem like an easy target for malefactors. Thanks to honeypots, it’s easy to distract hackers, thus protecting the real possible target. Also, cybersecurity specialists use this tool to study the activity of cybercriminals as well as find and solve vulnerabilities.
Types of honeypots
Depending on the purpose and the scale, honeypots fall into numerous groups. To avoid overloading you with unnecessary information, we will focus just on the main types and those relevant to data gathering. By scale and complexity, honeypots can be pure, low-interaction, and high-interaction.
Pure honeypots
Under this type fall all full-scale systems that appear to hackers as a place where they can get some valuable and confidential information. Such a honeypot would track the actions of malefactors using a bug tap. These complex systems require a lot of effort and time to establish but in exchange, they bring lots of extremely important and useful data that can help cybersecurity specialists to protect real systems better.
Low-interaction honeypots
These are decoys for the most common targets of attackers. Low-interaction honeypots are thus much simpler and easier to establish and manage. They would gather data on the kinds of attacks malefactors perform and their origin. Usually, these honeypots become early detection mechanisms.
High-interaction honeypots
These are the most complex decoys that offer malefactors numerous targets of different kinds. High-interaction honeypots unite several services and therefore are rather resource-demanding and expensive. But the reward is large, too: Such honeypots provide researchers with huge amounts of information on malicious activity and make sure that hackers don’t get their hands on the real systems.
Honeynets
There is another type that deserves a separate explanation — a honeynet. It’s a network of honeypots that are used to monitor large-scale systems that require more than one honeypot. Honeynets have their own firewalls that monitor all the incoming traffic and lead it to honeypots. This decoy network gathers data about malicious activity while protecting the real network, too.
Researchers use honeynets to study DDoS and ransomware attacks, and cybersecurity specialists use them to protect corporate networks, since a honeynet contains all the incoming and outgoing traffic.
What are honeypots used for?
We’ve determined two main purposes of this tool — research and protection. But let’s look at these purposes in more detail to gain a better understanding of honeypots and the way they work. Of course, we won’t list all the purposes here — it would take us quite a while. We will focus the attention on the most popular uses.
Spam detection
Since a honeypot logs data about every user who enters it, it reveals that user’s IP address. And as you might know, many very frequent requests coming from a single IP address are a typical sign of spamming. Once the IP is known, honeypot owners can ban this user from the real system, and possibly from several others if the data is shared between systems. So there goes our data gathering goal if we don’t hide our IP addresses.
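To make the idea of "too many requests from one IP" more concrete, here is a minimal sketch of how such a check could work on the honeypot side. It is only an illustration: the class name `RequestRateMonitor`, the thresholds, and the IP addresses are our assumptions, and real systems likely use more signals than a simple counter.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Flags an IP once it exceeds max_requests within window_seconds.

    Hypothetical helper for illustration; thresholds are arbitrary.
    """

    def __init__(self, max_requests=5, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip, timestamp):
        """Record one request; return True if the IP should be flagged."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


monitor = RequestRateMonitor(max_requests=3, window_seconds=10.0)
# Five requests from one address within four seconds
# (203.0.113.7 is a documentation-only IP).
flags = [monitor.record("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
print(flags)
```

With these settings, the first three requests pass and the fourth and fifth are flagged. A crawler that sends every request from the same address runs into exactly this kind of counter.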
Protection from SQL injections
Such an attack is difficult to detect and often firewalls fail to catch it. Honeypots can successfully attract malefactors with an SQL injection in mind and divert them from the real database they were actually aiming for.
Protection from malware
Honeypots can also be used to lure malware into a trap. They rely on known attack vectors and replication to make malware infect an emulated system instead of a real one. Cybersecurity specialists can then study the captured samples and improve the antivirus software they’re working on.
Searching for malicious servers
There are client honeypots that play the role of a client. They look for malicious servers that harm clients and interact with them. Researchers use such honeypots to understand how malicious servers work and how they modify the clients they attack.
Anti-crawler
Similar to anti-spam honeypots, there are anti-crawler honeypots. They exist to protect websites from data theft. However, there is a downside: They can’t tell malicious crawlers from lawful ones. So even if you gather only widely available information for legal needs, you will still be affected by honeypots.
How to avoid honeypots during web scraping?
The most direct solution is to change your IP address with each request, which greatly lowers the risk of getting blocked. You can do that with residential proxies: These are IP addresses of real existing devices, so your crawler is far less likely to be flagged as a bot. The request from your crawler is sent to one of these devices and only then to the target server. Thus, the target server sees the proxy’s IP address and treats each request as coming from a different user.
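Changing the IP address with each request can be sketched as a simple round-robin rotation. The example below uses only Python’s standard library. The proxy URLs are hypothetical placeholders, and the helper names `next_proxy`, `build_opener_for`, and `fetch` are ours. Some proxy providers rotate addresses for you behind a single gateway endpoint, in which case this loop is unnecessary.

```python
import itertools
import urllib.request

# Hypothetical proxy endpoints; replace with the addresses your provider gives you.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]

_proxy_cycle = itertools.cycle(PROXIES)


def next_proxy():
    """Return the next proxy in round-robin order."""
    return next(_proxy_cycle)


def build_opener_for(proxy_url):
    """Create an opener that routes HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(url, timeout=10):
    """Fetch url through the next proxy in the rotation."""
    opener = build_opener_for(next_proxy())
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Each call to `fetch` goes out through a different proxy, so consecutive requests reach the target server from different addresses.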
Another thing you need to know is that some honeypot links carry the CSS style display:none. That could be a way for you to detect a honeypot. Other honeypots blend links in with the background color, so make sure your crawler follows only those links that are properly visible.
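As a rough illustration of the display:none check, the sketch below parses a page with Python’s built-in `html.parser` and separates links hidden by an inline style from the rest. The class name `VisibleLinkParser` and the sample HTML are our own. Note the limits of this approach: it only inspects the inline `style` attribute of the link itself, so it will not catch links hidden through an external stylesheet, a hidden parent element, or a matching background color. Those cases generally need a tool that actually renders the page.

```python
from html.parser import HTMLParser


class VisibleLinkParser(HTMLParser):
    """Collects href values, separating links hidden with inline display:none."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" and "DISPLAY:NONE" both match.
        style = (attributes.get("style") or "").replace(" ", "").lower()
        if "display:none" in style:
            self.hidden.append(href)
        else:
            self.visible.append(href)


html = """
<a href="/products">Products</a>
<a href="/trap" style="display: none">Do not follow</a>
<a href="/about" style="color: #333">About</a>
"""

parser = VisibleLinkParser()
parser.feed(html)
print(parser.visible)  # links the crawler may follow
print(parser.hidden)   # likely honeypot links to skip
```

Here the crawler would follow `/products` and `/about` and skip `/trap`.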
And finally, follow the rules of successful web scraping, such as not making requests too frequently and using different headers for your requests. All these methods will make your crawler appear as a real user, not a bot, allowing you to gather all the needed data.
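The two rules above, spacing out requests and varying headers, can be sketched as follows. Everything here is an assumption for illustration: the User-Agent strings, the 2 to 5 second delay range, and the helper names. The `fetch` argument is a placeholder for whatever function you use to download a page; it receives the URL and the headers chosen for that request.

```python
import random
import time

# Illustrative header values; a real crawler should use current browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]
ACCEPT_LANGUAGES = ["en-US,en;q=0.9", "en-GB,en;q=0.8", "de-DE,de;q=0.9,en;q=0.7"]


def random_headers():
    """Pick a random combination of headers for one request."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": random.choice(ACCEPT_LANGUAGES),
    }


def polite_delay(min_seconds=2.0, max_seconds=5.0):
    """Return a random pause length so requests do not arrive at a fixed rhythm."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each URL with fresh headers, pausing between requests.

    fetch(url, headers) is supplied by the caller; sleep can be swapped out in tests.
    """
    results = []
    for url in urls:
        results.append(fetch(url, random_headers()))
        sleep(polite_delay())
    return results
```

Randomizing the delay, rather than waiting a fixed number of seconds, avoids the perfectly regular timing that is itself a sign of automation.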