Honeypots: What Are They? Avoiding Them in Data Gathering

Honeypots may pose a serious threat to your data collection capabilities: They can detect web crawlers and block them. In this article, we're exploring how they work and how to avoid them.

Pavlo Zinkovski
Article content
  1. What is a honeypot?
  2. Types of honeypots
  3. What are honeypots used for?
  4. How to avoid honeypots during web scraping?
  5. Frequently Asked Questions

Honeypots are an interesting tool: They were initially created to protect servers from hacker attacks, but as bots and data gathering became more widely used, honeypots gained another purpose: protecting websites from web scraping. However, every countermeasure invites a workaround. Just as website owners turned to honeypots to address web scraping, data gatherers have found ways to avoid honeypots.

In this article, we will tell you all you need to know about this tool and how to avoid honeypots while gathering data.

What is a honeypot?

A honeypot is a decoy created to look like a vulnerable system that seems like an easy target for malefactors. Honeypots distract hackers, thus protecting the real target. Cybersecurity specialists also use this tool to study the activity of cybercriminals and to find and fix vulnerabilities.

Types of honeypots

Depending on the purpose and the scale, honeypots fall into numerous groups. To avoid overloading you with unnecessary information, we will focus just on the main types and those relevant to data gathering. By scale and complexity, honeypots can be pure, low-interaction, and high-interaction.

Pure honeypots

This type covers full-scale systems that appear to hackers as a place where they can get valuable and confidential information. Such a honeypot tracks the actions of malefactors using a bug tap installed on the honeypot's link to the network. These complex systems take a lot of effort and time to establish, but in exchange they yield important data that helps cybersecurity specialists protect real systems better.

Low-interaction honeypots

These are decoys for the most common targets of attackers. Low-interaction honeypots are thus much simpler and easier to establish and manage. They gather data on the kinds of attacks malefactors perform and where those attacks originate. These honeypots usually serve as early detection mechanisms.

High-interaction honeypots

These are the most complex decoys that offer malefactors numerous targets of different kinds. High-interaction honeypots unite several services and therefore are rather resource-demanding and expensive. But the reward is large, too: Such honeypots provide researchers with huge amounts of information on malicious activity and make sure that hackers don’t get their hands on the real systems.

Honeynets

There is another type that deserves a separate explanation — a honeynet. It’s a network of honeypots that are used to monitor large-scale systems that require more than one honeypot. Honeynets have their own firewalls that monitor all the incoming traffic and lead it to honeypots. This decoy network gathers data about malicious activity while protecting the real network, too.

Researchers use honeynets to study DDoS and ransomware attacks, and cybersecurity specialists use them to protect corporate networks because a honeynet contains all incoming and outgoing traffic.

What are honeypots used for?

We’ve determined two main purposes of this tool — research and protection. But let’s look at these purposes in more detail to gain a better understanding of honeypots and the way they work. Of course, we won’t list all the purposes here — it would take us quite a while. We will focus the attention on the most popular uses.

Spam detection

Since honeypots gather all the data about a user who enters them, they reveal the user's IP address. Numerous, extremely frequent requests coming from a single IP address are a typical sign of spamming or other automated activity. Once the IP is known, honeypot owners can ban this user from the real system, and possibly from multiple others if the data gets shared between several systems. So there goes our data gathering goal if we don't hide our IP addresses.
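To make the idea of "too many requests from one IP" concrete, here is a minimal sketch of how a honeypot owner might flag an address with a sliding time window. The class name `RequestRateMonitor` and both thresholds are our own illustrative assumptions, not a description of any particular product.

```python
from collections import defaultdict, deque


class RequestRateMonitor:
    """Flags IPs that send too many requests within a sliding time window."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 10.0):
        # Both thresholds are illustrative assumptions, not recommended values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests
        self.flagged = set()

    def record(self, ip: str, timestamp: float) -> bool:
        """Record one request; return True if the IP is flagged."""
        hits = self._hits[ip]
        hits.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        if len(hits) > self.max_requests:
            self.flagged.add(ip)
        return ip in self.flagged
```

Note that once an address is flagged it stays flagged, which mirrors the ban described above. A crawler that spreads its requests over time or over many addresses never crosses the threshold.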

Protection from SQL injections

Such an attack is difficult to detect and often firewalls fail to catch it. Honeypots can successfully attract malefactors with an SQL injection in mind and divert them from the real database they were actually aiming for.

Protection from malware

Honeypots can also be used to lure malware into a trap. They use known attack vectors and replication to make malware infect an emulated system instead of a real one. Cybersecurity specialists can then study these samples and improve the antivirus software they're working on.

Searching for malicious servers

There are client honeypots that play the role of a client. They look for malicious servers that harm clients and interact with them. Researchers use such honeypots to understand how malicious servers work and how they modify the clients they attack.

Anti-crawler

Similar to anti-spam honeypots, there are anti-crawler honeypots. They exist to protect websites from data theft. However, there is a downside: they can't tell malicious crawlers from lawful ones. So even if you gather only publicly available information for legal purposes, you can still be affected by honeypots.
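As a rough illustration of why such a trap cannot tell crawlers apart, consider the sketch below. It assumes the site links to a trap URL that human visitors never see, and blocks any address that requests it. The class `TrapLinkDetector` and the path `/special-offers-archive` are hypothetical names chosen for this example.

```python
class TrapLinkDetector:
    """Blocks any client that requests a URL reachable only via a hidden link."""

    def __init__(self, trap_paths):
        # Hypothetical trap URLs; a real site would choose its own.
        self.trap_paths = set(trap_paths)
        self.blocked_ips = set()

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status code the site would answer with."""
        if ip in self.blocked_ips:
            return 403
        if path in self.trap_paths:
            # The only input is "someone followed the hidden link":
            # the detector knows nothing about the crawler's intent.
            self.blocked_ips.add(ip)
            return 403
        return 200
```

The decision depends only on which URL was requested, so a lawful crawler that follows every link is blocked exactly like a malicious one.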

How to avoid honeypots during web scraping?

One effective solution is to change your IP address with each request, which gives you a much lower risk of getting blocked. You can do that with residential proxies: These are IP addresses of real existing devices, so honeypots are far less likely to classify your crawler as a bot. The request from your crawler is sent to one of these devices and only then to the target server. Thus, the target server sees the IP address of a proxy, which makes each request look like it comes from a unique user.
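Per-request rotation can be sketched in a few lines. The helper below is a minimal example under our own assumptions: the function name `rotating_proxies` and the proxy URLs are placeholders, and the mapping it yields has the shape that the `requests` library accepts in its `proxies=` argument.

```python
from itertools import cycle
from typing import Dict, Iterable, Iterator


def rotating_proxies(proxy_urls: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Return an endless iterator that yields one proxy mapping per request."""
    pool = list(proxy_urls)
    if not pool:
        raise ValueError("proxy pool is empty")
    # Round-robin over the pool so consecutive requests use different IPs.
    return ({"http": url, "https": url} for url in cycle(pool))


# Usage sketch (placeholder proxy URLs; requires the third-party `requests` package):
#
#   import requests
#   proxies = rotating_proxies([
#       "http://user:pass@proxy1.example.com:8080",
#       "http://user:pass@proxy2.example.com:8080",
#   ])
#   for url in urls_to_scrape:
#       response = requests.get(url, proxies=next(proxies), timeout=10)
```

Round-robin is the simplest policy. Many proxy providers also offer a single gateway address that rotates the exit IP on their side, in which case no client-side rotation is needed.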

Another thing you need to know is that some honeypot links carry the CSS style display:none. That could be a way for you to detect a honeypot. Other honeypots blend links in with the background color, so make sure your crawler follows only those links that are properly visible.
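A crawler can apply the display:none check before it follows any link. The sketch below uses only Python's standard `html.parser` module and skips links that are hidden through an inline style or the `hidden` attribute, either on the link itself or on an enclosing element. The names `VisibleLinkParser` and `extract_visible_links` are our own. It is deliberately limited: it does not evaluate external stylesheets or compare link colors with the background, which would require a rendering engine.

```python
import re
from html.parser import HTMLParser

# Inline styles that hide an element from human visitors.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE)
# Elements without a closing tag; they must not be pushed onto the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleLinkParser(HTMLParser):
    """Collects hrefs of links that are not hidden via inline CSS or `hidden`."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self.skipped_links = []
        self._stack = []  # (tag, is_hidden) for every currently open element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        hidden = bool(HIDDEN_STYLE.search(attr.get("style") or "")) or "hidden" in attr
        if tag == "a" and attr.get("href"):
            inside_hidden = any(is_hidden for _, is_hidden in self._stack)
            if hidden or inside_hidden:
                self.skipped_links.append(attr["href"])
            else:
                self.visible_links.append(attr["href"])
        if tag not in VOID_TAGS:
            self._stack.append((tag, hidden))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def extract_visible_links(html_text: str):
    """Return the hrefs a crawler may follow, in document order."""
    parser = VisibleLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.visible_links
```

The parser also keeps the rejected URLs in `skipped_links`, which is useful for logging how many suspected traps a site contains.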

And finally, follow the rules of successful web scraping, such as not making requests too frequently and using different headers for your requests. All these methods will make your crawler appear as a real user, not a bot, allowing you to gather the data you need.
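Both rules are easy to build into a crawler. The following sketch picks a random User-Agent for each request and waits a randomized interval between requests. The User-Agent strings, the delay values, and the function names are illustrative assumptions; suitable values depend on the target site.

```python
import random
import time
from typing import Dict

# Example User-Agent strings (illustrative; keep your own list up to date).
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
)


def random_headers(rng=random) -> Dict[str, str]:
    """Build request headers with a randomly chosen User-Agent."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


def polite_delay(base_seconds: float = 2.0, jitter_seconds: float = 1.5, rng=random) -> float:
    """Return a delay of base_seconds plus a random jitter, in seconds."""
    return base_seconds + rng.uniform(0.0, jitter_seconds)


def wait_before_next_request(base_seconds: float = 2.0, jitter_seconds: float = 1.5) -> float:
    """Sleep for a randomized interval and return how long we waited."""
    delay = polite_delay(base_seconds, jitter_seconds)
    time.sleep(delay)
    return delay
```

The jitter matters as much as the base delay: requests that arrive at perfectly regular intervals are themselves a sign of automation.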

Frequently Asked Questions

Are honeypots legal?

It depends on the country and state. In the United States, for example, there are no specific laws that prohibit the use of honeypots; however, because honeypots record the activity of the people who interact with them, they may be subject to privacy and electronic surveillance laws. Some countries have more specific laws around honeypots, and using them without proper authorization may be illegal.

Are honeypots still used?

Honeypots are still used by many companies and organizations as a part of their security strategy. Organizations have been using honeypots for many years now, and the tools keep evolving as new threats emerge. They can be very effective at deterring and detecting attacks, but they need to be configured properly and monitored closely to work well.

Are honeypots a security risk?

Honeypots can be a security risk if they're not properly configured. They are meant to lure attackers into revealing their identity and methods, but this requires a properly secured and isolated environment; otherwise, the attackers may detect the honeypot or use it as a stepping stone into real systems.

Is a honeypot entrapment?

A honeypot is usually not considered entrapment. Entrapment means that law enforcement induces someone to commit a crime they would not otherwise have committed. A honeypot induces no one: it simply waits, and the hacker decides on their own to break in and try to get information.

Why are honeypots useful in network security?

Honeypots are particularly useful in network security because they can serve as decoys, or bait, for attackers. By luring attackers to honeypots, we can learn about their tactics, techniques, and procedures (TTPs). This information can then be used to better secure our network. Honeypots can also help us discover zero-day exploits and track the spread of malicious code across the internet.

As Infatica's CTO & CEO, Pavlo shares his knowledge of the technical fundamentals of proxies.
