Data Gathering Issues: How to Deal with CAPTCHAs?

CAPTCHA is a powerful tool for distinguishing between human and bot traffic. How does it work — and is it possible to circumvent it? Let's find out.

Article content
  1. What is CAPTCHA?
  2. What is reCAPTCHA?
  3. How does AI help with CAPTCHAs?

CAPTCHAs are annoying: they make us doubt our ability to read symbols or find a bus in a picture every time they ask us to do so. In trying to catch every robot, CAPTCHAs have turned into full-blown riddles that we often have to rack our brains over. Yet while we humans can still deal with CAPTCHAs, bots have a much harder time cracking them, and that becomes a significant issue when we’re trying to gather data from the internet.

That’s where artificial intelligence and machine learning come to the rescue. We can add handy plugins to our scrapers to teach them to recognize and solve CAPTCHAs. In this article, we will dive deeper into this topic.

What is CAPTCHA?

Here's what the word "CAPTCHA" stands for:

Completely Automated Public Turing Test to Tell Computers and Humans Apart

And you thought the abbreviation was long. The purpose of this tool is clear: it helps web services distinguish between humans and robots. The roots of CAPTCHAs go back to 1950, when Alan Turing proposed what became known as the Turing Test, which was meant to determine whether a computer could behave in a way indistinguishable from a human.

The test was quite straightforward: an interrogator put questions to two participants without knowing which was which. One participant was a human and the other was a computer. The interrogator had to guess which one was the machine. If they couldn’t, the computer was considered the winner.

Communication between the test participants

Later, developers built CAPTCHAs on the idea behind Turing’s test. The tool became necessary because people started using bots for harmful purposes: to perform DDoS attacks, send spam, or buy tickets for a popular show in bulk and resell them at a higher price.

In most cases, CAPTCHAs look like distorted letters and numbers that are fairly easy for a human to read but hard for a robot to decipher. There are also tests that ask us to find certain things in photos. We will get to the different kinds of CAPTCHAs a bit later.

An example of a CAPTCHA image

The logic of this tool is simple: humans can generalize. If we know that we can sit on a chair, we aren’t confused about what to do with a couch when we see one; we know we can sit on it, too. Computers, however, struggle with this. Another human ability CAPTCHAs rely on is detecting patterns where none really exist. For example, we can see familiar shapes in fairly shapeless things such as clouds and large spots, whereas robots have traditionally struggled to do that.

What is reCAPTCHA?

reCAPTCHA is a service provided by Google. It does the same job as a regular CAPTCHA, and it’s free. The first test of this tool is simple — it just asks you to tick a box to confirm you’re a human. Then, if the system is still suspicious of you, it will ask you to do something else to prove you’re not a bot.

Most of the CAPTCHAs we run into are reCAPTCHAs. The service keeps getting more complex as bots learn to solve its tasks. Let’s look at the different types of reCAPTCHA.

Images

CAPTCHA asks to select certain objects

We see this type often. It shows us 9 or 16 square images and asks us to find certain objects in them, so we start looking for planes, buses, bikes, and other things. The answer treated as correct is the one submitted by most users who solved the same test. That’s why reCAPTCHA sometimes rejects a response even when we answer correctly.
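The majority rule described above can be sketched in a few lines of Python. This is a toy model of the idea only, not reCAPTCHA's actual scoring, which is not public. The function names, the 50% agreement threshold, and the sample responses are all assumptions made for illustration.

```python
from collections import Counter


def consensus_tiles(responses, min_agreement=0.5):
    """Return the tiles picked by more than `min_agreement` of past users.

    `responses` is a list of sets; each set holds the tile indices one
    user selected. The threshold is a hypothetical value.
    """
    if not responses:
        return set()
    # Count how many users selected each tile.
    counts = Counter(tile for selected in responses for tile in set(selected))
    threshold = min_agreement * len(responses)
    return {tile for tile, n in counts.items() if n > threshold}


def matches_consensus(answer, responses):
    """Accept an answer only if it equals the majority selection."""
    return set(answer) == consensus_tiles(responses)


# Hypothetical data: five users marking tiles that contain a bus.
past_responses = [{0, 1, 4}, {0, 1, 4}, {0, 1}, {0, 1, 4, 5}, {1, 4}]

accepted = consensus_tiles(past_responses)
# A careful user who also marks tile 5 (a sliver of the bus) is rejected,
# because only one of the five earlier users selected that tile.
careful_user_passes = matches_consensus({0, 1, 4, 5}, past_responses)
```

Under this model a response is judged against what other people clicked rather than against ground truth, which is one way a technically correct answer can still be marked wrong.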

A simple checkbox

CAPTCHA asks to check the box

This must be our favorite one: when a CAPTCHA asks us to just check a box to prove we are human, it’s an easy thing to do. Well, easy for us humans. Simple bots tend to fail this test. But why? It’s so simple!

Well, the real test doesn’t boil down to reading some text and ticking a box. It’s in the way we move the mouse. A human can never move the cursor in a perfectly straight line because our hands are always a bit shaky, and a naive bot doesn’t reproduce this pattern. In addition, these reCAPTCHAs might look at the HTTP cookies that become available to the destination server when you try to enter a website.
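To make the straight-line idea concrete, here is a minimal sketch of how a detector could measure how straight a cursor path is. It is not how reCAPTCHA works internally; the `straightness` metric, the `looks_scripted` helper, the 0.999 threshold, and the sample paths are assumptions chosen for illustration.

```python
import math


def straightness(points):
    """Ratio of start-to-end distance to total path length.

    A value of 1.0 means the cursor travelled in a perfectly straight
    line; shakier paths give lower values.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    if path_length == 0:
        return 1.0  # the cursor never moved
    return math.dist(points[0], points[-1]) / path_length


def looks_scripted(points, threshold=0.999):
    """Flag paths that are suspiciously close to a straight line."""
    return straightness(points) >= threshold


# Hypothetical cursor samples as (x, y) pixel coordinates.
robot_path = [(0, 0), (50, 50), (100, 100)]
human_path = [(0, 0), (22, 31), (48, 44), (71, 79), (100, 100)]

robot_flagged = looks_scripted(robot_path)
human_flagged = looks_scripted(human_path)
```

A real system would presumably combine many more signals, such as timing and speed changes, but even this single ratio separates the two sample paths.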

No actions needed

reCAPTCHA has evolved to the point where it often doesn’t need any test to understand that you’re not a robot. It can analyze a user’s behavior, their history of interactions with other sites, and the data they pass when entering a website. So in many cases reCAPTCHA won’t ask you to perform any additional actions to prove you’re a human, because it will already know that.

How does reCAPTCHA understand that it needs an additional test?

So if this tool has become so smart, why do we still need to solve CAPTCHAs sometimes? It’s unclear what exactly triggers an additional test, but the potential causes are:

  • Cookies,
  • Browser history, and/or
  • Mouse movements.

How does AI help with CAPTCHAs?

First of all, we should note that modern CAPTCHAs are based on artificial intelligence themselves. They learn correct answers by analyzing user responses. So machine learning is at the core of all those riddles.

That’s why it’s rather ironic yet logical that the very same artificial intelligence is used to let bots solve CAPTCHAs. Image recognition helps robots see the same patterns we do. As a result, they can solve many CAPTCHAs without much effort.

Does it mean you can bypass CAPTCHAs?

Yes, by adding an image recognition tool to your scraper, you can keep it from getting blocked by CAPTCHAs and enjoy smooth data gathering. But remember that it’s not just about the puzzles. We’ve mentioned that reCAPTCHAs look at movement patterns and cookies to work out whether they are dealing with a bot or a human. That’s why there are libraries that let you supply a scraper with the data needed to mimic a real request. Modern scrapers can also mimic the behavior of a real user, at least to some extent.
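As a rough illustration of supplying request data, the sketch below uses only Python's standard library to attach browser-like headers and cookies to a request. The header values, the `build_request` helper, the URL, and the cookie are hypothetical examples; the article does not name a specific library, and dedicated scraping tools usually handle this for you.

```python
from urllib.request import Request

# Illustrative header values; a real browser sends its own set.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, cookies=None):
    """Build (but do not send) a request that carries browser-like data."""
    headers = dict(BROWSER_HEADERS)
    if cookies:
        # Cookies travel in a single header as "name=value; name=value".
        headers["Cookie"] = "; ".join(f"{k}={v}" for k, v in cookies.items())
    return Request(url, headers=headers)


# Hypothetical target and session cookie.
request = build_request(
    "https://example.com/products", cookies={"session": "abc123"}
)
```

Nothing is sent here. Passing `request` to `urllib.request.urlopen` would perform the actual fetch.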

Yet if you run your scraper without proxies, even the fanciest add-ons won’t do much, because all requests will come from the same IP address, and that’s a clear indication of robot activity. To address this problem, use residential Infatica proxies to automate IP switching and make sure websites don’t become suspicious of your scraper.
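IP switching can be pictured as a simple round-robin over a pool of proxies. The sketch below is a minimal, provider-agnostic illustration: the `ProxyRotator` class is hypothetical, and the addresses are placeholders from a range reserved for documentation, not real proxy endpoints.

```python
from itertools import cycle


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxies):
        proxies = list(proxies)
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(proxies)

    def next_proxy(self):
        """Return the next proxy as a scheme-to-address mapping."""
        address = next(self._pool)
        return {"http": address, "https": address}


# Placeholder addresses; replace with the endpoints your provider gives you.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

rotator = ProxyRotator(PROXIES)
# Each request gets a different exit address until the pool wraps around.
assigned = [rotator.next_proxy()["http"] for _ in range(4)]
```

The mapping returned by `next_proxy` has the shape that `urllib.request.ProxyHandler` and the `proxies` argument of the `requests` library accept, so it can be passed to either for each new request.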

👷‍♂️ Further reading: How Residential Proxies Simplify Data Gathering for Price Aggregators


Olga Myhajlovska

Olga Myhajlovska is a freelance writer who likes to focus on the practical side of different technologies: Her stories answer both the "How does it work?" and "Why is it important for me?" questions.
