How to Read and Solve Captcha Using Python?

You have probably encountered those annoying messages on registration or feedback pages that read, “Enter the letters you see on the image,” or “Select the images with a…” These are known as captchas, and they act as gates that are meant to let humans in and keep bots out.

CAPTCHA stands for “Completely Automated Public Turing Test to Tell Computers and Humans Apart”.

Simply put, they are intended to differentiate between humans and automated users, such as bots. The text is created so that a human can read it without difficulty, whereas a machine cannot.

In practice, however, this rarely holds up: almost every simple text captcha that a site deploys is cracked within a few months.

What are CAPTCHAs used for?

As we have mentioned, sites use CAPTCHAs to restrict bots. But why shouldn’t bots be allowed to access these sites? Here are some more specific uses.

  • CAPTCHAs are used to prevent online poll skewing by ensuring that every single vote is entered by a human. They also help maintain poll accuracy by discouraging multiple voting, since each vote takes longer to cast.
  • Sites also use captchas to prevent bots from accessing registration pages and creating fake accounts. This reduces the wastage of the site’s resources and minimizes any chances of fraud.
  • Ticketing sites use CAPTCHAs to limit scalpers from making false registrations for free events and buying multiple tickets for resale.
  • Most systems require human feedback for all of their contact forms, reviews, and messaging boards. CAPTCHAs prevent false registrations, and hence false comments and online harassment.

How Do CAPTCHAs Hinder Web Scraping?

Most websites have automatic captchas, which are triggered when the site detects unusual activity that resembles bot behavior. Examples include sending many requests within a fraction of a second and clicking on links at a far higher rate than a human would.
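Since request rate is one of the triggers described above, a scraper can space its requests out. The sketch below is one way to do that with the standard library; the `pause_between_requests` helper and its default bounds are illustrative assumptions, not values that any particular site documents.

```python
import random
import time


def pause_between_requests(min_seconds=2.0, max_seconds=6.0, sleep=time.sleep):
    """Sleep for a random interval and return the number of seconds waited.

    The default bounds are illustrative assumptions; tune them per site.
    The sleep function is a parameter so it can be swapped out in tests.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


if __name__ == "__main__":
    for page in range(1, 4):
        # A real scraper would fetch the page here before pausing.
        waited = pause_between_requests(0.1, 0.3)
        print(f"page {page}: waited {waited:.2f}s")
```

A random interval is used rather than a fixed one so that the requests do not arrive at a perfectly regular rhythm.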

Captchas can be a major impediment during web scraping, because most scraping is carried out by automated bots. However, this should not worry you.

There are several ways to overcome captchas when scraping the web. One way is to use Python programming by writing original code from scratch or using available code. However, to avoid too many inconveniences, you can also opt for an automatic site unblocker to help you dodge captchas successfully.

Decoding Image Captchas Using Python

The most common captcha is the image code captcha, which contains distorted letters that a computer program cannot detect easily, but a human can somehow manage to understand. When web-scraping, you can extract the letters from the image using Python. Here’s how.

Once you have the captcha image in a usable format, you can apply Optical Character Recognition (OCR) to extract the text from it.

One option is Tesseract, an open-source OCR engine, which you can call from Python through the pytesseract wrapper to recognize and “read” the text embedded in the image. The wrapper can be installed using the pip command below. Note that the Tesseract engine itself is a separate program and has to be installed on your system as well.

pip install pytesseract
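Because pytesseract calls the external Tesseract program, it can help to confirm that the program is on your PATH before running any OCR. The following is a minimal sketch of such a check; the helper name `tesseract_available` is hypothetical, and `tesseract` is assumed to be the name of the executable on your system.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the named executable can be found on the PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found at", shutil.which("tesseract"))
    else:
        print("Tesseract not found; install it with your system package manager.")
```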

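The next step is to extend the Python script that loaded the captcha image. Here, get_captcha(html) is the helper from that earlier script, which is not shown in this section; it is expected to return a Pillow (PIL) image. The extended script converts the image to grayscale, thresholds it to pure black and white, and passes the result to Tesseract, as follows.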

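import pytesseract

# get_captcha() and html come from the earlier script that loaded the captcha;
# get_captcha() is expected to return a Pillow (PIL) image.
img = get_captcha(html)
img.save('captcha_original.png')

# Convert to grayscale (mode 'L').
gray = img.convert('L')
gray.save('captcha_gray.png')

# Threshold: keep only pure-black pixels (value 0) and turn everything else white.
bw = gray.point(lambda x: 0 if x < 1 else 255, '1')
bw.save('captcha_thresholded.png')

# The cleaned-up image can now be passed to Tesseract.
text = pytesseract.image_to_string(bw)
print(text.strip())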

When run, the final call returns the text that Tesseract read from the captcha image, which you can then submit along with the form you are trying to access.
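To see what the thresholding step does to individual pixels, the same rule can be applied to a plain list of grayscale values, where 0 is black and 255 is white. This is a small illustration only; the `threshold` helper and the sample row are made up for the example.

```python
def threshold(pixel, cutoff=1):
    """Map a grayscale value to black (0) or white (255)."""
    return 0 if pixel < cutoff else 255


# One row of grayscale pixels: pure-black text mixed with gray noise.
row = [0, 0, 12, 130, 255, 0]

# cutoff=1 matches the script above: only pure black survives.
print([threshold(p) for p in row])       # [0, 0, 255, 255, 255, 0]

# A higher cutoff keeps darker grays as well.
print([threshold(p, 128) for p in row])  # [0, 0, 0, 255, 255, 0]
```

If the captcha text is not drawn in pure black, a cutoff of 1 may erase it along with the noise, so raising the cutoff is a reasonable first thing to try.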

If you are new to web scraping, read frequently asked questions on web scraping.

Extra Tips for Bypassing Captchas

Rotate Proxies

As we mentioned earlier, sending frequent requests and clicking on links continuously are considered bot behaviors and can make websites employ captchas to block access. To reduce the chance of this, rotate proxies so that consecutive requests reach the website from different IP addresses. Clean residential proxies help here, because the site sees the proxy's address rather than your own.
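One simple way to rotate proxies is to cycle through a list and build a new opener for each request. The sketch below uses only the standard library; the proxy addresses are placeholders from a documentation-only IP range, and the helper name `opener_with_next_proxy` is hypothetical.

```python
import itertools
import urllib.request

# Placeholder proxy addresses (documentation-only IP range); replace with your own.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# cycle() yields the proxies in order and starts over after the last one.
proxy_pool = itertools.cycle(PROXIES)


def opener_with_next_proxy(pool=proxy_pool):
    """Build a urllib opener that routes traffic through the next proxy in the pool."""
    proxy = next(pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)


if __name__ == "__main__":
    for _ in range(4):
        proxy, opener = opener_with_next_proxy()
        # opener.open("https://example.com") would send the request via `proxy`.
        print("next request would use", proxy)
```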

Rotate User Agents

Changing your user agent once is not enough to stop a website from restricting access when you send many requests in a short time. Instead, rotate through several user agents so that the target website sees the requests as coming from different devices.
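A minimal sketch of user-agent rotation, again with the standard library only: each request gets a User-Agent header picked at random from a pool. The strings in `USER_AGENTS` are examples, and `build_request` is a hypothetical helper name.

```python
import random
import urllib.request

# Example User-Agent strings; in practice, use current strings from real browsers.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request whose User-Agent header is picked at random from the pool."""
    headers = {"User-Agent": random.choice(user_agents)}
    return urllib.request.Request(url, headers=headers)


if __name__ == "__main__":
    request = build_request("https://example.com")
    # urllib stores header names capitalized as "User-agent".
    print(request.get_header("User-agent"))
```

The returned request can be passed to `urllib.request.urlopen` or to an opener that uses a proxy, so the two rotation techniques can be combined.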

That covers how to solve captchas using Python. If your code still fails to solve a captcha, let’s discuss it in the comments.
