Breaking CAPTCHA, or why you should be using reCAPTCHA v3
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) was created in the early 2000s to verify that actions taken in a browser come from a real user and not from automated programs, i.e. bots.
Implementing Captcha
Early CAPTCHAs render a piece of text and then distort it (e.g. overlaid lines, rotated characters, reduced spacing). The aim is to confound bots while still leaving enough detail for humans to decipher the text.
For Java systems, there is an existing library which does this nicely. You can refer to SimpleCaptcha and its documentation to create a myriad of CAPTCHA images to confound the bots (and possibly the users as well!).
With text-based CAPTCHAs such as these, there is usually a trade-off between how hard the image is for a bot and how readable it remains for a human. It is certainly possible to make the CAPTCHA more complex by adding fisheye and drop-shadow effects plus plenty of noise, but this will sometimes render the image almost indecipherable to humans too.
As a result, most implementations gravitate towards simple distortions such as reducing character spacing, adding a noisy background, or overlaying some elements on top of the characters.
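The simple distortions described above can be sketched in a few lines of plain Java 2D. This is only an illustration under assumptions: it does not use SimpleCaptcha's API, and the class name CaptchaSketch, the image size, the font, and the amount of noise are all made up for the example.

```java
import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.security.SecureRandom;
import java.util.Random;
import javax.imageio.ImageIO;

// Hypothetical sketch of a text CAPTCHA; not SimpleCaptcha's implementation.
class CaptchaSketch {
    // Lowercase letters and digits only (an assumption for this example).
    static final String ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789";

    // Picks `length` random characters from ALPHABET.
    static String randomText(Random rng, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHABET.charAt(rng.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }

    // Draws the text with a noisy background, rotated characters,
    // tight character spacing, and a few overlaid lines.
    static BufferedImage render(String text, Random rng) {
        int width = 40 + 22 * text.length();
        int height = 60;
        BufferedImage img = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = img.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_ANTIALIASING,
                               RenderingHints.VALUE_ANTIALIAS_ON);
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);

            // Noisy background: small gray specks at random positions.
            g.setColor(Color.LIGHT_GRAY);
            for (int i = 0; i < 300; i++) {
                g.fillRect(rng.nextInt(width), rng.nextInt(height), 2, 2);
            }

            // Characters: each rotated by -20..+20 degrees, spaced tightly.
            g.setFont(new Font(Font.SANS_SERIF, Font.BOLD, 34));
            g.setColor(Color.BLACK);
            int x = 20;
            for (char c : text.toCharArray()) {
                double angle = Math.toRadians(rng.nextInt(41) - 20);
                g.rotate(angle, x + 10, 40);
                g.drawString(String.valueOf(c), x, 45);
                g.rotate(-angle, x + 10, 40); // undo the rotation
                x += 22;
            }

            // Overlay: a few lines drawn across the characters.
            g.setColor(Color.DARK_GRAY);
            for (int i = 0; i < 3; i++) {
                g.drawLine(0, rng.nextInt(height), width, rng.nextInt(height));
            }
        } finally {
            g.dispose();
        }
        return img;
    }

    public static void main(String[] args) throws IOException {
        Random rng = new SecureRandom();
        String text = randomText(rng, 5);
        ImageIO.write(render(text, rng), "png", new File("captcha.png"));
        System.out.println("Wrote captcha.png for text: " + text);
    }
}
```

Raising the rotation range or the number of lines makes the image harder for OCR, but also harder for people, which is the trade-off discussed above.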
Breaking SimpleCaptcha
For a start, we will attempt to solve a simple captcha implementation using the following code segment
To solve this, we will be using Tesseract OCR. Tesseract 4.0 introduced a new neural-network (LSTM) based OCR engine, which is the one used in this article.
The source code for this project can be found in my Tesseract-OCR GitHub repo.
1. First, we set up a new console project using .NET Core
2. At the Target Framework screen, make sure you choose .NET 5.0 and then click Create
3. Now install Tesseract into your project using the following NuGet command
Install-Package Tesseract
4. Next, pull down some trained data. The files are in the tessdata folder of my GitHub repo, or you can grab the latest from https://tesseract-ocr.github.io/tessdoc/
For simplicity, a tessdata folder is created in the project, with each file set to “Copy if newer” so that it is copied to the build output folder
5. A Tesseract config.txt file is also created with the following parameters. The two load_*_dawg settings turn off Tesseract's dictionary word lists, which do not help with random character strings. The most important setting is tessedit_char_whitelist, which tells Tesseract to recognize only these characters
load_system_dawg false
load_freq_dawg false
tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz0123456789
6. Now we put them all together in program.cs.
Tesseract OCR in action
Running it gives the following output
The confidence is not very high, so we try again with a larger trained dataset (eng_default)
This is still not good enough, but since most CAPTCHAs allow the image to be refreshed, a bot can simply request a new CAPTCHA and repeat the process until it gets a higher-confidence result.
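The refresh-and-retry idea can be sketched as a small loop. The version below is plain Java with stubs rather than part of the .NET project: the Guess type, the solve method, the confidence scale of 0 to 1, and the stubbed confidence values are all assumptions for illustration, and no real CAPTCHA or OCR engine is involved.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of "refresh until the OCR is confident enough".
class RetryUntilConfident {
    // One OCR attempt: recognized text plus a confidence in [0, 1] (assumed scale).
    static final class Guess {
        final String text;
        final double confidence;

        Guess(String text, double confidence) {
            this.text = text;
            this.confidence = confidence;
        }
    }

    // Fetches a fresh image and runs OCR on it, up to maxAttempts times.
    // Returns the first guess whose confidence reaches minConfidence.
    static Optional<Guess> solve(Supplier<byte[]> fetchCaptcha,
                                 Function<byte[], Guess> recognize,
                                 double minConfidence,
                                 int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            byte[] image = fetchCaptcha.get();      // ask for a new image
            Guess guess = recognize.apply(image);   // run OCR on it
            if (guess.confidence >= minConfidence) {
                return Optional.of(guess);
            }
        }
        return Optional.empty(); // never confident enough
    }

    public static void main(String[] args) {
        // Stubbed OCR results; the confidences are invented for the demo.
        double[] fakeConfidence = {0.41, 0.58, 0.87};
        int[] calls = {0};

        Optional<Guess> result = solve(
                () -> new byte[0],
                image -> new Guess("abc12", fakeConfidence[calls[0]++]),
                0.80,
                5);

        System.out.println(result
                .map(g -> "accepted '" + g.text + "' after " + calls[0] + " attempts")
                .orElse("gave up"));
    }
}
```

The loop costs the bot almost nothing per attempt, which is why a CAPTCHA that only needs to be misread by OCR some of the time offers little protection.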
Concluding thoughts
The code sample provided is intentionally bare. The intent is not to provide a working tool for breaking CAPTCHAs, but rather to show that machine learning and trained models are good enough to crack most traditional CAPTCHA systems without breaking a sweat.
CAPTCHAs are one of the methods recommended by OWASP to prevent automated and/or brute-force attacks (https://owasp.org/www-community/controls/Blocking_Brute_Force_Attacks#sidebar-using-captchas).
With AI models increasingly built into systems, it is time for developers to explore and implement more user-friendly and secure CAPTCHA systems such as reCAPTCHA v3. Such implementations are transparent to users while still deterring most web automation and bot systems.
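On the server side, reCAPTCHA v3 comes down to sending the token collected in the browser to Google's siteverify endpoint and acting on the score that comes back. The sketch below is a minimal illustration in Java 11+, not a production client: the class name, the RECAPTCHA_SECRET environment variable, and the regex-based parsing (used only to avoid a JSON library) are assumptions made for the example.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a server-side reCAPTCHA v3 check.
class RecaptchaV3Check {
    static final URI VERIFY_URI =
            URI.create("https://www.google.com/recaptcha/api/siteverify");

    // Crude field extraction; a real service should use a JSON parser.
    private static final Pattern SUCCESS =
            Pattern.compile("\"success\"\\s*:\\s*(true|false)");
    private static final Pattern SCORE =
            Pattern.compile("\"score\"\\s*:\\s*([0-9]*\\.?[0-9]+)");

    // Builds the form-encoded POST carrying the secret key and the user's token.
    static HttpRequest buildRequest(String secret, String token) {
        String form = "secret=" + URLEncoder.encode(secret, StandardCharsets.UTF_8)
                + "&response=" + URLEncoder.encode(token, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(VERIFY_URI)
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(form))
                .build();
    }

    // True only if the response reports success and a score of at least minScore.
    static boolean isLikelyHuman(String json, double minScore) {
        Matcher success = SUCCESS.matcher(json);
        Matcher score = SCORE.matcher(json);
        if (!success.find() || !"true".equals(success.group(1)) || !score.find()) {
            return false;
        }
        return Double.parseDouble(score.group(1)) >= minScore;
    }

    public static void main(String[] args) throws Exception {
        String secret = System.getenv("RECAPTCHA_SECRET"); // assumed variable name
        if (secret == null || args.length == 0) {
            System.out.println("Set RECAPTCHA_SECRET and pass the client token as an argument.");
            return;
        }
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(secret, args[0]), HttpResponse.BodyHandlers.ofString());
        System.out.println(isLikelyHuman(response.body(), 0.5) ? "allow" : "challenge or block");
    }
}
```

The 0.5 threshold is only a starting point. A fuller check would also compare the action and hostname fields in the response against the values the site expects, and could fall back to a stronger challenge instead of blocking outright when the score is low.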