Stopping spam and digitizing books, word by word


With most folks focused on Facebook apps or the latest YouTube craze, a lot of incredible projects go unnoticed for way too long. Several months ago someone on irc.freenode mentioned reCAPTCHA to me, an amazingly novel idea that keeps spammers out while digitizing literary works.

As CAPTCHAs become a bigger part of our sign-in and sign-up activities on the net, we spend more and more time typing in those warped letters presented on login and comment forms. A couple of bright individuals had the brilliant idea of building a CAPTCHA system that channels those entries into digitizing old books in the Internet Archive, a collection currently numbering over 200,000.

The system is engineered to focus on words that prove difficult for OCR engines to properly identify, as shown in the example below.

Words that cannot be recognized by computers are fed into the reCAPTCHA database. The reCAPTCHA Project provides an easy-to-use API for several programming languages that allows developers to integrate reCAPTCHA into their site's anti-spam systems. Plugins also exist for several popular CMSs, including Drupal, WordPress, and Joomla. Try out the example below.

Note: This reCAPTCHA widget did not render in my online news reader. If you'd like to try it out, you may need to visit the post or reCAPTCHA's site.





The process includes error checking: the same word is distributed to multiple people, and the answer entered most often is taken as the correct one. This means little typos are okay when you're trying to log in, and those typos get thrown out of the reCAPTCHA database.
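To make the voting idea concrete, here is a minimal Python sketch of picking the most common answer for a single word. It is only an illustration of the majority-vote idea described above, not reCAPTCHA's actual code: the function name `consensus`, the `min_votes` threshold, and the lowercase normalization are all assumptions of mine.

```python
from collections import Counter


def consensus(entries, min_votes=2):
    """Return the most frequently typed answer for one word, or None.

    Hypothetical helper: `min_votes` and the normalization below are
    assumptions for illustration, not reCAPTCHA's real rules.
    """
    # Normalize so "Morning " and "morning" count as the same answer.
    normalized = [e.strip().lower() for e in entries if e.strip()]
    if not normalized:
        return None

    # Take the two most common answers so ties can be detected.
    (top, top_count), *rest = Counter(normalized).most_common(2)

    # Not enough agreement yet: keep showing the word to more people.
    if top_count < min_votes:
        return None
    # A tie between the top two answers is treated as undecided.
    if rest and rest[0][1] == top_count:
        return None
    return top


print(consensus(["morning", "Morning", "mourning"]))
```

Here the one-off typo "mourning" is simply outvoted, while a word with no clear winner stays undecided until more people have typed it in.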

Learn more about reCAPTCHA and sign up to use it on your site at recaptcha.net.
UPDATE: I'm now drinking reCAPTCHA kool-aid. Just installed the Drupal reCAPTCHA module for this site and my family web site. A nice and easy install.
