Optical Character Recognition (OCR) software converts images into text data. Installing Tesseract on Debian Wheezy is simple.
Command line OCR tool
The Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google and is probably one of the most accurate open source OCR engines available. It can read a wide variety of image formats and convert them to text in over 40 languages.
$ sudo aptitude install tesseract-ocr
The following NEW packages will be installed:
liblept3{a} libtesseract3{a} tesseract-ocr tesseract-ocr-eng{a} tesseract-ocr-equ{a}
tesseract-ocr-osd{a}
0 packages upgraded, 6 newly installed, 0 to remove and 510 not upgraded.
Need to get 29.4 MB of archives. After unpacking 79.9 MB will be used.
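Once the packages are in place, a quick sanity check can confirm that the binary and its language data are usable. The following is only a sketch: the helper name check_tesseract is made up for illustration, and the --list-langs option assumes Tesseract 3.02 or newer.

```shell
#!/bin/sh
# check_tesseract: hypothetical helper, not part of the tesseract-ocr package.
# Reports the installed version and language data if tesseract is on PATH.
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 1
    fi
    # Some versions print these to stderr, so merge both streams.
    tesseract -v 2>&1
    # Assumption: --list-langs is available (Tesseract 3.02 or newer).
    tesseract --list-langs 2>&1
}
```

Calling check_tesseract after the installation above should list the language data that was pulled in (eng, equ and osd). Other languages are selected at run time with the -l option, provided the matching language package is installed.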
Use ImageMagick to convert a PNG or JPG file to uncompressed TIFF format:
$ convert sample2.png -auto-level -compress none sample2.tiff
Run tesseract to convert the image into text data. The second argument is the output base name, so the result is written to sample2.txt:
$ tesseract sample2.tiff sample2
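When there are several images to process, the two commands above can be wrapped in a small shell function. This is a minimal sketch, not a polished tool: the names ocr_base and ocr_image are hypothetical, and the output files are assumed to go into the current directory.

```shell
#!/bin/sh
# ocr_base: strip the directory and the extension from an image path.
# The result is used as the output base name passed to tesseract.
ocr_base() {
    name=$(basename "$1")
    echo "${name%.*}"
}

# ocr_image: hypothetical wrapper around the convert and tesseract steps.
# Writes <base>.tiff and <base>.txt into the current directory.
ocr_image() {
    base=$(ocr_base "$1")
    convert "$1" -auto-level -compress none "$base.tiff" &&
        tesseract "$base.tiff" "$base"
}
```

For example, running `for f in *.png; do ocr_image "$f"; done` in a directory of PNG scans would leave one .txt file next to each intermediate .tiff file.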
The text data:
1. Alinonrls
There are a lot of health benefits associated with almonds.
Almonds are very high in vitamin E and protein as well as
other nutrients such as maglesinm and pllospllorns. Almonds
contain anLi-cancer properties as well. Whether almonds are
best raw or pasteurized is still a soI|rce of heated debate.
This page, http://www.win.tue.nl/~aeb/linux/ocr/tesseract.html, provides a thorough explanation of how to use Tesseract on GNU/Linux.