Optical character recognition (OCR) – translating images to text

Recently I had to translate a large stack of paperwork into text. Instead of doing this manually (by typing in all the content myself), I decided to test out some of the OCR software for Linux.

There are a few out there; the most popular by far is gocr (http://jocr.sourceforge.net/), followed perhaps by tesseract (http://code.google.com/p/tesseract-ocr/).

Scanning the documents

I used xsane to scan the documents, since it seemed like the easiest option. It has an option to save each file automatically after scanning and another to increment the file number, so you can start with page 1 and every following scan is saved and numbered for you. I saved TIFF files rather than JPEG to avoid compression artefacts. Scanning in greyscale at 150 dpi seemed to work well.

Using OCR on the images

I ran both gocr and tesseract on the resulting .tiff files and found that tesseract yielded vastly superior results. I didn’t want to spend time tuning gocr, so I just used tesseract.

I ran a single command to translate all the .tiff images to .txt:

for f in *.tiff; do tesseract "$f" "${f%.tiff}"; done

tesseract appends the .txt extension to the output base name itself, so there’s no need to include the extension in the second argument.

If using gocr, the same command would be:

for f in *.tiff; do gocr -i "$f" -o "${f%.tiff}.txt"; done
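
For a large stack of pages it can help to make the tesseract loop safe to re-run, so that an interrupted batch picks up where it left off. The following is only a minimal sketch of that idea: ocr_dir is a hypothetical helper name of my own, and it assumes the scans sit in one directory with a .tiff extension.

```shell
# Sketch: OCR every .tiff in a directory, skipping pages already converted.
# ocr_dir is a hypothetical helper name, not part of tesseract.
ocr_dir() {
    dir=${1:-.}
    for f in "$dir"/*.tiff; do
        [ -e "$f" ] || continue            # the glob matched nothing
        base=${f%.tiff}                    # scans/page1.tiff -> scans/page1
        [ -e "$base.txt" ] && continue     # output exists, skip this page
        tesseract "$f" "$base" || echo "failed: $f" >&2
    done
}
```

Calling ocr_dir scans would then process everything in a directory named scans. Quoting "$f" keeps file names with spaces intact.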

tesseract and gocr usage and command line options

tesseract usage is:

Usage: tesseract imagename outputbase [-l lang] [configfile [[+|-]varfile]...]
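
The -l lang option in the usage line above selects the recognition language. As a rough sketch of how that might be wrapped for several pages at once (ocr_with_lang is a hypothetical helper name, and it assumes the language data for whichever code you pass, such as eng or deu, is already installed):

```shell
# Sketch: run tesseract with an explicit language on one or more images.
# ocr_with_lang is a hypothetical helper, not part of tesseract.
ocr_with_lang() {
    lang=$1
    shift
    for f in "$@"; do
        # "${f%.*}" strips the extension to form the output base name
        tesseract "$f" "${f%.*}" -l "$lang"
    done
}
```

For example, ocr_with_lang deu page1.tiff page2.tiff would write page1.txt and page2.txt using the German data.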

gocr usage is:

Optical Character Recognition --- gocr 0.44
using: gocr [options] pnm_file_name # use - for stdin
options (see gocr manual pages for more details):
-h - get this help
-i name - input image file (pnm,pgm,pbm,ppm,pcx,...)
-o name - output file (redirection of stdout)
-e name - logging file (redirection of stderr)
-x name - progress output to fifo (see manual)
-p name - database path including final slash (default is ./db/)
-f fmt - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
-l num - threshold grey level 0<160<=255 (0 = autodetect)
-d num - dust_size (remove small clusters, -1 = autodetect)
-s num - spacewidth/dots (0 = autodetect)
-v num - verbose (see manual page)
-c string - list of chars (debugging, see manual)
-C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
-m num - operation modes (bitpattern, see manual)
-a num - value of certainty (in percent, 0..100, default=95)
examples:
gocr -m 4 text1.pbm # do layout analyzis
gocr -m 130 -p ./database/ text1.pbm # extend database
djpeg -pnm -gray text.jpg | gocr - # use jpeg-file via pipe
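
The djpeg example above pipes a converted image into gocr on stdin, and the same pattern should work for TIFF scans. This is a sketch under assumptions rather than something I ran for this post: it assumes tifftopnm from the netpbm tools is installed, and gocr_tiff is a hypothetical helper name. The -f UTF8 output format comes from the option list above.

```shell
# Sketch: feed a TIFF scan to gocr over a pipe, mirroring the djpeg example.
# Assumes tifftopnm (netpbm) is installed; gocr_tiff is a hypothetical name.
gocr_tiff() {
    src=$1
    out=${src%.*}.txt                      # page1.tiff -> page1.txt
    tifftopnm "$src" | gocr -f UTF8 - > "$out"
}
```

Running gocr_tiff page1.tiff would then leave the recognised text in page1.txt.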

For further reading:
http://jocr.sourceforge.net/
http://code.google.com/p/tesseract-ocr/
