OCR Tools


Recently, for some of our tasks, we needed to do R&D on the following:

  1. Multiple scanned images to multipage PDF
  2. Multipage PDF to multiple images
  3. PDF to searchable PDF
  4. PDF to text
  5. Images to text
  6. File meta-tagging (add properties)

After a good deal of R&D, we found that Linux can do all of the above.

LINUX OCR TOOLS

https://help.ubuntu.com/community/OCR

http://williamjturkel.net/2013/07/06/doing-ocr-using-command-line-tools-in-linux/

  1. GOCR

http://wiki.ubuntuusers.de/gocr

GOCR is a command-line tool for text recognition that has been developed by Joerg Schulenburg since 2000. The program is configurable and “trainable”, and it gives good results especially for sans-serif fonts. It is a pure character recognition program that works independently of the language. There is also a Tcl-based graphical user interface for GOCR, although it is not quite up to date. xsane uses GOCR as its default text recognition program, and OcrGui exposes some of its options in a graphical user interface. Many OCR front ends can use GOCR (e.g. ocrodjvu, OCRFeeder, gscan2pdf).
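As a rough sketch of how GOCR can be called from a script, the function below runs it on one image and writes the recognized text next to it. It assumes the `gocr` binary is installed and that the input is in a format GOCR can read directly (PNM is the safe choice). The function name `gocr_to_text` and the file name in the usage line are only illustrative.

```shell
# Hypothetical helper: run GOCR on one image and write the text next to it.
# Assumes gocr is on PATH; -i names the input image, -o the output text file.
gocr_to_text() {
    img=$1
    out="${img%.*}.txt"      # scan.pnm -> scan.txt
    gocr -i "$img" -o "$out" && echo "$out"
}
```

Usage would look like `gocr_to_text scan.pnm`, which prints the name of the text file it wrote.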

  1. EXIFTOOL

http://wiki.ubuntuusers.de/ExifTool


/var/www/ocr_images$ exiftool -v

Syntax:  exiftool [OPTIONS] FILE

Consult the exiftool documentation for a full list of options.

exiftool -Title=prags -Author=prags -Subject=testsubject -Keywords=Suraj,navin,jigar,prags INPUTFILENAME.jpg
/var/www/ocr_images$ exiftool 1.gif

Output:

ExifTool Version Number         : 8.60
File Name                       : 1.gif
Directory                       : .
File Size                       : 12 kB
File Modification Date/Time     : 2014:02:26 18:40:37+05:30
File Permissions                : rwxrwxrwx
File Type                       : GIF
MIME Type                       : image/gif
GIF Version                     : 89a
Image Width                     : 640
Image Height                    : 480
Has Color Map                   : Yes
Color Resolution Depth          : 1
Bits Per Pixel                  : 1
Background Color                : 0
XMP Toolkit                     : Image::ExifTool 8.60
Subject                         : testsubject
Title                           : prags
Author                          : prags
Keywords                        : Suraj,navin,jigar,prags
Image Size                      : 640x480


Install:


sudo apt-get install libimage-exiftool-perl
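
For meta-tagging many files, the tagging command can be wrapped in a small function that writes the tags and then prints them back for a quick check. This is only a sketch: the function name `tag_file` and its argument order are made up for illustration, and it assumes `exiftool` is on the PATH. By default exiftool keeps a backup copy of the file it modifies; `-overwrite_original` skips that backup, and `-s` prints short tag names.

```shell
# Hypothetical helper: write descriptive tags into a file, then print them back.
# Assumes exiftool is on PATH.
tag_file() {
    file=$1 title=$2 author=$3 keywords=$4
    # -overwrite_original: do not leave a "<file>_original" backup behind
    exiftool -overwrite_original \
        -Title="$title" -Author="$author" -Keywords="$keywords" "$file" &&
    # -s: short output, one "TagName : value" line per requested tag
    exiftool -s -Title -Author -Keywords "$file"
}
```

Usage would look like `tag_file 1.gif prags prags "Suraj,navin,jigar,prags"`.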

  1. IMAGEMAGICK

http://wiki.ubuntuusers.de/ImageMagick

ImageMagick provides the convert command, which is used in the Examples section at the end of this post.

  1. PDFOCR
    http://wiki.ubuntuusers.de/pdfocr

V 0.1.4

pdfocr is a program that turns scanned PDFs into searchable documents. The script, written in Ruby, uses tesseract-ocr as its default OCR engine, or optionally Cuneiform for Linux or OCRopus, and uses hocr2pdf from ExactImage to merge the original pages with the recognized text. pdftk and pdfimages are also used.

-i, --input [FILE]               Specify input PDF file
-o, --output [FILE]              Specify output PDF file
-t, --tesseract                  Use tesseract as the OCR engine (default)
-c, --cuneiform                  Use cuneiform as the OCR engine
-p, --ocropus                    Use ocropus as the OCR engine
-l, --lang [LANG]                Specify language for the OCR software
-w, --workingdir [DIR]           Specify directory to store temp files in
-k, --keep                       Keep temporary files around
-h, --help                       Show this message
-v, --version                    Show version


$ pdfocr -i 3_nonserach.pdf -o 3_searchable.pdf
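
To check that the result really is searchable, one option is to try extracting text from the output file. The sketch below wraps the pdfocr call and then uses `pdftotext` for the check; `pdftotext` (from poppler-utils) is not part of pdfocr and is an assumption here, as are the function name `make_searchable` and the default language code `eng`.

```shell
# Hypothetical helper: OCR a scanned PDF with pdfocr, then report whether the
# result contains extractable text. Assumes pdfocr and pdftotext are on PATH.
make_searchable() {
    src=$1 dst=$2 lang=${3:-eng}
    pdfocr -i "$src" -o "$dst" -l "$lang" || return 1
    # pdftotext writes to stdout when the output name is "-"
    if [ -n "$(pdftotext "$dst" - | tr -d '[:space:]')" ]; then
        echo "searchable"
    else
        echo "no text layer found"
    fi
}
```

Usage would look like `make_searchable 3_nonserach.pdf 3_searchable.pdf eng`.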

  1. TESSERACT-OCR

http://wiki.ubuntuusers.de/tesseract-ocr

tesseract 3.02

tesseract-ocr is a command-line program for text recognition. It was originally developed by Hewlett-Packard as a commercial program from 1984 to 1995, and the code was released in 2005. Development is supported by Google, which needed an open source solution for creating e-books. The program supports a number of Western European and Asian languages, such as Vietnamese. tesseract-ocr is a pure character recognition program: it does not provide layout analysis, and it outputs plain text and, from version 3.00, also hOCR. The text recognition can be “trained”.
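
For the "images to text" task, tesseract can be called once per image. The sketch below loops over the given images and prints the combined text. The function name `images_to_text` and the `OCR_LANG` variable are made up for illustration; the default language `eng` is an assumption, and the matching language data has to be installed. Note that tesseract takes an output base name and appends `.txt` itself.

```shell
# Hypothetical helper: OCR each given image with tesseract and print the
# combined text. Assumes tesseract is on PATH.
images_to_text() {
    lang=${OCR_LANG:-eng}          # assumed default; set OCR_LANG to override
    for img in "$@"; do
        base="${img%.*}"           # p1.png -> p1, so tesseract writes p1.txt
        tesseract "$img" "$base" -l "$lang" >/dev/null 2>&1 || return 1
        cat "$base.txt"
    done
}
```

Usage would look like `images_to_text p1.jpg p2.jpg > all_pages.txt`.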

Examples

// LINUX ==> MULTIPLE IMAGES TO MULTIPAGE PDF (the number of PDF pages depends on the number of input images)

convert -adjoin p1.jpg p2.jpg -quality 100 1_merged.pdf

// LINUX ==> PDF TO MULTIPLE IMAGES, ONE PER PDF PAGE (HIGH QUALITY)

convert -density 500 1_merged.pdf 1_split.jpg
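
The same split step can be combined with tesseract to cover the "PDF to text" task. The following is only a sketch: the function name `pdf_to_text`, the 300 dpi density, and the PNG page format are assumptions, and it expects ImageMagick's `convert` (with Ghostscript available for PDF input) and `tesseract` to be installed.

```shell
# Hypothetical helper: rasterize each PDF page, OCR it, and print the text.
pdf_to_text() {
    pdf=$1
    tmp=$(mktemp -d) || return 1
    status=0
    # %03d numbers the pages (page-000.png, page-001.png, ...) so that the
    # glob below visits them in page order
    convert -density 300 "$pdf" "$tmp/page-%03d.png" || status=1
    if [ "$status" -eq 0 ]; then
        for page in "$tmp"/page-*.png; do
            # tesseract appends .txt to the output base name
            tesseract "$page" "${page%.png}" >/dev/null 2>&1 || { status=1; break; }
            cat "${page%.png}.txt"
        done
    fi
    rm -rf "$tmp"                  # clean up the temporary page images
    return "$status"
}
```

Usage would look like `pdf_to_text 1_merged.pdf > 1_merged.txt`. A lower density runs faster, at the cost of recognition quality on small print.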





About Pragnesh Karia

Pragnesh Karia, Open Source Enthusiast, Software Professional, Software Developer, Technical Lead, Magento, Joomla, Joomla LMS, Moodle LMS, PHP, MySQL, Ajax, JavaScript, jQuery, Linux, Fan of Open Source, Annet Technologies, SEO Analyst, MooTools