This article is about installing and trying out the standard OCR software Tesseract 5.1.0 on a LINUX Ubuntu 20.04.3 system.
Open Source OCR Software
OCR (Optical Character Recognition) can read text from the graphical pixel representation of an image like a screenshot.
Tesseract was developed at HP starting in 1985, was open-sourced in 2005, and has since been continued with support from Google. It is a natively compiled library written in C++. For WINDOWS there are DLLs available, but for LINUX you need to build it from source if you want this version.
If you read the Dependencies section of the Tesseract README.md, you will find out that it depends on Leptonica. This is a native library, written in C, that can perform image manipulations. It looks like this also needs to be built from source. Tesseract will remind you about a missing Leptonica when you run its configure script.
Thus I downloaded the source code of both by clicking the "Code" button on the Tesseract and Leptonica git pages, and extracted the ZIPs to the programs folder in my HOME directory.
Build Leptonica
For instructions I looked into the README.html file that was in the downloaded directory.
As a normal user, I opened a command line terminal (console) and ran:
cd $HOME/programs/leptonica-master
./autogen.sh
./configure
make
sudo make install
make clean
This took quite a while, about 20 minutes I think. Check the output of each command to see whether everything went well. You can run "echo $?" immediately after any command to check whether it succeeded: the response should be "0" (zero), anything else indicates an error.
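The manual "echo $?" check can also be folded into the build steps themselves. The following is only a minimal sketch, not part of the Leptonica instructions: run_step is a hypothetical helper name that runs one command, prints whether it succeeded, and passes the exit status on.

```shell
# run_step is a hypothetical helper (not from the Leptonica README):
# it runs one command and reports its exit status, so that a failure
# does not get lost in pages of build output.
run_step() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit status $status): $*" >&2
        return "$status"
    fi
    echo "OK: $*"
}

# Usage, assuming the same directory as above; '&&' stops the chain
# at the first step that fails:
#   cd "$HOME/programs/leptonica-master" &&
#   run_step ./autogen.sh &&
#   run_step ./configure &&
#   run_step make
```

Chaining with "&&" means a failing configure never gets as far as make, which saves waiting for a build that cannot succeed.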
The make install command writes to /usr/local/lib, thus it needs super-user rights ("sudo").
Build Tesseract
For instructions I looked into the INSTALL text file that was in the downloaded directory.
This is what I launched:
cd $HOME/programs/tesseract-main
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
make clean
That build was faster than the Leptonica build.
Afterwards I downloaded two "trained" language data files for English and German from the tessdata git page. I moved them to /usr/local/share/tessdata:
sudo mv eng.traineddata deu.traineddata /usr/local/share/tessdata
You can check that this is where Tesseract expects them by entering:
tesseract --list-langs
Response was:
List of available languages in "/usr/local/share/tessdata/" (2):
deu
eng
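The same check can be scripted, for example to confirm one specific language before starting a scan. This is a small sketch under assumptions: has_lang is a hypothetical helper name, and it relies on the language codes appearing one per line, as in the response above.

```shell
# has_lang is a hypothetical helper: it reads the output of
# 'tesseract --list-langs' on stdin and succeeds only if the given
# language code appears on a line of its own.
has_lang() {
    grep -qxF -- "$1"
}

# Usage, assuming tesseract is on the PATH. '2>&1' is added because
# some Tesseract versions may print the list to stderr:
#   tesseract --list-langs 2>&1 | has_lang deu && echo "German is installed"
```

Matching the whole line keeps the header line, which contains the directory path, from producing a false hit.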
Time for a reality check!
Test
Now I wanted to see some results.
I used ksnip to create this screenshot as ocr-test.png:
Then I started the OCR scanning:
tesseract ocr-test.png stdout
The argument stdout makes it write the recognized text to the console.
And this is what I got:
Welcome Software Blog Contact Software "z IEVER IR eI M\ ilale))
Looks like images and non-letter characters confuse the tool. Most likely I need to spend some more time on this. Or is there a better solution on the web than the 37-year-old Tesseract?
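One thing that may be worth trying before looking elsewhere is Tesseract's page segmentation mode, set with the --psm option, together with an explicit language via -l. The sketch below only illustrates how such a comparison could be run; try_psm_modes and the TESSERACT variable are hypothetical names, and whether any mode improves this particular screenshot is untested.

```shell
# try_psm_modes is a hypothetical helper, not from the Tesseract docs.
# It runs the same image through several page segmentation modes
# (--psm) so that the results can be compared side by side.
# TESSERACT can be overridden; it defaults to the installed binary.
try_psm_modes() {
    image=$1
    shift
    for psm in "$@"; do
        echo "=== --psm $psm ==="
        "${TESSERACT:-tesseract}" "$image" stdout -l eng --psm "$psm"
    done
}

# Usage: 3 is the default (fully automatic), 6 assumes a single
# uniform block of text, 11 looks for sparse text in no particular order.
#   try_psm_modes ocr-test.png 3 6 11
```

Running "tesseract --help-psm" lists all available modes.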