Friday, April 1, 2022

Installing Tesseract OCR on LINUX

This article is about installing and trying out the standard OCR software Tesseract 5.1.0 on a LINUX Ubuntu 20.04.3 system.

Open Source OCR Software

OCR (Optical Character Recognition) reads text from the pixels of an image, for example a screenshot.

Tesseract was originally developed at HP, starting in 1985. It was open-sourced in 2005 and then developed further with Google's sponsorship. It is a natively compiled library written in C++. There are DLLs available for WINDOWS, but for LINUX I built it from source.

If you read the Dependencies section of the Tesseract README.md, you will find that it depends on Leptonica, a native library written in C that performs image manipulations. It looks like this one needs to be built from source, too. Tesseract's configure script reminds you if Leptonica is missing.

Thus I downloaded the source code of both by clicking the "Code" button on the Tesseract and Leptonica git pages, and extracted the ZIPs into the programs folder in my HOME directory.

Build Leptonica

For instructions I looked into the README.html file that was in the downloaded directory. As a normal user, I opened a command line terminal (console) and ran:

cd $HOME/programs/leptonica-master
./autogen.sh
./configure
make
sudo make install
make clean

This took quite a while, about 20 minutes I think. Check the output of each command for errors. You can enter "echo $?" immediately after any command to see whether it succeeded: the response should be "0" (zero), anything else indicates an error.
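Instead of typing "echo $?" after every step, the check can be built into a small helper. This is a minimal sketch and not part of the Leptonica or Tesseract instructions; the function name run_step and the log file name build.log are my own choices.

```shell
#!/bin/sh
# Hypothetical helper: run one build step, append its output to a log file,
# and report the exit status (0 means success, anything else is an error).
LOG="${LOG:-build.log}"

run_step() {
    echo ">>> $*" >> "$LOG"
    "$@" >> "$LOG" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit status $status): $*" >&2
    fi
    return "$status"
}

# Usage, assuming you are inside the source directory:
#   run_step ./autogen.sh && run_step ./configure && run_step make
```

Because the steps are chained with &&, the first failing step stops the chain, and build.log holds the output to look into.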

The make install command writes to /usr/local/lib, thus it needs super-user rights ("sudo").
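To see whether the library really arrived there, you can look for its files. The following sketch assumes that the Leptonica shared library file name starts with liblept, which may differ between versions; has_lib is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a shared library whose file name starts
# with the given prefix exists in the given directory.
has_lib() {
    dir=$1
    prefix=$2
    for f in "$dir/$prefix"*.so*; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Usage (the prefix "liblept" is an assumption, check with ls if in doubt):
if has_lib /usr/local/lib liblept; then
    echo "Leptonica library found in /usr/local/lib"
else
    echo "No liblept*.so* in /usr/local/lib"
fi
```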

Build Tesseract

For instructions I looked into the INSTALL text-file that was in the downloaded directory. This is what I launched:

cd $HOME/programs/tesseract-main
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
make clean

This build was faster than the Leptonica build.
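The sudo ldconfig step refreshes the dynamic linker cache, so that the freshly installed shared libraries in /usr/local/lib are found when the tesseract executable starts. A quick sanity check after the build could look like the sketch below. The helper name check_cmd is hypothetical, and the grep pattern assumes that the library file names contain "tesseract" and "lept".

```shell
# Hypothetical helper: report whether a command is available on the PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Which version got installed?
if check_cmd tesseract; then
    tesseract --version
fi

# Does the linker cache know the libraries? (ldconfig -p prints the cache)
if check_cmd ldconfig; then
    ldconfig -p | grep -E 'tesseract|lept' || echo "libraries not in linker cache"
fi
```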

Afterwards I downloaded two "trained" language data files for English and German from the tessdata git page. I moved them to /usr/local/share/tessdata:

sudo mv eng.traineddata deu.traineddata /usr/local/share/tessdata

You can check that this is where Tesseract expects them by entering:

tesseract --list-langs

Response was:

List of available languages in "/usr/local/share/tessdata/" (2):
deu
eng
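Before starting a recognition run, a script can verify that the language files it needs are really present. This is just a sketch: the function name missing_langs is hypothetical, and the directory is the one reported by --list-langs above.

```shell
# Hypothetical helper: print every language code whose .traineddata file
# is missing from the given tessdata directory.
missing_langs() {
    dir=$1
    shift
    for lang in "$@"; do
        [ -f "$dir/$lang.traineddata" ] || echo "$lang"
    done
}

# Usage:
missing=$(missing_langs /usr/local/share/tessdata eng deu)
if [ -n "$missing" ]; then
    echo "Missing language data:" $missing
fi
```

If the files live in another directory, Tesseract also accepts a --tessdata-dir option pointing to it.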

Time for a reality check!

Test

Now I wanted to see some results. I used ksnip to take a screenshot and saved it as ocr-test.png.

Then I started the OCR scanning:

tesseract ocr-test.png stdout

The argument stdout makes it write the recognized text to the console. And this is what I got:
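Without further arguments Tesseract uses English. The language can be chosen with the -l option, several languages are joined with a plus sign, and instead of stdout you can give an output base name, to which Tesseract appends .txt. The sketch below wraps this in two functions. The names join_langs, ocr_to_file and the TESSERACT variable are my own inventions; the variable only exists so that the command can be swapped out.

```shell
TESSERACT="${TESSERACT:-tesseract}"

# Hypothetical helper: join language codes with "+", as the -l option expects.
join_langs() {
    result=""
    for lang in "$@"; do
        result="${result:+$result+}$lang"
    done
    echo "$result"
}

# Hypothetical helper: ocr_to_file IMAGE OUTPUTBASE LANG...
# Writes the recognized text to OUTPUTBASE.txt.
ocr_to_file() {
    image=$1
    base=$2
    shift 2
    "$TESSERACT" "$image" "$base" -l "$(join_langs "$@")"
}

# Usage:
#   ocr_to_file ocr-test.png ocr-test eng deu
#   cat ocr-test.txt
```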

Welcome Software Blog Contact

Software

"z IEVER IR eI M\ ilale))

It looks like images and non-letter characters confuse the tool. Most likely I need to spend some more time on this. Or is there a better solution on the web than the 37-year-old Tesseract?
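One thing I could try next is the page segmentation mode (option --psm, the modes are listed by "tesseract --help-psm"). A screenshot of a web page is not a uniform page of text, so modes like 6 (a single uniform block of text) or 11 (sparse text) might behave differently from the default. Whether any of them helps with this screenshot is untested. The sketch below just runs several modes for comparison; try_psm_modes and the TESSERACT variable are hypothetical names.

```shell
TESSERACT="${TESSERACT:-tesseract}"

# Hypothetical helper: run OCR on one image with several page segmentation
# modes and print each result under a heading, for side-by-side comparison.
try_psm_modes() {
    image=$1
    shift
    for mode in "$@"; do
        echo "=== --psm $mode ==="
        "$TESSERACT" "$image" stdout --psm "$mode"
    done
}

# Usage:
#   try_psm_modes ocr-test.png 3 6 11
```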