  1. Part 1: Tesseract and Image Quality (this post)
  2. Part 2: Tesseract and Image Dimensions
  3. Part 3: Tesseract and Image Rotation

TLDR Summary

Image quality does affect the results from Tesseract, but even at extremely low JPG quality settings the impact of compression artifacts and image noise on the OCR output is minimal.

Using Tesseract

Using Tesseract is extremely simple. A minimal C++ example is available from this GitHub repo. You may also be interested in this YouTube video:

The relevant C++ Tesseract lines -- without any error checking -- are:

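    #include <iostream>
    #include <memory>
    #include <tesseract/baseapi.h>
    #include <opencv2/opencv.hpp>

    int main()
    {
        // English language data, treat the page as a single block of text
        std::unique_ptr<tesseract::TessBaseAPI> tess(new tesseract::TessBaseAPI());
        tess->Init(nullptr, "eng");
        tess->SetPageSegMode(tesseract::PSM_SINGLE_BLOCK);

        // hand the raw pixels to Tesseract: data, width, height, bytes per pixel, bytes per line
        cv::Mat mat = cv::imread("image.png");
        tess->SetImage(mat.data, mat.cols, mat.rows, mat.channels(), mat.step1());
        tess->SetSourceResolution(70);

        // GetUTF8Text() returns a buffer that must be freed with delete[]
        std::unique_ptr<char[]> txt(tess->GetUTF8Text());
        std::cout << txt.get();
        // ...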

Note that GetUTF8Text() can return nullptr, so at the very least txt should be checked before it is used. Streaming a null char pointer to std::cout is undefined behavior.
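As a minimal sketch of that check, the helper below takes ownership of a raw buffer and only converts it to a string when the pointer is not null. The name take_text() is hypothetical and not part of Tesseract. It assumes, as the code above does, that the buffer was allocated with new[]. It also assumes a C++17 compiler for std::optional.

```cpp
#include <memory>
#include <optional>
#include <string>

// Hypothetical helper (not part of the Tesseract API).
// Takes ownership of a buffer allocated with new[] and returns its contents,
// or std::nullopt when the pointer is null.
std::optional<std::string> take_text(char* raw)
{
    std::unique_ptr<char[]> owner(raw); // freed with delete[] when owner goes out of scope
    if (!owner)
    {
        return std::nullopt;
    }
    return std::string(owner.get());
}
```

With the example above, this would be called as take_text(tess->GetUTF8Text()), and the text would only be printed when the returned optional holds a value.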

We're going to use this code with the following black-and-white PNG image:

The results from Tesseract are quite impressive. There are a few formatting differences, but the content of the text is identical. Here are the results, side-by-side with the original image on the left and the text from Tesseract on the right:

Rights and freedoms in Canada The Canadian Charter of Rights and Freedoms guarantees the rights and freedoms set out in it subject only to such reasonable limits prescribed by law as can be demonstrably justified in a free and democratic society. Fundamental freedoms Everyone has the following fundamental freedoms: (a) freedom of conscience and religion; (b) freedom of thought, belief, opinion and expression, including freedom of the press and other media of communication; (c) freedom of peaceful assembly; and (d) freedom of association. Democratic rights of citizens Every citizen of Canada has the right to vote in an election of members of the House of Commons or of a legislative assembly and to be qualified for membership therein.
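One possible way to check that the content is identical apart from formatting is to collapse all whitespace before comparing the two strings. The sketch below is not from the original code. The function names are made up for illustration, and it assumes that formatting differences are limited to spaces, tabs and newlines.

```cpp
#include <cctype>
#include <string>

// Collapses every run of whitespace into a single space and trims both ends.
std::string normalize_whitespace(const std::string& in)
{
    std::string out;
    bool pending_space = false;
    for (unsigned char ch : in)
    {
        if (std::isspace(ch))
        {
            // remember the gap, but never emit a leading space
            pending_space = !out.empty();
        }
        else
        {
            if (pending_space)
            {
                out += ' ';
                pending_space = false;
            }
            out += static_cast<char>(ch);
        }
    }
    return out; // a trailing gap is never flushed, so the end is trimmed too
}

// True when both strings contain the same text once layout is ignored.
bool same_text_ignoring_layout(const std::string& a, const std::string& b)
{
    return normalize_whitespace(a) == normalize_whitespace(b);
}
```

The comparison is still case-sensitive, so a difference such as "Canada" versus "canada" is reported as a mismatch.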

Image Quality

When using JPG images, the quality level has only a minor impact on the results from Tesseract. Even at q=10 only a handful of characters are wrong, and by the time we reach q=30 the text extracted by Tesseract matches the original.

The results below show, for each JPG quality level, the text returned by Tesseract and how it differs from the original:

Image quality: q=10
Tesseract results: Rights and freedoms in canada The Canadian Charter of Rights and Freedoms guarantees the rights and freedoms set out in it subject only to such reasonable limits prescribed by law as can be demonstrably justified in a free and democratic society. Fundamental freedoms Everyone has the following fundamental freedoms: (a) freedom of conscience and religion; (b) freedom of thought, belief, opinion and expression, ‘including freedom of the press and other media of communication; (c) freedom of peaceful assembly; and (d) freedom of assoctatton. Democratic rights of citizens Every citizen of Canada has the right to vote in an election of members of the House of Commons or of a legislative assembly and to be qualified for membership therein.
Comments:
  1. lowercase C in Canada
  2. single quote at start of line before the word including
  3. 2 typos in the word association

Image quality: q=20
Tesseract results: Rights and freedoms in Canada The Canadian Charter of Rights and Freedoms guarantees the rights and freedoms set out in it subject only to such reasonable Limits prescribed by law as can be demonstrably justified in a free and democratic society. Fundamental freedoms Everyone has the following fundamental freedoms: (a) freedom of conscience and religion; (b) freedom of thought, belief, opinion and expression, including freedom of the press and other media of communication; (c) freedom of peaceful assembly; and (d) freedom of association. Democratic rights of citizens Every citizen of Canada has the right to vote in an election of members of the House of Commons or of a legislative assembly and to be qualified for membership therein.
Comments:
  1. uppercase L in limits

Image quality: q=30 up to q=90
Tesseract results: Rights and freedoms in Canada The Canadian Charter of Rights and Freedoms guarantees the rights and freedoms set out in it subject only to such reasonable limits prescribed by law as can be demonstrably justified in a free and democratic society. Fundamental freedoms Everyone has the following fundamental freedoms: (a) freedom of conscience and religion; (b) freedom of thought, belief, opinion and expression, including freedom of the press and other media of communication; (c) freedom of peaceful assembly; and (d) freedom of association. Democratic rights of citizens Every citizen of Canada has the right to vote in an election of members of the House of Commons or of a legislative assembly and to be qualified for membership therein.
Comments:
  1. no differences (other than whitespace)

From q=30 upward, the results are the same as if a lossless image format like PNG had been used. Given the speed and size trade-offs of JPG images, a quality setting of q=70 is recommended, which is well above that threshold. JPG compression therefore shouldn't cause any concerns when using Tesseract for OCR.
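To put a number on how far each result is from the reference text, one option is the Levenshtein edit distance between the two strings. This is a sketch and not the method used to produce the comments above, which were written by hand. The function name is an assumption, and the distance is counted in bytes, so a multi-byte UTF-8 character such as the stray ‘ counts as more than one edit.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-byte insertions,
// deletions and substitutions needed to turn a into b.
std::size_t edit_distance(const std::string& a, const std::string& b)
{
    // only two rows of the dynamic programming table are kept at a time
    std::vector<std::size_t> prev(b.size() + 1);
    std::vector<std::size_t> cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j)
    {
        prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
    }
    for (std::size_t i = 1; i <= a.size(); ++i)
    {
        cur[0] = i; // turning a[0..i) into an empty string takes i deletions
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            const std::size_t substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, substitution});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

For example, edit_distance("association", "assoctatton") is 2, which matches the two typos noted for q=10, and edit_distance("Canada", "canada") is 1. Collapsing whitespace in both strings first keeps pure layout differences out of the count.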

Continue reading with part 2, Tesseract and Image Dimensions.

Last modified: 2022-03-28
Stéphane Charette, stephanecharette@gmail.com