OCR Language Auto-Detection in ABBYY FineReader Engine

ABBYY OCR technology uses language information and dictionaries to achieve high recognition quality. Real documents can contain several languages on one page, and a document stream can contain a large number of different languages, for example:
  • a publication that presents the same content in two or more languages in separate columns, such as an airline magazine
  • documents from the European Union, which can appear in 25 or more different languages. The same applies to internal business documents of a globally operating enterprise.

Up to V10 Technologies

  • Even up to the ABBYY technology cycle V10, the OCR engine was able to process multi-language documents.
  • The technology selected the best-matching language from a predefined group of languages. This group had to be set, and could be edited, by the user or developer.
  • It was recommended to use no more than 5 different languages in a group. The more languages were selected, the higher the number of internal OCR hypotheses, which could negatively impact both the OCR quality and the processing time.
  • If the language input is very mixed and consists of many different languages, manual pre-sorting is often not an option. Instead, multiple OCR runs with different language settings had to be made, and the system had to decide, based on the internal recognition statistics, which combination delivered the best results.
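
The multi-run approach from the last bullet can be sketched roughly as follows. This is an illustration only, not ABBYY code: `run_ocr` is a hypothetical callable standing in for one engine run, `fake_run_ocr` is a stub with canned values, and the mean word confidence is an assumed stand-in for the internal recognition statistics, which the article does not specify.

```python
from statistics import mean
from typing import Callable, Sequence

# One OCR run is modelled as a list of (word, confidence) pairs,
# with confidence between 0.0 and 1.0. This shape is an assumption.
OcrResult = list[tuple[str, float]]
RunOcr = Callable[[str, Sequence[str]], OcrResult]


def pick_best_language_group(
    image_path: str,
    language_groups: Sequence[Sequence[str]],
    run_ocr: RunOcr,
) -> tuple[tuple[str, ...], float]:
    """Run OCR once per language group and return (best_group, best_score)."""
    if not language_groups:
        raise ValueError("at least one language group is required")
    scored = []
    for group in language_groups:
        words = run_ocr(image_path, group)
        # Assumed statistic: mean word confidence; an empty result scores 0.0.
        score = mean(conf for _, conf in words) if words else 0.0
        scored.append((tuple(group), score))
    # max() returns the first entry with the highest score.
    return max(scored, key=lambda item: item[1])


def fake_run_ocr(image_path: str, languages: Sequence[str]) -> OcrResult:
    """Stub standing in for a real engine call; returns canned confidences."""
    confidence = 0.9 if "German" in languages else 0.5
    return [("Beispiel", confidence), ("Text", confidence)]


best_group, best_score = pick_best_language_group(
    "page.tif",
    [["English", "French"], ["German", "Dutch"]],
    fake_run_ocr,
)
print(best_group, best_score)
```

Each additional language group costs one more full OCR run, which is why this approach became expensive for very mixed input.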

Since V11 Technologies

FineReader Engine 11 is the first SDK to implement the new language detection. It is part of the FRDocument object.

  • The recognition language of a document can be automatically detected, but the developer has to specify at least 3 languages that might show up in the document.
  • The recognition language is detected for each word in the text.

The FRDocument object exposes the following properties for language detection:

  • BasicLanguage: Returns the main language of the recognized document. The property contains the internal name of the first language in the collection of detected languages (DetectedLanguages property). This property has a meaningful value only if the IRecognizerParams::DetectLanguage property has been set to TRUE during recognition; otherwise it is an empty string.
  • DetectedLanguages: Provides access to the collection of recognition languages detected in the recognized document. Languages in the collection are sorted by frequency of occurrence, from the most frequent to the least frequent. This property has a meaningful value only if the IRecognizerParams::DetectLanguage property has been set to TRUE during recognition. The list of languages is updated only after recognition, i.e. if you edit the layout of the document manually, the collection remains the same.
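
The relationship between per-word detection, DetectedLanguages and BasicLanguage can be modelled with a short sketch. It does not call the FineReader Engine API and is not the engine's implementation: `summarize_languages` and its input list are hypothetical, and the sketch only mirrors the ordering rule described above (most frequent language first, the first entry serving as the main language).

```python
from collections import Counter


def summarize_languages(word_languages: list[str]) -> tuple[list[str], str]:
    """Return (detected_languages, basic_language) from per-word language labels."""
    counts = Counter(word_languages)
    # most_common() orders entries from the most to the least frequent.
    detected = [language for language, _ in counts.most_common()]
    # The main language is the first entry; an empty string if nothing was detected.
    basic = detected[0] if detected else ""
    return detected, basic


# Hypothetical per-word labels for a short mixed-language text.
labels = ["English", "German", "English", "French", "German", "English"]
detected, basic = summarize_languages(labels)
print(detected, basic)
```

In a real integration these values would be read from the FRDocument object after recognition rather than computed by hand.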

The screenshots below illustrate language detection in the GUI of the ABBYY FineReader desktop application.

Developers can implement a similar system using the API in the SDK.

language-detection-fr11-01.png

language-detection-fr11-02.png
