OCR Recognition Languages

ABBYY OCR technology supports more than 200 recognition languages of several types:

  • Natural languages, such as English, Russian, or German, including languages with distinct writing systems such as Chinese (PRC and Taiwan), Japanese, Korean and Korean (Hangul), Thai, Hebrew, and Arabic
  • Artificial languages: Esperanto, Interlingua, Ido, Occidental
  • Formal languages: Basic, C/C++, COBOL, Fortran, Java, Pascal, and simple chemical formulas

Languages contain special language units or data types, e.g.:

  • Addresses
  • Date and time
  • Names, etc.
  • For some natural languages: City, village, settlement (English, United Kingdom); Currency in words (English, United States), etc.

The languages listed above are so-called 'predefined languages'. In addition, you can define your own languages and use them for recognition.

The screenshot below shows the “Language Editor” implementation of FineReader PDF, the desktop application for individual productivity.

[Screenshot: finereader_language_editor.png]

Structure of a Recognition Language

Every recognition language has the following properties:

  • Name
  • Set of allowed characters:
    • alphabet
    • list of prefixes
    • list of suffixes
    • alphabet for subscripts
    • alphabet for superscripts
    • list of ignored characters
  • Dictionary (optional: a language can have one, but recognition also works without it)
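
The properties listed above can be pictured as a simple record. The following Python sketch is purely illustrative: the class RecognitionLanguage, its field names, and the "PartNumbers" example are hypothetical and are not part of any ABBYY API. It only shows how the name, the character sets, and the optional dictionary relate to each other.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RecognitionLanguage:
    """Hypothetical model of a recognition language (not an ABBYY API type)."""
    name: str
    alphabet: set[str]
    prefixes: list[str] = field(default_factory=list)
    suffixes: list[str] = field(default_factory=list)
    subscript_alphabet: set[str] = field(default_factory=set)
    superscript_alphabet: set[str] = field(default_factory=set)
    ignored_characters: set[str] = field(default_factory=set)
    dictionary: Optional[set[str]] = None  # optional: may be absent

    def allows(self, text: str) -> bool:
        """True if every non-ignored character belongs to the alphabet."""
        return all(
            ch in self.alphabet
            for ch in text
            if ch not in self.ignored_characters
        )

    def in_dictionary(self, word: str) -> Optional[bool]:
        """None when the language has no dictionary, else a membership test."""
        if self.dictionary is None:
            return None
        return word in self.dictionary


# A user-defined language for part numbers, without a dictionary (assumed example).
part_numbers = RecognitionLanguage(
    name="PartNumbers",
    alphabet=set("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"),
    prefixes=["PN-"],
    ignored_characters={" "},
)
```

In this sketch, a language without a dictionary still answers character-level questions through allows(), which mirrors the point that a dictionary is optional.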

Language Auto-Detection

  • ABBYY technologies can detect the language of a document automatically.
  • The product chooses the best-matching language from a predefined group of languages.
  • This group can be set or edited by the user or developer.

(FineReader Engine 11 and later versions include new capabilities for working with multi-language documents.)
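
The idea of choosing the best match from a predefined group can be illustrated with a toy heuristic. The Python sketch below is an assumption made for illustration only and does not describe how ABBYY products detect languages: the function detect_language and the candidate_group mapping are hypothetical, and the score is simply the share of letters covered by each candidate's alphabet.

```python
def detect_language(text, candidates):
    """Return the candidate whose alphabet covers the largest share of letters.

    `candidates` maps a language name to its set of lowercase letters.
    Returns None if the text contains no letters or the group is empty.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters or not candidates:
        return None

    def coverage(alphabet):
        # Fraction of the text's letters that belong to this alphabet.
        return sum(ch in alphabet for ch in letters) / len(letters)

    # On a tie, max() keeps the first candidate in the group's order.
    return max(candidates, key=lambda name: coverage(candidates[name]))


# The group of candidate languages is editable by the caller (assumed example).
candidate_group = {
    "English": set("abcdefghijklmnopqrstuvwxyz"),
    "German": set("abcdefghijklmnopqrstuvwxyzäöüß"),
    "Russian": set("абвгдежзийклмнопрстуфхцчшщъыьэюяё"),
}
```

Restricting the group to the languages that can actually occur in a batch of documents reduces the chance of a wrong pick, which is why the group is left configurable here.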
