ABBYY OCR technology can process more than 200 recognition languages of different types:
- Natural languages such as English, Russian or German, as well as languages with their own writing systems such as Chinese (PRC and Taiwan), Japanese, Korean and Korean/Hangul, Thai, Hebrew and Arabic
- Artificial languages: Esperanto, Interlingua, Ido, Occidental
- Formal languages: programming languages (Basic, C/C++, COBOL, Fortran, Java, Pascal) and simple chemical formulas
Languages can also contain special language units or data types, for example:
- Addresses
- Date and time
- Names, etc.
- For some natural languages: City, village, settlement (English, United Kingdom); Currency in words (English, United States), etc.
The languages mentioned above are so-called 'predefined languages'. In addition, you can define your own languages and use them for recognition.
The screenshot below shows the “Language Editor” implementation of FineReader PDF, the desktop application for individual productivity.
Structure of a Recognition Language
Every recognition language has the following properties:
- Name
- Set of allowed characters:
  - alphabet
  - list of prefixes
  - list of suffixes
  - alphabet for subscripts
  - alphabet for superscripts
  - list of ignored characters
- A dictionary (optional: a language can have one, but recognition will also “work” without one)
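The properties listed above can be pictured as a small data structure. The following is only an illustrative sketch in Python, not ABBYY's API: the class `RecognitionLanguage`, its field names and the helper methods are hypothetical. The way ignored characters are handled here (they are skipped when a word is checked against the alphabet) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RecognitionLanguage:
    """Hypothetical model of a recognition language (not an ABBYY type)."""
    name: str
    alphabet: FrozenSet[str]
    prefixes: FrozenSet[str] = frozenset()
    suffixes: FrozenSet[str] = frozenset()
    subscript_alphabet: FrozenSet[str] = frozenset()
    superscript_alphabet: FrozenSet[str] = frozenset()
    ignored_characters: FrozenSet[str] = frozenset()
    # The dictionary is optional: None means "no dictionary attached".
    dictionary: Optional[FrozenSet[str]] = None

    def allows(self, word: str) -> bool:
        """True if every non-ignored character of `word` is in the alphabet.

        Assumption: ignored characters are simply skipped during the check.
        """
        return all(
            ch in self.alphabet
            for ch in word
            if ch not in self.ignored_characters
        )

    def in_dictionary(self, word: str) -> Optional[bool]:
        """Dictionary lookup, or None when the language has no dictionary."""
        if self.dictionary is None:
            return None
        return word in self.dictionary


# A made-up user-defined language for part numbers, without a dictionary.
part_numbers = RecognitionLanguage(
    name="PartNumbers",
    alphabet=frozenset("ABC0123456789"),
    ignored_characters=frozenset("-"),
)
print(part_numbers.allows("AB-12"))
print(part_numbers.in_dictionary("AB-12"))
```

A language without a dictionary can still constrain recognition through its character sets alone, which is why the dictionary field defaults to `None` in this sketch.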
Language Auto-Detection
- ABBYY technologies can detect the language of a document automatically.
- The product chooses the best-matching language from a predefined group of languages.
- This group can be set or edited by the user or developer.
(FineReader Engine 11 and later versions contain new capabilities for working with multi-language documents.)
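The idea of choosing the best-matching language from a configurable group can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not a description of how ABBYY products detect languages: the function `detect_language`, the dictionary-hit scoring and the tiny word lists are all hypothetical.

```python
from typing import Dict, Iterable, Set


def detect_language(words: Iterable[str], candidates: Dict[str, Set[str]]) -> str:
    """Return the candidate language whose word list matches the most words.

    `candidates` maps a language name to a set of lowercase dictionary words.
    It plays the role of the user-editable group of languages. The scoring
    (share of words found in the dictionary) is an assumption for this sketch.
    """
    lowered = [w.lower() for w in words]
    if not lowered or not candidates:
        raise ValueError("need at least one word and one candidate language")

    def score(name: str) -> float:
        hits = sum(1 for w in lowered if w in candidates[name])
        return hits / len(lowered)

    # On a tie, max() keeps the first candidate in the group's order.
    return max(candidates, key=score)


# The group of candidate languages is plain data, so it can be edited freely.
group = {
    "English": {"the", "house", "is", "red"},
    "German": {"das", "haus", "ist", "rot"},
}
print(detect_language("Das Haus ist rot".split(), group))
```

Keeping the candidate group small matters in a scheme like this one: every added language is another chance for a wrong match, which is one reason to let the user or developer restrict the group.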