Can ABBYY FineReader OCR Latin macrons?

In Latin texts, long vowels are often marked with a macron: "arma virumque canō, Troiae quī prīmus ab ōrīs"

I set the input language of my PDF to Latin, but the macrons are lost in the output sent to Word: "arma virumque cano, Troiae qui primus ab oris"

Is there a way to retain the macrons? Many thanks!

Comments

4 comments

  • IvanPopov

    If you are using FineReader Engine, it is possible to recognize any Unicode character. However, you might first need to define your own language with a custom alphabet that includes all the necessary symbols. The CustomLanguage sample in the Code Samples Library might help you do just that. If you look at the source code of that sample, you will find a function called makeTextLanguage(). If you modify it to look something like the following, it will return a TextLanguage object with a custom alphabet (the code fragments below use C# syntax):

    private FREngine.TextLanguage makeTextLanguage()
    {
        FREngine.LanguageDatabase languageDatabase = engineLoader.Engine.CreateLanguageDatabase();
        FREngine.TextLanguage textLanguage = languageDatabase.CreateTextLanguage();

        // Copy all attributes from the predefined Latin language
        FREngine.TextLanguage latinLanguage = engineLoader.Engine.PredefinedLanguages.Find("Latin").TextLanguage;
        textLanguage.CopyFrom( latinLanguage );
        textLanguage.InternalName = "SampleTextLanguage";

        // Add necessary symbols to the first (and single) BaseLanguage object within TextLanguage
        FREngine.BaseLanguage baseLanguage = textLanguage.BaseLanguages[0];
        baseLanguage.InternalName = "SampleBaseLanguage";

        // Extend the existing alphabet with the upper- and lower-case macron vowels
        string alphabet = baseLanguage.get_LetterSet( FREngine.BaseLanguageLetterSetEnum.BLLS_Alphabet );
        baseLanguage.set_LetterSet( FREngine.BaseLanguageLetterSetEnum.BLLS_Alphabet, alphabet + "ĀĒĪŌŪȲāēīōūȳ" );

        return textLanguage;
    }
    

    You should then pass this custom language to a recognition method, for example the Process() method of an FRDocument object:

    FREngine.FRDocument document = engineLoader.Engine.CreateFRDocument();
    …
    // Create a custom TextLanguage
    FREngine.TextLanguage textLanguage = makeTextLanguage();
    // Pass your custom language to the Process() method
    FREngine.DocumentProcessingParams documentProcessingParams = engineLoader.Engine.CreateDocumentProcessingParams();
    documentProcessingParams.PageProcessingParams.RecognizerParams.TextLanguage = textLanguage;
    document.Process( documentProcessingParams );
    

    This should allow you to recognize letters with macrons. If you need to recognize other symbols as well, make sure to add them to the alphabet of your custom language too. To improve recognition of these symbols further, you might also try recognizing with training (see Developer’s Help -> Guided Tour -> Advanced Techniques -> Using GUI Elements -> Recognizing with Training).
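
    The extra letters appended to the alphabet in makeTextLanguage() can also be generated instead of typed by hand. The following is only a sketch under stated assumptions: it uses plain .NET string normalization and makes no FineReader Engine calls, and the class and helper names (MacronAlphabet, BuildMacronLetters) are hypothetical, not part of the CustomLanguage sample.

```csharp
using System;
using System.Text;

static class MacronAlphabet
{
    // Hypothetical helper (not part of FineReader Engine): appends U+0304
    // (combining macron) to each base vowel, normalizes to NFC, and keeps
    // only the results that compose into a single UTF-16 character.
    public static string BuildMacronLetters(string baseVowels)
    {
        var result = new StringBuilder();
        foreach (char vowel in baseVowels)
        {
            string composed = (vowel + "\u0304").Normalize(NormalizationForm.FormC);
            if (composed.Length == 1)
            {
                result.Append(composed);
            }
        }
        return result.ToString();
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // For these twelve vowels the result should match the string
        // that the thread appends to the alphabet.
        Console.WriteLine(BuildMacronLetters("AEIOUYaeiouy"));
    }
}
```

    The returned string could then be concatenated onto the existing letter set in place of the hard-coded literal.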

  • GEORGEJUNG

    Hi! I cannot work out the following issue: I need to OCR texts containing pinyin diacritics ( o ā ɑ̄ ē ī ō ū ǖ / Ā Ē Ī Ō Ū Ǖ /á ɑ́ é í ó ú ǘ / Á É Í Ó Ú Ǘ / ǎ ɑ̌ ě ǐ ǒ ǔ ǚ / Ǎ Ě Ǐ Ǒ Ǔ Ǚ / à ɑ̀ è ì ò ù ǜ / À È Ì Ò Ù Ǜ / a ɑ e i o u ü / A E I O U o ā ɑ̄ ē ī ō ū ǖ / á ɑ́ é í ó ú ǘ /ǎ ɑ̌ ě ǐ ǒ ǔ ǚ / à ɑ̀ è ì ò ù ǜ / a ɑ e i o u ü), which the software either fails to recognize or mixes up. With previous versions of this software, and of the others included in the comparison table, I tried training, creating user-specific languages, adding every character to their dictionaries, etc., with no success at all. I have even asked the companies for a solution, but none seems to exist. Therefore, I think this situation should really be mentioned as the Achilles’ heel of the OCR field. I would really appreciate some advice on how to solve this problem, if that is possible, or a correction if I am wrong.
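
    One detail of the list above may be worth checking before building a custom alphabet: some of these letters exist in Unicode as a single precomposed character, while others (the ɑ variants) can only be written as a base letter plus a combining mark. The sketch below separates the two groups using plain .NET normalization. It is an illustration only; the names PinyinLetters and Classify are hypothetical, and whether a FineReader letter set accepts combining sequences is an open question that this thread does not answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class PinyinLetters
{
    // Hypothetical helper: normalizes each candidate letter to NFC and sorts it
    // into "single" (one UTF-16 character) or "sequences" (base letter plus
    // combining marks that have no precomposed form).
    public static (List<string> Single, List<string> Sequences) Classify(IEnumerable<string> letters)
    {
        var single = new List<string>();
        var sequences = new List<string>();
        foreach (string letter in letters)
        {
            string composed = letter.Normalize(NormalizationForm.FormC);
            if (composed.Length == 1)
            {
                single.Add(composed);
            }
            else
            {
                sequences.Add(composed);
            }
        }
        return (single, sequences);
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        // Escapes keep the code points unambiguous:
        // a + macron, u + diaeresis + macron, Latin alpha + macron, u + caron.
        string[] samples = { "a\u0304", "u\u0308\u0304", "\u0251\u0304", "u\u030C" };
        var (single, sequences) = Classify(samples);
        Console.WriteLine("Single characters:   " + string.Join(" ", single));
        Console.WriteLine("Combining sequences: " + string.Join(" ", sequences));
    }
}
```

    In this sample only the Latin alpha with macron lands in the second list, since Unicode defines no precomposed character for it.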

  • IvanPopov

    Recognition quality depends heavily on the quality of both the printing and the scanning of the original document. For example, if diacritic symbols are too small, they might be treated as garbage and not regarded as meaningful content. Similarly, printing defects might lead to different diacritics being treated as the same one. Therefore, more often than not, OCR results are only as good as the images being recognized. So far, as time-consuming and mundane as it may be, pattern training and custom dictionaries are still the best way to improve OCR results.
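
    To judge whether training or a custom dictionary is actually helping, it can be useful to check whether an OCR result differs from a known-good transcription only in its diacritics. The following sketch does that with plain .NET normalization; it does not call any FineReader API, and the names DiacriticCheck, StripDiacritics and LostOnlyDiacritics are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

static class DiacriticCheck
{
    // Decomposes the text (NFD), drops all non-spacing marks, and recomposes it.
    public static string StripDiacritics(string text)
    {
        var builder = new StringBuilder();
        foreach (char c in text.Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }
        return builder.ToString().Normalize(NormalizationForm.FormC);
    }

    // True when the two texts differ, but only in their diacritics.
    public static bool LostOnlyDiacritics(string expected, string ocrOutput)
    {
        bool identical = expected.Normalize(NormalizationForm.FormC)
                         == ocrOutput.Normalize(NormalizationForm.FormC);
        return !identical && StripDiacritics(expected) == StripDiacritics(ocrOutput);
    }

    static void Main()
    {
        // The two strings quoted in the original question.
        string expected = "arma virumque canō, Troiae quī prīmus ab ōrīs";
        string ocrOutput = "arma virumque cano, Troiae qui primus ab oris";
        Console.WriteLine(LostOnlyDiacritics(expected, ocrOutput)); // prints "True"
    }
}
```

    Running the same check on a few ground-truth lines before and after training gives a rough indication of whether diacritic recognition has improved.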

  • IvanPopov

    Do we understand correctly that you are using FineReader 12? In that case, you can contact the FineReader support team with this question. If you are using the OCR SDK, you should contact the SDK support team instead. You can find their contact information on this page: http://www.abbyy.com/support/contacts/
