Cloud OCR SDK - how to disable dictionary "auto correction"

Hello,

I'm trying to recognize a text field (using http://ocrsdk.com/documentation/apireference/processTextField/ ) which has a simple word written in it in a very clear font:

"TRKA"

However, if language is set to Czech, it gets automatically "corrected" to "TRKÁ".

If the language is set to English, no auto-correction occurs.

How do I disable dictionary-based auto-correction, or at least get the confidence of characters as it was before the dictionary pass, from Cloud OCR?

This feature is very annoying and currently makes it impossible for us to use ABBYY for OCR.

Thank you.

Comments

5 comments

  • Oksana Serdyuk

    Returning a collection of character recognition variants and their confidence values is supported only for full-page recognition with the XML export format. It is not supported in the field-level recognition mode.

    When you use the English language, you do not get the "Á" character in the result, because the English alphabet does not include this letter. You can do the same with the Czech recognition language: for example, you can limit the characters used during recognition with the letterSet parameter.

    Additionally, you could check potential field values against appropriate regular expressions (e.g. a date format for dates, words starting with a capital letter for names, etc.) to accept or reject different variants. To specify a regular expression that defines which words are allowed in the field, please use the regExp parameter of the processTextField method. Please also see the How to Recognize Text Fields article for more details.
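
As an illustration of the regular-expression check suggested above, here is a minimal client-side sketch in Python. The field kinds, the pattern table and the `accept` helper are hypothetical names, not part of the Cloud OCR SDK, and the patterns use Python `re` syntax, which may differ from the syntax the regExp parameter expects.

```python
import re

# Uppercase Czech letters, written as a regex character-class body.
CZECH_UPPER = "A-ZÁÉÍÓÚÝČĎĚŇŘŠŤŮŽ"

# Hypothetical per-field patterns (Python syntax, not ABBYY regExp syntax).
FIELD_PATTERNS = {
    # One or more uppercase words separated by a space or a hyphen.
    "surname": re.compile(rf"[{CZECH_UPPER}]+(?:[ -][{CZECH_UPPER}]+)*"),
    # A date written as DD.MM.YYYY (format only, no calendar validation).
    "date": re.compile(r"\d{2}\.\d{2}\.\d{4}"),
}


def accept(field_kind: str, value: str) -> bool:
    """Return True if the whole recognized value matches its field pattern."""
    return FIELD_PATTERNS[field_kind].fullmatch(value) is not None
```

A check like this only rejects values that cannot occur in the field. Both "TRKA" and "TRKÁ" pass the surname pattern, so it does not by itself tell the two readings apart.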

  • Licho

    Thank you, but the Czech alphabet normally contains characters like "Á", so I cannot restrict it.

    In this case, the scanned image DOES NOT contain it, but the result is auto-corrected by the dictionary to the version with "Á".

    If I use the English language and set the allowed characters to include "Á", the word is still recognized correctly as "TRKA", and other words which really do contain "Á" work correctly too.

    So the BUG is obviously in overly eager dictionary checks, which convert the word "TRKA" to "TRKÁ" for no reason at all.

    To illustrate:

    The field input: http://i.imgur.com/FsyeaRv.png

    OCR with Czech language, result: "TRKÁ" (fail). "Á" has confidence 100, even though it's not there!

    OCR with English language, with "Á" in alphabet, result: "TRKA" (correct)

    The field input: http://i.imgur.com/4cSSlvs.png

    OCR with Czech language, result: "MALEGOVÁ" (correct)

    OCR with English language, with "Á" in alphabet, result: "MALEGOVÁ" (correct)

    So it is obvious that some dictionary-based processing transforms a correctly recognized word into something else. "TRKA" is a surname (not in the dictionary), while "TRKÁ" is a verb that is likely to be in a dictionary.

  • Oksana Serdyuk

    Thank you for the images! I have reproduced the issue. We will analyze the situation, and then I will come back with our comments.

  • Oksana Serdyuk

    Sorry for the delay. Do I understand correctly that you process passports and that the fields are always printed in capital letters? If so, please try the following recognition settings for the processTextField method:

    Language = "Czech";

    TextType = TextType.Normal; //Use the TextType.OcrB for extracting MRZ data

    Letterset = "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÝČĎĚŇŘŠŤŮŽ";

    With these settings, both of your images are recognized accurately.
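
For reference, the settings above might translate into a request URL along these lines. This is only a sketch: the endpoint URL, the query parameter spellings (`language`, `textType`, `letterSet`) and the value `normal` are assumptions based on the names used in this thread and should be checked against the processTextField reference. `build_text_field_url` is a hypothetical helper. Authentication and the upload of the image itself are left out.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the processTextField API reference.
BASE_URL = "http://cloud.ocrsdk.com/processTextField"

# Uppercase Latin letters plus the accented uppercase letters from the settings above.
CZECH_UPPERCASE = "ABCDEFGHIJKLMNOPQRSTUVWXYZÁÉÍÓÚÝČĎĚŇŘŠŤŮŽ"


def build_text_field_url(language="Czech", text_type="normal",
                         letter_set=CZECH_UPPERCASE):
    """Build the request URL with the recognition settings as query parameters."""
    params = {
        "language": language,
        "textType": text_type,    # assumed value name for TextType.Normal
        "letterSet": letter_set,  # non-ASCII letters are percent-encoded as UTF-8
    }
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_text_field_url())
```

Restricting the letter set to uppercase characters keeps "Á" available for names such as "MALEGOVÁ" while excluding every lowercase letter.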

  • Licho

    Thank you!