Regular Expressions in Arabic

Hello,

How can I write regular expressions for the Arabic language, specifically for numbers with a decimal point and a comma, e.g. 1,021.00? (I used the processFields method with the regex in the XML file.) I tried to write it with Unicode code points, but it doesn't seem to work. Here's the regex:

 [U\+0660-U\+0669]+,[U\+0660-U\+0669]+\.[U\+0660-U\+0669]+

When I use this regex, the point and comma are recognized, but for the digits there were only some "<<>><>" characters.

Thanks :) 

 

Comments

12 comments

  • Oksana Serdyuk

    Please try the following regular expression:

    (\d{1,3})((\,\d{3})*)(\.\d{1,2})?

    I would also recommend limiting the characters that can be used during recognition by setting the Digits recognition language and the letterSet element, for example:

    <language>Digits</language>

    <textType>normal</textType>

    <letterSet>0123456789.,</letterSet>
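
    For readers who want to check the pattern above outside the OCR service, here is a minimal Python sketch. The helper name is_amount is hypothetical, and Python's re engine is only a stand-in: the service's own regex dialect may behave differently. Note that in Python the \d class also accepts Arabic-Indic digits unless the re.ASCII flag is set.

```python
import re

# Pattern from the reply above: 1-3 leading digits, optional comma-separated
# groups of three digits, and an optional decimal part of 1-2 digits.
AMOUNT = re.compile(r"(\d{1,3})((\,\d{3})*)(\.\d{1,2})?")

# Same pattern restricted to ASCII digits 0-9 only.
AMOUNT_ASCII = re.compile(AMOUNT.pattern, re.ASCII)


def is_amount(text: str, pattern: re.Pattern = AMOUNT) -> bool:
    """Return True only if the whole string matches the amount pattern."""
    return pattern.fullmatch(text) is not None


# The sample value 1,021.00 written with Arabic-Indic digits (U+0660..U+0669).
ARABIC_SAMPLE = "\u0661,\u0660\u0662\u0661.\u0660\u0660"

print(is_amount("1,021.00"))                   # grouped ASCII amount
print(is_amount("1021.00"))                    # four digits with no comma
print(is_amount(ARABIC_SAMPLE))                # Unicode-aware \d
print(is_amount(ARABIC_SAMPLE, AMOUNT_ASCII))  # ASCII-only \d
```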

  • Oksana Serdyuk

    The Regex101 service should be helpful for checking your regular expressions.

  • Andreea OLaru

    Hey, 

    For the Arabic digits (٠١٢٣٤٥٦٧٨٩) it doesn't work. I replaced the normal digits with the Arabic ones, and no luck either.

  • Oksana Serdyuk

    Then please try the following settings:

    <language>Arabic</language>

    <textType>normal</textType>

    <letterSet>٠١٢٣٤٥٦٧٨٩,.</letterSet>

    <regExp>(\p{N}{1,3})((\,\p{N}{3})*)(\.\p{N}{1,2})?</regExp>
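
    As a rough illustration in Python (an assumption: the service's regex dialect is not Python's, and Python's standard re module has no \p{N}, so an explicit code-point range is used instead), the sketch below writes the Arabic-Indic digit range with real escapes. It also shows that the class from the original question, with "U+0660" typed as literal text, is read as a set of ASCII characters: it contains the range from "0" to "U", which includes "<" and ">", and that may be related to the "<<>><>" output reported above. The names LITERAL_CLASS, AR_DIGIT and AR_AMOUNT are hypothetical.

```python
import re

# The class from the original question. Inside [...] the text "U\+0660-U\+0669"
# is not a code-point range: it is the literals U, +, 0, 6, 9 plus the
# ASCII range "0"-"U" (which covers digits, ":;<=>?@" and the letters A-U).
LITERAL_CLASS = re.compile(r"[U\+0660-U\+0669]")

# Arabic-Indic digits U+0660..U+0669, written with real Unicode escapes.
AR_DIGIT = "[\u0660-\u0669]"

# Same shape as the pattern in the reply: 1-3 digits, optional groups of
# ",ddd", optional ".d" or ".dd", but with the explicit digit range.
AR_AMOUNT = re.compile(
    rf"{AR_DIGIT}{{1,3}}(,{AR_DIGIT}{{3}})*(\.{AR_DIGIT}{{1,2}})?"
)

ARABIC_SAMPLE = "\u0661,\u0660\u0662\u0661.\u0660\u0660"  # 1,021.00

print(bool(LITERAL_CLASS.fullmatch("<")))        # "<" is inside the literal class
print(bool(LITERAL_CLASS.fullmatch("\u0661")))   # an Arabic-Indic digit is not
print(bool(AR_AMOUNT.fullmatch(ARABIC_SAMPLE)))  # explicit range matches the sample
print(bool(AR_AMOUNT.fullmatch("1,021.00")))     # ASCII digits are outside the range
```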

  • Andreea OLaru

    Nope, it doesn't work. Where there is a comma, the OCR reads it as a م in most cases.

  • Andreea OLaru

    Also, could you suggest a solution for the same problem when using ABBYY FineReader 14? I've tried pattern training and also creating a new custom language with regular expressions to work alongside the program's Arabic language.

    The problem is that the program does not distinguish between a point ( . ) and a zero (which in Arabic is ٠).

    Thank you!

  • Oksana Serdyuk

    Could you please share your images for which the issues can be reproduced?

  • Andreea OLaru

     

    This would be the image. All the others use the same template. The only issue is that the output doesn't distinguish between a zero and a point; apart from that, it is very accurate.

  • Andreea OLaru

    The PDF has better quality, though. An example of a number would be this:

     

  • Oksana Serdyuk

    Thank you for this information!

    Regarding the first image, its quality is very poor and it cannot be used for OCR: the resolution is low, the image is blurred, and the text is fuzzy. Even a human cannot read the text in it. Possibly the image quality degraded when it was attached to this post.

    If you manage to improve the quality of the input images in accordance with the Best Practices article, the recognition results might be better. Please try it.

    Concerning the second text fragment, I will test it in Cloud OCR SDK and write to you later. The support specialists for ABBYY desktop products should send you some recommendations about using FR 14 by email.

  • Oksana Serdyuk

    Sorry for the delay. I've reproduced the issue with a point and an Arabic zero using the image fragment. I've created the corresponding reclamation and sent the information to our R&D Department for further investigation. This is really a difficult case, because these characters are very similar, and our OCR technology mixes them up.

    The regular expressions do not help here, because they do not strictly limit the set of characters in the output, i.e. the recognized value may contain characters that are not included in the regular expression. During recognition, all hypotheses for a word are checked against the specified regular expression. If a given recognition variant conforms to the expression, it has a higher probability of being selected as the final recognition output. But if no variant matches the regular expression, the result will not conform to the expression.
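
    The behaviour described above can be pictured with a small Python sketch. This is only a toy model under stated assumptions, not the actual recognition engine: the function pick_hypothesis, the bonus value and the confidence scores are all hypothetical and chosen for illustration. A hypothesis that matches the pattern gets a score boost, and if none matches, the best non-matching one is still returned.

```python
import re

# Arabic-Indic digits U+0660..U+0669 and an amount pattern built from them.
D = "[\u0660-\u0669]"
AMOUNT = re.compile(rf"{D}{{1,3}}(,{D}{{3}})*(\.{D}{{1,2}})?")


def pick_hypothesis(hypotheses, pattern, bonus=0.2):
    """Return the text of the highest-scoring hypothesis.

    hypotheses is a list of (text, score) pairs. A hypothesis that fully
    matches the pattern gets `bonus` added to its score (hypothetical
    weighting); non-matching hypotheses are kept, only not boosted.
    """
    def adjusted(item):
        text, score = item
        return score + (bonus if pattern.fullmatch(text) else 0.0)

    return max(hypotheses, key=adjusted)[0]


# Case 1: one variant conforms to the pattern and wins thanks to the boost.
with_match = [
    ("\u0661\u0660\u0660\u0660\u0665", 0.7),  # point misread as a zero
    ("\u0661\u0660\u0660.\u0665", 0.6),       # conforms to the pattern
]

# Case 2: no variant conforms, so a non-conforming result is returned anyway.
without_match = [
    ("\u0661\u0645\u0660\u0662\u0661", 0.5),  # comma misread as the letter meem
    ("\u0661\u0660\u0660\u0660\u0665", 0.4),
]

print(pick_hypothesis(with_match, AMOUNT))
print(pick_hypothesis(without_match, AMOUNT))
```

    In the second case the selected text does not match the expression at all, which is the situation described in the reply: the expression ranks candidates but does not filter the output.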

     

  • Andreea OLaru

    Thank you very much for the answer and effort Oksana.  I'm looking forward to your solution! :) 
