Regular Expressions in Arabic

Written by Permanently deleted user

September 20, 2017 09:36

Hello,

How can I write regular expressions for Arabic, specifically for numbers containing a point and a comma, e.g. 1,021.00? (I used the processFields method with the regex in the XML file.) I tried to write it with Unicode code points, but it doesn't seem to work. Here's the regex:

[U\+0660-U\+0669]+,[U\+0660-U\+0669]+\.[U\+0660-U\+0669]+

When I use this regex, the point and comma are recognized, but in place of the digits there are only characters like "<<>><>".

Thanks :)
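
A likely explanation for the "<<>><>" output, sketched here with Python's `re` module (chosen only for illustration; the OCR service's own regex dialect may behave differently): inside a character class, `U\+0660` is not a code-point escape. It is read as the literal characters `U`, `+`, `0`, `6`, `6`, and the `0-U` in the middle becomes a range from ASCII `0` to ASCII `U`. That range contains `<` and `>` but no Arabic-Indic digits.

```python
import re

# The character class from the question, copied verbatim.
broken = re.compile(r"[U\+0660-U\+0669]")

# Explicit escapes for ARABIC-INDIC DIGIT ZERO..NINE (U+0660..U+0669).
arabic_digit = re.compile("[\u0660-\u0669]")

arabic_three = "\u0663"  # the Arabic-Indic digit three

# "<" (0x3C) and ">" (0x3E) both fall inside the accidental range "0".."U".
angle_matches = [bool(broken.fullmatch(ch)) for ch in "<>"]
broken_matches_digit = bool(broken.fullmatch(arabic_three))
fixed_matches_digit = bool(arabic_digit.fullmatch(arabic_three))

print(angle_matches, broken_matches_digit, fixed_matches_digit)
```

In this sketch the verbatim class accepts the angle brackets and rejects the Arabic-Indic digit, while the explicit `\u0660-\u0669` range does the opposite.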

Comments

12 comments

Permanently deleted user

September 21, 2017 13:19
Please try the following regular expression:

(\d{1,3})((\,\d{3})*)(\.\d{1,2})?

I would also recommend limiting the characters used during recognition by means of the Digits recognition language and the letterSet element, for example:

<language>Digits</language>

<textType>normal</textType>

<letterSet>0123456789.,</letterSet>
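
As a quick local check of what the suggested pattern accepts, here is a minimal sketch using Python's `re` module. This is an assumption for illustration only: the service may interpret the expression differently, and `is_amount` is a hypothetical helper name.

```python
import re

# Pattern from the reply above, unchanged.
amount = re.compile(r"(\d{1,3})((\,\d{3})*)(\.\d{1,2})?")

def is_amount(text):
    """Return True if the whole string matches the pattern."""
    return amount.fullmatch(text) is not None

samples = ["1,021.00", "12", "1021.00", "1,02.00"]
results = {s: is_amount(s) for s in samples}
print(results)
```

With whole-string matching, "1,021.00" and "12" pass, while "1021.00" (no thousands separator) and "1,02.00" (a two-digit group) are rejected.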

Permanently deleted user

September 21, 2017 13:32
The Regex101 service should be helpful for checking your regular expressions.

Permanently deleted user

September 22, 2017 09:20
Hey,

For the Arabic digits ( ٠١٢٣٤٥٦٧٨٩ ) it doesn't work. I replaced the normal digits with the Arabic ones and had no luck either.

Permanently deleted user

September 22, 2017 10:11
Then please try the following settings:

<language>Arabic</language>

<textType>normal</textType>

<letterSet>٠١٢٣٤٥٦٧٨٩,.</letterSet>

<regExp>(\p{N}{1,3})((\,\p{N}{3})*)(\.\p{N}{1,2})?</regExp>
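
To try the same pattern shape locally against Arabic-Indic digits, note that Python's built-in `re` module does not support `\p{N}`, so the sketch below swaps in the explicit range U+0660..U+0669. That range is narrower than `\p{N}`, which covers every Unicode number character. The names `AD`, `arabic_amount` and `to_ascii` are illustrative, and nothing here reflects how the OCR service itself evaluates the expression.

```python
import re

AD = "[\u0660-\u0669]"  # Arabic-Indic digits zero..nine

# Same shape as the suggested expression, with \p{N} replaced by AD.
arabic_amount = re.compile(rf"({AD}{{1,3}})((,{AD}{{3}})*)(\.{AD}{{1,2}})?")

# 1,021.00 written with Arabic-Indic digits (escapes avoid bidi rendering issues).
sample = "\u0661,\u0660\u0662\u0661.\u0660\u0660"

# Translation table mapping each Arabic-Indic digit to its ASCII counterpart.
to_ascii = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})

matches_arabic = arabic_amount.fullmatch(sample) is not None
matches_ascii = arabic_amount.fullmatch("1,021.00") is not None
normalized = sample.translate(to_ascii)

print(matches_arabic, matches_ascii, normalized)
```

The translation table is one way to turn a recognized Arabic-Indic amount into ASCII digits for downstream parsing.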

Permanently deleted user

October 02, 2017 15:13
Nope, that doesn't work. Where there is a comma, the OCR reads it as a م in most cases.

Permanently deleted user

October 02, 2017 15:32
Also, could you suggest a solution for the same problem using ABBYY FineReader 14? I've tried pattern training and also creating a new custom language with regular expressions to work alongside the program's Arabic language.

The problem is that the program does not distinguish between a point ( . ) and a zero (which in Arabic is ٠ ).

Thank you!
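
One possible workaround outside the OCR engine, as a rough sketch only: if every amount on the form is known to have exactly two decimal places (an assumption, not something confirmed in this thread), the point/zero confusion can be repaired after recognition by position. The helper name `repair_amount` is hypothetical, and Python is used purely for illustration.

```python
import re

ARABIC_ZERO = "\u0660"
DIGIT = "[\u0660-\u0669]"

# Assumed layout: groups of three digits, a point, then exactly two decimals.
AMOUNT = re.compile(rf"{DIGIT}{{1,3}}(?:,{DIGIT}{{3}})*\.{DIGIT}{{2}}")

def repair_amount(text):
    """Treat every '.' and Arabic zero as ambiguous, then put the point back
    at the third position from the end. Return None if the result still
    does not look like an amount."""
    if len(text) < 4:
        return None
    chars = list(text.replace(".", ARABIC_ZERO))
    if chars[-3] != ARABIC_ZERO:
        return None  # the expected point position holds some other character
    chars[-3] = "."
    fixed = "".join(chars)
    return fixed if AMOUNT.fullmatch(fixed) else None

expected = "\u0661,\u0660\u0662\u0661.\u0660\u0660"      # 1,021.00
point_as_zero = "\u0661,\u0660\u0662\u0661\u0660\u0660\u0660"  # point read as zero
zero_as_point = "\u0661,.\u0662\u0661.\u0660\u0660"       # zero read as point

print(repair_amount(point_as_zero) == expected)
print(repair_amount(zero_as_point) == expected)
```

Strings that cannot be repaired this way come back as `None`, so they can be routed to manual review instead of being silently changed.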

Permanently deleted user

October 03, 2017 06:36
Could you please share images on which the issue can be reproduced?

Permanently deleted user

October 03, 2017 07:40
This would be the image. All the others have the same template. The only issue is that the output doesn't distinguish between 0 and a point; apart from this, it is very accurate.

Permanently deleted user

October 03, 2017 07:41
The PDF has better quality, though. An example of a number would be this:

Permanently deleted user

October 03, 2017 12:42
Thank you for this information!

Regarding the first image, its quality is very poor and it cannot be used for OCR. The resolution is low, the image is blurred, and the text is fuzzy. Even a human cannot read the text in it. Possibly the quality degraded when the image was attached to this post.

If you manage to improve the quality of the input images in accordance with the Best Practices article, the recognition results might be better. Please try it.

Concerning the second text fragment, I will test it in Cloud OCR SDK and write to you later. The support specialists for ABBYY desktop products should send you some recommendations about using FR 14 by email.

Permanently deleted user

October 12, 2017 12:37
Sorry for the delay. I've reproduced the issue with a point and an Arabic zero using the image fragment. I've filed a corresponding issue report and sent the information to our R&D Department for further investigation. This is a really difficult case, because these characters are very similar and our OCR technology mixes them up.

The regular expressions do not help, because they do not strictly limit the set of characters in the output, i.e. the recognized value may contain characters which are not included in the regular expression. During recognition, all hypotheses for a word are checked against the specified regular expression. If a given recognition variant conforms to the expression, it has a higher probability of being selected as the final recognition output. But if no variant matches the regular expression, the result will not conform to the expression.
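
The behaviour described above can be pictured with a toy model. This is not ABBYY's actual algorithm: the scores, the `bonus` value and the function name `pick_result` are all invented for illustration, and Python is used only as a convenient notation.

```python
import re

DIGIT = "[\u0660-\u0669]"  # Arabic-Indic digits
AMOUNT = re.compile(rf"({DIGIT}{{1,3}})((,{DIGIT}{{3}})*)(\.{DIGIT}{{1,2}})?")

def pick_result(hypotheses, pattern, bonus=0.2):
    """Pick the best (text, score) hypothesis after adding a bonus to every
    hypothesis whose text fully matches the pattern."""
    def adjusted(hypothesis):
        text, score = hypothesis
        return score + (bonus if pattern.fullmatch(text) else 0.0)
    return max(hypotheses, key=adjusted)[0]

good = "\u0661,\u0660\u0662\u0661.\u0660\u0660"        # conforms to the pattern
bad = "\u0661,\u0660\u0662\u0661\u0660\u0660\u0660"    # point read as zero
meem = "\u0661\u0645\u0660\u0662\u0661.\u0660\u0660"   # comma read as the letter meem

# Case 1: a conforming variant exists, so the bonus lets it win.
case1 = pick_result([(bad, 0.55), (good, 0.50)], AMOUNT)

# Case 2: no variant conforms, so the output does not match the pattern.
case2 = pick_result([(meem, 0.60), (bad, 0.40)], AMOUNT)

print(case1 == good, AMOUNT.fullmatch(case2) is None)
```

The second case mirrors the situation in this thread: the expression can only promote a variant that the recognizer actually produced, so when none of them conforms, the non-conforming text comes through unchanged.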

Permanently deleted user

October 12, 2017 13:26
Thank you very much for the answer and the effort, Oksana. I'm looking forward to your solution! :)
