I am processing mobile photos.
I have noticed that the "confidence" attribute is not provided for chars when I use processImage. Is this only provided for processing text fields?
Also, I usually get the best results for processImage with a profile of "documentConversion" -- this usually includes correct text, and skips incorrect text. When I switch to a "textExtraction" profile I expect better text, but instead it just adds a lot of noise. Is this unexpected?
Comments
The only export format that carries confidence information for processImage is XML. You need to parse the XML; uncertain characters carry a suspicious="1" attribute.
E.g.:
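A minimal sketch of that parsing in Python, run against a made-up sample rather than real service output. The element name charParams and the sample document structure are assumptions, not something confirmed in this thread; only the suspicious="1" attribute comes from the answer above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The "charParams" element name and the overall
# structure are assumptions; check them against your own XML output.
SAMPLE_XML = """<document>
  <page>
    <charParams>H</charParams>
    <charParams suspicious="1">e</charParams>
    <charParams>y</charParams>
  </page>
</document>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}charParams'."""
    return tag.rsplit("}", 1)[-1]


def suspicious_chars(xml_text):
    """Return (index, character) pairs for characters flagged as uncertain."""
    root = ET.fromstring(xml_text)
    flagged = []
    index = 0
    for elem in root.iter():
        if local_name(elem.tag) != "charParams":
            continue
        # The flag is either present or absent; accept "true" as well as "1".
        if elem.get("suspicious") in ("1", "true"):
            flagged.append((index, elem.text or ""))
        index += 1
    return flagged


if __name__ == "__main__":
    print(suspicious_chars(SAMPLE_XML))
```

Matching on the local tag name keeps the sketch working whether or not the real output declares an XML namespace.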
The "textExtraction" profile is optimized to extract as much text from the document as possible. The recognized text is intended for search scenarios, e.g. when you add an image to a full-text search database so that the document can later be found by typing one or more words from it. Extra noise is normal with this profile, because noise is not considered very harmful in that scenario.
The "documentConversion" profile is optimized for text reuse. It allows reconstruction of page layout, formatting, and other page elements. That is why it is the default processing profile.
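To compare the two profiles on the same photo, a sketch like the following builds the request URLs. The host name and the exportFormat parameter name are assumptions and should be checked against the API documentation; only processImage, the two profile names, and the XML format are mentioned in this thread.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify the host and path against the API documentation.
BASE_URL = "https://cloud.ocrsdk.com/processImage"


def process_image_url(profile, export_format="xml"):
    """Build a processImage URL for the given profile.

    "exportFormat" is an assumed parameter name; "xml" is requested because
    it is the only format said to carry the suspicious flag.
    """
    query = urlencode({"profile": profile, "exportFormat": export_format})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    for profile in ("documentConversion", "textExtraction"):
        print(process_image_url(profile))
```

Requesting XML for both profiles makes it possible to count suspicious characters in each result and see how much of the extra text from "textExtraction" is flagged as uncertain.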
Thanks for your answer, that is helpful. Regarding confidence, I am wondering about the difference between "suspicious" and "confidence." In your example here you provide confidence as a number between 1 and 100:
http://ocrsdk.com/documentation/quick-start/text-fields/
However, suspicious seems to be 1 or not-present. What is the reason for the difference?
"Suspicious" is a bit flag: it is either present or not. If it is present, the recognition engine is not sure that the character was recognized correctly.
"Confidence" is an integer from 1 to 100. It represents how closely the recognized character resembles what the recognizer expects it to look like.
The "confidence" attribute is quite confusing, so we plan to replace it with "suspicious" in all text-field processing.
How feasible is it to annotate PDF output with confidence metrics? For example, by producing both XML and PDF, may one reasonably extract low confidence ranges from XML and figure out where this attribute should be inserted into PDF? Do I assume correctly that XML tells you on just what page text appears (not where on page)...or does layout analysis break down page into text blocks so recognition confidence issues will be associated with a text block? Thanks for any help.