Measurement of OCR accuracy and facts we should know
-
Optical Character Recognition (OCR) is executed in multiple steps, and each of them influences the accuracy level that is achieved at the end of the process.
-
Image quality does matter: images with higher quality are easier to process.
Accuracy on a Character Level
-
OCR technology providers typically measure the accuracy of the optical character recognition results on a 'character level'.
Examples:
- 99% accuracy means that 1 out of 100 characters, or 10 out of 1000 characters, are recognized as “uncertain”
- 99.9% accuracy means that 1 out of 1000 characters is recognized as “uncertain”
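To make these percentages more tangible, the following minimal sketch converts a character-level accuracy into the expected number of uncertain characters. The page size of 2000 characters and the function name are assumptions chosen for illustration only.

```python
def expected_uncertain(char_count: int, accuracy_percent: float) -> float:
    """Expected number of uncertain characters at a given character-level accuracy."""
    return char_count * (100.0 - accuracy_percent) / 100.0


# Assumed page size of 2000 characters (illustrative, not a measured value).
for accuracy in (99.0, 99.9):
    uncertain = round(expected_uncertain(2000, accuracy), 1)
    print(f"{accuracy}% accuracy: about {uncertain} uncertain characters per page")
```

Even a small change in the accuracy figure therefore changes the review effort per page by a large factor.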
-
A character that was recognized as “uncertain” can still be correctly recognized. “Uncertain” means that the core OCR technology was not able to make a final decision, even after applying all built-in classifiers, AI technologies and internal voting algorithms.
-
In a real scenario, a person who manually reviews the results might be the final authority to decide what is right or wrong.
-
OCR scientists, as well as ABBYY, develop, change and optimize the recognition technology regularly. An increase or a decrease in the resulting recognition accuracy can only be measured against a test set of document images where the text is known and contains no mistakes.
-
Under real conditions, the absolute accuracy can be a challenge to calculate, because it is sometimes difficult to guarantee that the test batch is 100% free of mistakes.
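When a verified ground-truth text is available, one common way to express character-level accuracy is through the edit distance between the ground truth and the OCR output. The sketch below is an assumption about how such a measurement could be set up. It counts actual errors rather than “uncertain” flags, and it is not the method used inside any ABBYY product.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters not affected by an edit, clamped at 0."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


print(character_accuracy("recognition", "recognitlon"))
```

The result is only as trustworthy as the ground truth: a mistake in the reference text is counted as an OCR error.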
OCR Accuracy on a Word Level
-
Instead of measuring the character recognition quality (recognition on the 'character level'), it is also possible to measure the accuracy on a word level.
-
This approach is often used in environments where the right words need to be found, for example when searching for a name in a book or in registration documents.
-
'Word level accuracy' is not easy to measure, as several aspects must be considered:
- Relevant words like person names, city names, etc.
- “Non-relevant” words like “the”, “and”, etc.
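One possible way to account for the difference between relevant and non-relevant words is to ignore a list of stop words before comparing the OCR output with the ground truth. The following sketch assumes a tiny, illustrative stop-word list and a simple bag-of-words comparison that ignores word order. The function names are hypothetical.

```python
import re
from collections import Counter

# Illustrative stop words only; a real list would be language specific and much longer.
STOP_WORDS = {"the", "and", "a", "of", "in"}


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def relevant_word_recall(ground_truth: str, ocr_text: str, stop_words=STOP_WORDS) -> float:
    """Share of relevant ground-truth words that also occur in the OCR text."""
    truth = Counter(w for w in tokenize(ground_truth) if w not in stop_words)
    found = Counter(w for w in tokenize(ocr_text) if w not in stop_words)
    total = sum(truth.values())
    if total == 0:
        raise ValueError("ground truth contains no relevant words")
    matched = sum((truth & found).values())  # multiset intersection
    return matched / total


truth = "The mayor of Springfield and the clerk met in Berlin"
ocr = "The mayor of Sprlngfield and the clerk met in Berlin"
print(relevant_word_recall(truth, ocr))
```

In this example one of the five relevant words is misrecognized, while the errors that matter least for search (in stop words) would not affect the score at all.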
-
It also has to be considered that:
- Most of the time there is no ground truth data on a word level in full-text OCR scenarios. Only in data extraction scenarios (like forms processing) is it much easier to work with word lists and database lookups.
- “Simple” search algorithms that only find exact matches are not practical enough to deliver proper search results, regardless of whether the document set contains OCR uncertainty or not. It is better to use a more intelligent, “fuzzy” search technology.
- In OCR scenarios for historic materials, the challenge is that there is no unified grammar and spelling.
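The difference between exact and fuzzy search can be illustrated with Python's standard `difflib` module. This is a minimal sketch, assuming a similarity cutoff of 0.8 and a made-up sample text. A production search engine would use its own fuzzy matching technology.

```python
import difflib
import re


def exact_find(query: str, text: str) -> list[str]:
    """Return the distinct words that match the query exactly (case-insensitive)."""
    words = set(re.findall(r"\w+", text.lower()))
    return [query.lower()] if query.lower() in words else []


def fuzzy_find(query: str, text: str, cutoff: float = 0.8) -> list[str]:
    """Return the distinct words that are similar to the query, best match first."""
    words = sorted(set(re.findall(r"\w+", text.lower())))
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)


# Made-up sample: the second occurrence contains a typical O/0 confusion.
sample = "Registered: Johansson, Erik. Witness: J0hansson, Anna."
print(exact_find("Johansson", sample))
print(fuzzy_find("Johansson", sample))
```

The exact search finds only the correctly recognized spelling, whereas the fuzzy search also returns the variant with the confused character.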
Ways to improve the OCR Accuracy
There are different ways to influence the accuracy of the recognition process within the ABBYY OCR SDKs:
-
Image quality
-
Images for OCR have to fulfill a certain quality level. In a nutshell:
- 300 DPI (more: OCR - Optimal Image Resolution)
- grey-scale images are better than black and white
- color images can improve OCR, but color is mostly needed because the exported documents, like searchable PDFs, should be in color
- the images of the documents should be sharp, flat and properly oriented, so that the text lines are straight
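Whether a scan meets the 300 DPI guideline can be estimated from its pixel dimensions when the physical page size is known. The sketch below assumes an A4 page (210 mm wide). The function names are illustrative and not part of any ABBYY SDK.

```python
MM_PER_INCH = 25.4


def effective_dpi(pixels: int, length_mm: float) -> float:
    """Resolution in dots per inch along one side of the page."""
    return pixels * MM_PER_INCH / length_mm


def meets_minimum(width_px: int, page_width_mm: float, minimum_dpi: int = 300) -> bool:
    # Round to absorb the fraction of a pixel lost when the scanner rounds the image width.
    return round(effective_dpi(width_px, page_width_mm)) >= minimum_dpi


# Assumed A4 page width of 210 mm.
print(meets_minimum(2480, 210.0))
print(meets_minimum(1240, 210.0))
```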
-
-
Image pre-processing
-
It is important to prepare the images for the OCR process; otherwise the results will fall well short of what is achievable.
-
-
Layout Analysis
-
Before the characters can be recognized, the zones for OCR (regions of interest) must be detected or defined. This process is very easy for a human, but a tough job for algorithms. If a text zone on a page is missed, it will not be OCRed. When you measure “OCR accuracy”, keep in mind that “lost” text cannot be wrong, yet losing text might be much worse than having the full text with a few more uncertain characters.
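The effect of a missed zone on the measured accuracy can be shown with a small worked example. The sketch assumes that ground truth is available per zone and marks a missed zone as `None`. The sample texts and function names are made up for illustration.

```python
from difflib import SequenceMatcher


def matched_chars(truth: str, ocr: str) -> int:
    """Number of characters found in matching blocks between truth and OCR output."""
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def accuracy_on_detected(truth_zones, ocr_zones) -> float:
    """Accuracy measured only over zones that the layout analysis found."""
    pairs = [(t, o) for t, o in zip(truth_zones, ocr_zones) if o is not None]
    total = sum(len(t) for t, _ in pairs)
    if total == 0:
        raise ValueError("no zone was detected")
    return sum(matched_chars(t, o) for t, o in pairs) / total


def accuracy_on_page(truth_zones, ocr_zones) -> float:
    """Accuracy measured over the whole page, so missed zones count as lost text."""
    total = sum(len(t) for t in truth_zones)
    if total == 0:
        raise ValueError("ground truth must not be empty")
    matched = sum(matched_chars(t, o) for t, o in zip(truth_zones, ocr_zones) if o is not None)
    return matched / total


truth_zones = ["Invoice 2023", "Total: 100 EUR"]
ocr_zones = ["Invoice 2023", None]  # the second zone was missed by layout analysis
print(accuracy_on_detected(truth_zones, ocr_zones))
print(accuracy_on_page(truth_zones, ocr_zones))
```

Measured over the detected zone only, the result looks perfect, although more than half of the page text is missing.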
-
-
Character Recognition
-
This is an internal topic that only ABBYY can work on.
-
-
Language & Character Settings
-
Knowing what languages and characters are used in the document helps to increase the accuracy rate. More on this topic: OCR Recognition Languages
-
-
Use of word lists
-
The ABBYY SDKs provide an API for using custom word lists, but in broad, mass OCR conversion the use of dictionaries delivers better results.
-
-
Use of (Morphology) Dictionaries
-
ABBYY SDKs allow working with the dictionaries that are included in the SDK, but it is also possible to create new custom dictionaries.
-
-
Verification
-
Most of the time verification involves human interaction:
- Image quality can be checked during the scan process.
- The results of the automated layout analysis can be verified before the text recognition is performed. This is recommended when documents are to be transformed into editable office formats or e-books.
- The OCR results can be checked and corrected before the final document export takes place.
-
-
Post correction
-
The ABBYY XML output gives “low level” access to the OCR results. They can be parsed, changed and then transformed into other formats. More details: ABBYY XML Export
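As an illustration of such post-processing, the following sketch lists the characters that are flagged as uncertain in an XML result. It works on a simplified, namespace-free sample. The element name `charParams` and the attribute `suspicious` are assumptions here and should be checked against the schema of the actual export.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; a real export contains more attributes
# and normally declares an XML namespace.
SAMPLE = """<document>
  <line>
    <charParams>O</charParams>
    <charParams suspicious="true">C</charParams>
    <charParams>R</charParams>
  </line>
</document>"""


def suspicious_characters(xml_text: str) -> list[tuple[int, str]]:
    """Return (position, character) pairs for characters flagged as suspicious."""
    root = ET.fromstring(xml_text)
    flagged = []
    for index, element in enumerate(root.iter("charParams")):
        # Accept both spellings of a true boolean value (assumption).
        if element.get("suspicious", "false").lower() in {"1", "true"}:
            flagged.append((index, element.text or ""))
    return flagged


print(suspicious_characters(SAMPLE))
```

If the export declares a namespace, the tag passed to `iter` has to be namespace-qualified, for example in the form `{namespace-uri}charParams`.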