OCR Accuracy and its Measurement

Measurement of OCR accuracy and facts we should know

  • Optical Character Recognition (OCR) is executed in multiple steps, and every one of them influences the accuracy level that is achieved at the end of the process.

  • Image quality does matter. Images of higher quality are easier to process.

Accuracy on a Character Level

  • OCR technology providers typically measure the accuracy of the optical character recognition results on a 'character level'.
Examples:
  • 99% accuracy means that
      • 1 out of 100 characters
        or
      • 10 out of 1000 characters are recognized as “uncertain”
  • 99.9% accuracy means that
      • 1 out of 1000 characters is recognized as “uncertain”
  • A character that was recognized as “uncertain” can still be correctly recognized. “Uncertain” means that the core OCR technology was not able to make a final decision, even after applying all built-in classifiers, AI technologies and internal voting algorithms.
  • In a real scenario, a person who manually reviews the results might be the final authority that decides what is wrong or right.
  • OCR scientists as well as ABBYY develop, change and optimize the recognition technology regularly. An increase or a decrease in the resulting recognition accuracy can only be measured against a test set of document images for which the text is known and contains no mistakes.
  • Under real conditions, the absolute measurement might be a challenge to calculate, as it is sometimes difficult to guarantee that the testing batch is 100% free of mistakes.
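
The measurement against a test set with known text can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used inside the ABBYY SDKs: it counts actual character errors with the Levenshtein edit distance, which is a different quantity from the number of characters the engine flags as “uncertain”. The function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, ocr_text: str) -> float:
    """Share of ground-truth characters that were not affected by an error.

    Assumption: accuracy = 1 - (edit distance / ground-truth length),
    clamped at 0 for very noisy output.
    """
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_text)
    return max(0.0, 1.0 - errors / len(ground_truth))


if __name__ == "__main__":
    truth = "Optical Character Recognition"
    recognized = "Optical Character Rec0gnition"  # one substituted character
    print(f"{character_accuracy(truth, recognized):.4f}")
```

The result is only as reliable as the ground truth: a mistake in the reference text is counted as an OCR error.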

OCR Accuracy on a Word Level

  • Instead of measuring the character recognition quality ('recognition on the character level'), it is also possible to measure the accuracy on a word level.
  • This approach is often used in environments where the right words have to be found, for example when searching for a name in a book or in registration documents.
  • 'Word level accuracy' is not easy to measure as several aspects must be considered:
    • Relevant words like: person names, city names, etc.
    • Non-relevant words like: “the”, “and”, etc.
  • Also it has to be considered that
    • Most of the time there is no word-level ground truth data for full-text OCR scenarios.
      Only in data extraction scenarios (like forms processing) is it much easier to work with word lists and database lookups.
    • “Simple” search algorithms that only find exact matches are not practical enough to deliver a proper search result, no matter whether the document set contains OCR uncertainty or not. It is better to use a more intelligent, “fuzzy” search technology.
    • In OCR scenarios for historic materials the challenge is that there is no unified grammar and spelling.
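
The two ideas above, counting only relevant words and searching with a “fuzzy” match, can be illustrated with the Python standard library. This is a rough sketch under stated assumptions: the stop-word list is a tiny hypothetical example, the accuracy is a simple bag-of-words measure that ignores word order, and `difflib` stands in for whatever fuzzy search technology is actually used.

```python
import difflib
import re
from collections import Counter

# Hypothetical, deliberately tiny list of non-relevant words.
STOP_WORDS = frozenset({"the", "and", "a", "of", "in"})


def relevant_words(text, stop_words=STOP_WORDS):
    """Lower-cased words of the text without the non-relevant ones."""
    words = re.findall(r"\w+", text.lower())
    return [word for word in words if word not in stop_words]


def word_accuracy(ground_truth, ocr_text, stop_words=STOP_WORDS):
    """Share of relevant ground-truth words that also occur in the OCR text.

    Assumption: words are compared as a multiset, so word order is ignored.
    """
    expected = Counter(relevant_words(ground_truth, stop_words))
    found = Counter(relevant_words(ocr_text, stop_words))
    total = sum(expected.values())
    if total == 0:
        raise ValueError("ground truth contains no relevant words")
    matched = sum((expected & found).values())  # multiset intersection
    return matched / total


def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Words in the OCR text that are similar to the query, best match first."""
    candidates = sorted(set(re.findall(r"\w+", ocr_text.lower())))
    return difflib.get_close_matches(query.lower(), candidates, n=5, cutoff=cutoff)


if __name__ == "__main__":
    truth = "The mayor of Berlin and Hamburg"
    recognized = "The mayor of Ber1in and Hamburg"  # 'l' misread as '1'
    print(word_accuracy(truth, recognized))
    print("berlin" in recognized.lower())  # what an exact search would see
    print(fuzzy_find("Berlin", recognized))
```

The `cutoff` value controls how tolerant the search is: lowering it finds more misrecognized words but also more false hits.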

Ways to improve the OCR Accuracy

There are different ways to influence the accuracy of the recognition process within the ABBYY OCR SDKs:

  • Image quality
    • Images for OCR have to fulfill a certain quality level. In a nutshell:
      • 300 DPI – more: OCR - Optimal Image Resolution
      • grey-scale images are better than black and white
      • color images can improve OCR, but color is mostly needed because the export documents, like searchable PDFs, should be in color
      • the images of the documents should be sharp, flat and properly oriented, so that the text lines are straight
  • Image pre-processing
    • It is important to prepare the images for the OCR process, otherwise the results will fall well short of what is achievable.
  • Layout Analysis
    • Before the characters can be recognized, the zones for OCR (regions of interest) have to be detected or defined. This process is very easy for a human, but a tough job for algorithms. If a text zone on a page is missed, it will not be OCRed. When “OCR accuracy” is measured, it also has to be considered that “lost” text cannot be counted as wrong, although losing text might be much worse than having the full text with a few more uncertain characters.
  • Language & Character Settings
    • Knowing what languages and characters are used in the document helps to increase the accuracy rate. More on this topic: OCR Recognition Languages
  • Use of word lists
    • The ABBYY SDKs provide an API to use custom word lists, but in broad, mass OCR conversion, the use of dictionaries delivers better results.
  • Use of (Morphology) Dictionaries
    • ABBYY SDKs allow you to work with dictionaries that are included in the SDK, and it is also possible to create new custom dictionaries.
  • Verification
    • Most of the time verification involves human interaction
      • Image quality can be checked during the scan process
      • The results of the automated layout analysis can be verified before the text recognition is performed. This is recommended when documents are to be transformed into editable office formats or e-books.
      • The OCR results can be checked and corrected before the final document export takes place
  • Post correction
    • The ABBYY XML output gives “low level” access to the OCR results. They can be parsed, changed and then also be transformed into other formats - more details: ABBYY XML Export
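
As an illustration of the post-correction step, the following Python sketch parses an XML result and lists the characters flagged as uncertain, so that they can be reviewed or corrected. It is written under assumptions that should be checked against the XML schema of the SDK version in use: the sample document is heavily simplified, and the `charParams` element name and `suspicious` attribute are assumed rather than taken from this article.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical sample; a real export contains pages, blocks,
# coordinates and more attributes, and usually an XML namespace.
SAMPLE_XML = """<document>
  <page>
    <line>
      <charParams>B</charParams>
      <charParams>e</charParams>
      <charParams>r</charParams>
      <charParams suspicious="true">1</charParams>
      <charParams>i</charParams>
      <charParams>n</charParams>
    </line>
  </page>
</document>"""


def local_name(tag):
    """Element name without an optional '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def character_elements(root):
    """All character elements in document order (assumed name: charParams)."""
    return [el for el in root.iter() if local_name(el.tag) == "charParams"]


def recognized_text(xml_text):
    """Concatenation of all recognized characters."""
    root = ET.fromstring(xml_text)
    return "".join(el.text or "" for el in character_elements(root))


def suspicious_characters(xml_text):
    """(index, character) pairs for characters flagged as uncertain."""
    root = ET.fromstring(xml_text)
    return [
        (index, el.text or "")
        for index, el in enumerate(character_elements(root))
        if el.get("suspicious") in ("true", "1")
    ]


if __name__ == "__main__":
    print(recognized_text(SAMPLE_XML))
    print(suspicious_characters(SAMPLE_XML))
```

A correction tool could present only the flagged positions to a human reviewer, or compare the surrounding word against a word list before the final export.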
