
Some parts of a specific PDF are not OCR-ed by ABBYY FineReader Engine Answered

Written by Permanently deleted user

September 08, 2017 12:31

We received a PDF from a supplier that needs to be OCR-ed: `esco_original.pdf`

After processing it in Java through the `com.abbyy.FREngine` API, not all parts of the PDF are OCR-ed.

See: esco_abbyy.pdf

FYI, we use `LoadPredefinedProfile("DocumentConversion_Accuracy")`.

When I manually print the PDF, scan it, and process it through the API again, it is fully OCR-ed.

See: esco_rescanned.pdf and the fully-OCR-ed variant: esco_rescanned_abbyy.pdf

The question is: why is the original PDF not fully OCR-ed by ABBYY?


Comments

3 comments

Nikolay Krivchanskiy

September 20, 2017 15:51
Hi Koen,

There are a number of ways to improve recognition quality in FineReader Engine. For example, we achieved much better results on this document by setting the options ObjectsExtractionParams::EnableAggressiveTextExtraction and ObjectsExtractionParams::DetectTextOnPictures to true.

Also, you should manually set the recognition language or languages of the document you are recognizing. You can do this with RecognizerParams::SetPredefinedTextLanguage.
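As an illustration of the language setting, here is a minimal Java sketch that builds the language string for a document with more than one language. It assumes that SetPredefinedTextLanguage accepts several predefined language names separated by commas; please check that against the API Reference for your Engine version. The `LanguageList` helper and the language names used are hypothetical examples, not part of the Engine API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of com.abbyy.FREngine): builds the string
// that would be passed to RecognizerParams::SetPredefinedTextLanguage.
// Assumption: the Engine accepts comma-separated predefined language names.
final class LanguageList {
    private LanguageList() {}

    // Trims each name, skips null/blank entries and duplicates,
    // and joins the rest with commas in their original order.
    static String join(String... names) {
        List<String> cleaned = new ArrayList<>();
        for (String name : names) {
            if (name == null) {
                continue;
            }
            String trimmed = name.trim();
            if (!trimmed.isEmpty() && !cleaned.contains(trimmed)) {
                cleaned.add(trimmed);
            }
        }
        if (cleaned.isEmpty()) {
            throw new IllegalArgumentException("at least one language name is required");
        }
        return String.join(",", cleaned);
    }

    public static void main(String[] args) {
        // Example values only; use the languages that occur in your document.
        System.out.println(join("English", " Dutch ", "English"));
    }
}
```

With the Java wrapper, the resulting string would then presumably be passed along the lines of `dpp.getPageProcessingParams().getRecognizerParams().SetPredefinedTextLanguage(LanguageList.join("English", "Dutch"))`; that exact call chain is an assumption here and should be verified against the wrapper you have installed.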

For more information about object extraction options, please refer to Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → ObjectsExtractionParams.

Permanently deleted user

September 20, 2017 23:13
Hi Nikolay

Thank you for your reply.

Just to make sure: my issue is not about the text extraction itself, but about the way the PDF is OCR-ed.

The PDF returned by the PDF export module of ABBYY FineReader Engine is then processed by our own software, and it is that exported PDF that is not fully OCR-ed.

Permanently deleted user

September 25, 2017 12:39
Hi Nikolay

I eventually figured it out with the help of your reply.

This actually worked for me:

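            // "engine" is the already-initialized FineReader Engine instance
            IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();
            // Detect and correct the page orientation before analysis
            dpp.getPageProcessingParams().getPagePreprocessingParams().setCorrectOrientation(true);
            // The two options suggested above: aggressive text extraction and text detection on pictures
            dpp.getPageProcessingParams().getObjectsExtractionParams().setEnableAggressiveTextExtraction(true);
            dpp.getPageProcessingParams().getObjectsExtractionParams().setDetectTextOnPictures(true);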

Thanks!

