
Some parts of a specific PDF are not OCR-ed by ABBYY FineReader Engine

We received a PDF from a supplier that needs to be OCR-ed: `esco_original.pdf`

After processing it in Java through the `com.abbyy.FREngine` API, not all parts of the PDF are OCR-ed.

See: `esco_abbyy.pdf`

FYI, we use `LoadPredefinedProfile("DocumentConversion_Accuracy")`.

When I manually print the PDF, scan it, and process it through the API again, it is fully OCR-ed.

See: `esco_rescanned.pdf` and the fully OCR-ed variant: `esco_rescanned_abbyy.pdf`

The question is: why is the original PDF not fully OCR-ed by ABBYY?


Comments

3 comments

  • Nikolay Krivchanskiy

     Hi Koen,

    There are a number of ways to improve recognition quality in FineReader Engine. For example, we achieved much better results by setting the options ObjectsExtractionParams::EnableAggressiveTextExtraction and ObjectsExtractionParams::DetectTextOnPictures to true.

    Also, you should manually set the recognition language or languages of the document you are recognizing. You can do this with RecognizerParams::SetPredefinedTextLanguage.

    For more information about object extraction options, please refer to Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → ObjectsExtractionParams.
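
    As a rough illustration of the language suggestion above (a sketch, not taken from the thread): the helper below builds a single language string from a list of names. `LanguageList` and `join` are hypothetical names, and both the comma-separated format and the Java call shown in the comment are assumptions that should be checked against the API reference for RecognizerParams::SetPredefinedTextLanguage.

```java
import java.util.List;
import java.util.stream.Collectors;

public class LanguageList {
    /** Joins language names into one comma-separated string, skipping blanks and duplicates. */
    static String join(List<String> names) {
        return names.stream()
                .map(String::trim)            // tolerate stray whitespace around names
                .filter(s -> !s.isEmpty())    // drop empty entries
                .distinct()                   // keep each language once, in first-seen order
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String languages = join(List.of("English", "Dutch"));
        System.out.println(languages);
        // Assumption: with the SDK on the classpath, the string would then be handed
        // to the recognizer parameters, along the lines of:
        // dpp.getPageProcessingParams().getRecognizerParams().SetPredefinedTextLanguage(languages);
    }
}
```

    Building the string in one place keeps the language list easy to change per supplier or per document type.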

        

  • Permanently deleted user

    Hi Nikolay

    Thank you for your reply.

    Just to make sure: my issue is not about the text extraction itself, but about the way a PDF is OCR-ed.

    The PDF returned from the PDFExport module of ABBYY FineReader Engine will be processed by our own software, and it is that PDF that is not fully OCR-ed.

  • Permanently deleted user

    Hi Nikolay

    I eventually figured it out with your reply.

    This actually worked for me:

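        // Start from default document processing parameters.
        IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();
        // Let the engine detect and correct the page orientation.
        dpp.getPageProcessingParams().getPagePreprocessingParams().setCorrectOrientation(true);
        // Extract text more aggressively, including text that sits inside pictures.
        dpp.getPageProcessingParams().getObjectsExtractionParams().setEnableAggressiveTextExtraction(true);
        dpp.getPageProcessingParams().getObjectsExtractionParams().setDetectTextOnPictures(true);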

    Thanks!

