We received a PDF from a supplier that needs to be OCR-ed: `esco_original.pdf`
After processing it in Java through the `com.abbyy.FREngine` API, not all parts of the PDF are OCR-ed.
See: esco_abbyy.pdf
F.Y.I. we use LoadPredefinedProfile("
When I manually print the PDF, scan it and process it through the API again, it is fully OCR-ed
See: esco_rescanned.pdf and the fully-OCR-ed variant: esco_rescanned_abbyy.
The question is: why is the original PDF not fully OCR-ed by ABBYY?
Comments
Hi Koen,
There are a number of ways to improve recognition quality in FineReader Engine. For example, we achieved much better results by setting the options ObjectsExtractionParams::EnableAggressiveTextExtraction and ObjectsExtractionParams::DetectTextOnPictures to true.
Also, you should explicitly set the recognition language or languages of the document you are recognizing. You can do this with RecognizerParams::SetPredefinedTextLanguage.
For more information about object extraction options, please refer to Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → ObjectsExtractionParams.
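As a rough illustration, the three settings above can be kept together as a small checklist in Java. This sketch does not call FineReader Engine: `ExtractionSettings` and its members are hypothetical stand-ins so that it runs without the SDK, and only the option names in the comments come from this thread. How they map onto the getters and setters of the `com.abbyy.FREngine` wrapper is an assumption to verify against the API Reference.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the settings named in this thread; NOT an ABBYY class. */
final class ExtractionSettings {
    // Stands for ObjectsExtractionParams::EnableAggressiveTextExtraction
    boolean enableAggressiveTextExtraction;
    // Stands for ObjectsExtractionParams::DetectTextOnPictures
    boolean detectTextOnPictures;
    // Stands for the language passed to RecognizerParams::SetPredefinedTextLanguage
    String predefinedTextLanguage;

    /** Builds the combination suggested above for the given language name. */
    static ExtractionSettings suggested(String language) {
        ExtractionSettings s = new ExtractionSettings();
        s.enableAggressiveTextExtraction = true;
        s.detectTextOnPictures = true;
        s.predefinedTextLanguage = language;
        return s;
    }

    /** Returns the names of the suggested options that are not set yet. */
    List<String> missing() {
        List<String> names = new ArrayList<>();
        if (!enableAggressiveTextExtraction) names.add("EnableAggressiveTextExtraction");
        if (!detectTextOnPictures) names.add("DetectTextOnPictures");
        if (predefinedTextLanguage == null || predefinedTextLanguage.isEmpty()) {
            names.add("SetPredefinedTextLanguage");
        }
        return names;
    }
}
```

For example, `ExtractionSettings.suggested("English").missing()` returns an empty list, while a freshly constructed `ExtractionSettings` reports all three options as missing. "English" is only a placeholder language name here.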
Hi Nikolay
Thank you for your reply.
Just to make sure: my issue is not about the text-extraction itself, but about the way a PDF is OCR-ed.
The PDF returned by the PDF export module of ABBYY FineReader Engine will be processed by our own software,
and it is that PDF that is not fully OCR-ed.
Hi Nikolay
I eventually figured it out with the help of your reply.
This actually worked for me:
Thanks!