Hello,
When ObjectsExtractionParams::SourceContentReuseMode is set to CRM_Auto, under which circumstances does it decide to reuse the source content rather than perform OCR? I notice that even when the source PDF contains text content, it seems to always perform OCR.
Thank you,
Juan
Comments
Hi Juan,
In general, for each text block the algorithm starts with OCR and checks whether the PDF text layer is reliable. Once enough text has been recognized, if the symbols in the text layer are similar to the recognized symbols, the rest of the text block is taken from the text layer. Otherwise, the Engine continues to OCR the block.
However, some fonts are difficult to read even for the human eye, so the Engine may choose the second option and keep using OCR.
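To make the per-block decision easier to picture, here is a rough Python sketch of the idea. It is only an illustration, not the Engine's actual implementation: the function name choose_block_text, the ocr callback, the probe length of 20 characters, and the 0.9 similarity threshold are all assumptions made up for this example, and the real Engine compares recognized symbols internally in ways that are not documented here.

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def choose_block_text(
    text_layer: str,
    ocr: Callable[[int, int], str],
    block_len: int,
    probe_len: int = 20,      # hypothetical: how much to OCR before deciding
    threshold: float = 0.9,   # hypothetical: required similarity to trust the layer
) -> Tuple[str, str]:
    """Return (text, source) for one block; source is 'text_layer' or 'ocr'."""
    probe_len = min(probe_len, block_len)

    # Step 1: always start by recognizing the beginning of the block.
    recognized = ocr(0, probe_len)

    # Step 2: compare the recognized symbols with the same span of the text layer.
    similarity = SequenceMatcher(None, recognized, text_layer[:probe_len]).ratio()

    if similarity >= threshold:
        # The text layer looks reliable: take the rest of the block from it.
        return recognized + text_layer[probe_len:block_len], "text_layer"

    # The text layer does not match what is on the page: keep recognizing.
    return recognized + ocr(probe_len, block_len), "ocr"


if __name__ == "__main__":
    # Simulated page: 'rendered' stands in for what OCR would read from the image.
    rendered = "Invoice number 12345, due on 1 March"
    fake_ocr = lambda start, end: rendered[start:end]

    print(choose_block_text(rendered, fake_ocr, len(rendered)))
    print(choose_block_text("x" * len(rendered), fake_ocr, len(rendered)))
```

In this toy model, a text layer that matches the page is reused after the initial probe, while a mismatching one (for example, a font whose glyphs are recognized differently from the stored text) causes the whole block to be recognized, which would look like "always OCR" from the outside.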
If you're sure that the text layer of your document is correct, you can select the CRM_ContentOnly option directly, so that the text layer is used without this check.