Hi
We are currently using ABBYY FineReader Engine 11.1.14.707470 on Linux with the Java API (com.abbyy.FREngine.jar).
Almost all PDFs are processed correctly, but after OCR-ing the attached PDF the output becomes unreadable.
We use the following code to perform the OCR:
import com.abbyy.FREngine.Engine;
import com.abbyy.FREngine.FileExportFormatEnum;
import com.abbyy.FREngine.IDocumentProcessingParams;
import com.abbyy.FREngine.IEngine;
import com.abbyy.FREngine.IFRDocument;
import com.abbyy.FREngine.IFRPage;
import com.abbyy.FREngine.IFRPages;
import com.abbyy.FREngine.IPDFExportParams;
import com.abbyy.FREngine.PDFExportScenarioEnum;

public class ABBYY {
    public ABBYY() {}

    private IEngine engine = null;

    public void Run(String inputfilename, String dllFolder, String developerSn, String languages) throws Exception {
        // Load ABBYY FineReader Engine
        engine = Engine.GetEngineObject(dllFolder, developerSn);
        try {
            // Setup ABBYY FineReader Engine
            String profile = "DocumentConversion_Accuracy";
            engine.LoadPredefinedProfile(profile);
            // Process PDF
            processPDF(inputfilename, languages);
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            // Unload ABBYY FineReader Engine
            engine = null;
            Engine.DeinitializeEngine();
        }
    }

    private void processPDF(String inputfilename, String languages) {
        String imagePath = inputfilename;
        try {
            // Create document
            IFRDocument document = engine.CreateFRDocument();
            /*
            If orientation detection is performed during document processing
            (IPagePreprocessingParams::CorrectOrientation property is TRUE), you can select fast
            orientation detection mode: set the OrientationDetectionMode property of the
            OrientationDetectionParams object to ODM_Fast.
            */
            IDocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();
            dpp.getPageProcessingParams().getPagePreprocessingParams().setCorrectOrientation(true);
            // Aggressive text selection
            dpp.getPageProcessingParams().getObjectsExtractionParams().setEnableAggressiveTextExtraction(true);
            dpp.getPageProcessingParams().getObjectsExtractionParams().setDetectTextOnPictures(true);
            // Set language
            dpp.getPageProcessingParams().getRecognizerParams().SetPredefinedTextLanguage(languages);
            dpp.getPageProcessingParams().getRecognizerParams().setLanguageDetectionMode(com.abbyy.FREngine.ThreeStatePropertyValueEnum.TSPV_Yes);
            try {
                // Add image file to document
                document.AddImageFile(imagePath, null, null);
                // Remove empty pages from the input file (iterate backwards so indices stay valid)
                boolean hasEmptyPages = false;
                IFRPages pages = document.getPages();
                for (int p = (pages.getCount() - 1); p >= 0; p--) {
                    IFRPage page = pages.getElement(p);
                    if (page.IsEmptyEx(null, null, null)) {
                        pages.DeleteAt(p);
                        hasEmptyPages = true;
                    }
                }
                if (hasEmptyPages) document.Synthesize(null);
                // Process document
                document.Process(dpp);
                // Save results to PDF using the 'balanced' scenario
                IPDFExportParams pdfParams = engine.CreatePDFExportParams();
                pdfParams.setScenario(PDFExportScenarioEnum.PES_Balanced);
                /*
                Specifies whether a linearized PDF file should be created. Linearized PDF files have internal data
                arranged in a page order. A page of a linearized PDF file can be read in a web browser plug-in
                without waiting for the whole file to be downloaded. Non-linearized PDFs have the data
                necessary to assemble a document page scattered through the whole file. Non-linearized
                PDF files are smaller, but they are slower to access.
                Note: This property makes sense only for multipage PDF files. If the property is set to TRUE and
                a one-page document is exported, a nonlinearized file is created.
                This property is FALSE by default.
                */
                pdfParams.getPDFFeatures().setEnableLinearization(true);
                String pdfExportPath = inputfilename + "_ocrred.pdf";
                document.Export(pdfExportPath, FileExportFormatEnum.FEF_PDF, pdfParams);
            } finally {
                // Close document
                document.Close();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Which parameters do we need to set in our Java code to prevent this issue?
Are there any suggestions for the settings of FREngine itself?
Or is this a known issue in FREngine 11 that has been, or will be, fixed in a more recent version?
Many thanks in advance,
Koen de Leijer
- d52129d5-4d7c-4753-abf1-a9ec0099af28_f201900257.pdf
- c206cec8-3946-4e70-b76f-a9ec0099b87d_f201900257-ocrred.pdf
- 7e0b3e77-b13a-4851-b1c5-a9ec009f3e6a_623561be-8044-421e-af47-62afbae9c2f7.pdf
- fb121638-e046-4fb2-8fea-a9ec009f4c75_623561be-8044-421e-af47-62afbae9c2f7.pdf-ocrred.pdf
- 6b544ec6-114b-42b8-a280-a9ec009f87b9_f366540.pdf
- 090d40a9-7b75-4670-8dc8-a9ec009f915b_f366540-ocrred.pdf
- 5fd96c15-b0cb-4197-9dc2-a9ed0098ee00_3635340.pdf
- f3e0ab63-ef77-4884-8cfd-a9ed0098f8ad_3635340.pdf-ocrred.pdf
Comments
Hi Koen,
This is a known issue when FineReader Engine is used on Linux. Please try opening the input PDF in the default PDF viewer on the same computer that runs the OCR; the result should look broken there as well.
A PDF without embedded fonts can be opened and read correctly only on systems that have the referenced fonts installed. On Windows machines, we can be sure that fonts such as "Arial" will be found and the files will be processed successfully, because these fonts come with the Windows installation. On Linux machines, reading such a PDF requires an additional font pack.
For compatibility, we strongly advise against creating PDFs without embedded fonts when they are intended to be distributed across different environments. To continue working with existing files, please choose one of the following options:
· Install the fonts and set up the compatibility options on your Linux system. You can read more about this on the Linux community page https://askubuntu.com/questions/651441/how-to-install-arial-font-in-ubuntu
· Alternatively, repair the PDF file and embed the missing fonts as described on this community page: https://stackoverflow.com/questions/12857849/how-to-repair-a-pdf-file-and-embed-missing-fonts/13131101#13131101
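To confirm whether the referenced fonts are visible on the Linux machine, before and after installing them, a small check along the following lines may help. This is only a sketch under assumptions: the class name FontCheck, the method missingFonts, and the list of font families are illustrative and not part of FineReader Engine, and the check reports the fonts that the JVM can see, which may not exactly match what the native engine finds.

```java
import java.awt.GraphicsEnvironment;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the FineReader Engine API.
class FontCheck {

    // Returns the required font families that are absent from the installed
    // ones. Family names are compared case-insensitively.
    static List<String> missingFonts(List<String> required, String[] installed) {
        Set<String> available = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        available.addAll(Arrays.asList(installed));
        List<String> missing = new ArrayList<>();
        for (String family : required) {
            if (!available.contains(family)) {
                missing.add(family);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Font families as seen by the JVM on this machine.
        String[] installed = GraphicsEnvironment
                .getLocalGraphicsEnvironment()
                .getAvailableFontFamilyNames();

        // Assumed list: adjust it to the fonts your PDFs actually reference.
        List<String> required = Arrays.asList("Arial", "Times New Roman", "Courier New");

        List<String> missing = missingFonts(required, installed);
        if (missing.isEmpty()) {
            System.out.println("All required fonts are installed.");
        } else {
            System.out.println("Missing fonts: " + missing);
        }
    }
}
```

Running this on the OCR server before calling the engine gives an early warning that PDFs without embedded fonts are likely to be rendered incorrectly.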
Hi Nadezhda
Many thanks for your response.
Installing the MS Core Fonts ("apt-get install ttf-mscorefonts-installer") is the solution.
Best regards
Koen de Leijer