[FREngine] How to limit the number of pages for processing (Answered)

Written by Permanently deleted user

November 17, 2016 19:25
Hi,

I want to process only a few pages of a large PDF. I can't get IFRDocument.ProcessPages to work because I'm not sure how to create and fill the IIntsCollection it expects.

For now I have the following snippet to OCR only the first and last page:

// Create document
IFRDocument document = engine.CreateFRDocument();

// Add image file to document
document.AddImageFile( imagePath, null, null );

// Get page-count
int pagesCount = document.getPages().getCount();
if (pagesCount > 2) {
    //only first and last page
    IIntsCollection indices=engine.CreateIntsCollection();
    indices.Add(0);
    indices.Add(pagesCount-1);
    document.ProcessPages(indices, null);
} else {
    //process full document
    document.Process( null );
}

But that gives me an error: "Document synthesis has not been performed for the page with index 1".

Regards

Comments

9 comments

Permanently deleted user

November 23, 2016 13:55

Hi Koen,

Sorry for the long silence. I've moved your question to a separate post, since it concerns our offline FREngine product rather than Cloud OCR SDK.

The error occurs because multipage output requires document synthesis for every page before export (the Process… methods include document synthesis), so every page has to be processed. You can still OCR only the first and last pages of your document and process the remaining pages using the visible text layer of the source PDF file. To do so, set the SourceContentReuseMode property of the ObjectsExtractionParams object. Below is a C# snippet that implements this scenario (sorry that it is not in Java, but the idea should be clear):

// Get page-count
int pagesCount = document.Pages.Count;

if (pagesCount > 2)
{
        //only first and last page
        FREngine.IntsCollection indicesToOCR= engineLoader.Engine.CreateIntsCollection();
        indicesToOCR.Add(0);
        indicesToOCR.Add(pagesCount - 1);
        document.ProcessPages(indicesToOCR, null);

        FREngine.DocumentProcessingParams docProcessingParams = engineLoader.Engine.CreateDocumentProcessingParams();
        docProcessingParams.PageProcessingParams.ObjectsExtractionParams.SourceContentReuseMode = FREngine.SourceContentReuseModeEnum.CRM_ContentOnly;

        for (int i = 1; i < pagesCount - 1; i++)
        {
                FREngine.IntsCollection indicesWithContent = engineLoader.Engine.CreateIntsCollection();
                indicesWithContent.Add(i);
                document.ProcessPages(indicesWithContent, docProcessingParams);
        }
}
else
{
        //process full document
        document.Process(null);
}

Hope this will be useful!
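For readers working in Java, the index bookkeeping from the C# snippet above can be sketched on its own. This is a minimal sketch, not FineReader Engine code: the class FirstLastPlan and its methods are hypothetical helpers, and the only Engine calls referred to are the ones already shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FineReader Engine.
final class FirstLastPlan {
    // Zero-based indices to recognize with OCR: the first and the last page.
    static List<Integer> ocrIndices(int pagesCount) {
        List<Integer> indices = new ArrayList<>();
        if (pagesCount > 0) {
            indices.add(0);
        }
        if (pagesCount > 1) {
            indices.add(pagesCount - 1);
        }
        return indices;
    }

    // Zero-based indices of the pages in between, whose existing
    // text layer would be reused instead of running OCR.
    static List<Integer> reuseIndices(int pagesCount) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 1; i < pagesCount - 1; i++) {
            indices.add(i);
        }
        return indices;
    }
}
```

Each value would then be added to a collection created with engine.CreateIntsCollection() and passed to document.ProcessPages, with null parameters for the OCR pages and CRM_ContentOnly parameters for the others, as in the snippets above. Whether a single ProcessPages call over all middle pages behaves the same as the per-page loop in the C# code is an assumption to verify.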

Permanently deleted user

June 08, 2018 04:56
How do I use the GetPagesToProcess function of IFileAdapter in the Hello C# code sample of FineReader Engine 12? Also, can you explain why I have to perform document synthesis in the code above?

Daria Zvereva

June 08, 2018 14:10
Hi!

As we have already answered in the post, please see our standard BatchProcessing code sample in C#.

Document processing in ABBYY FineReader Engine consists of several steps: page preprocessing, analysis, recognition, page synthesis, document synthesis, and export. At the document synthesis stage, the font styles and the logical structure of the document are recreated. This stage is required before the export stage. During export, recognized documents are saved to files in suitable formats.

Hope this information will be useful.
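Because document synthesis is required for every page before export, it can help to check which pages have not been processed yet before exporting. The following is a minimal Java sketch of that check; SynthesisCheck is a hypothetical helper that only tracks indices and does not call the Engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of FineReader Engine.
final class SynthesisCheck {
    // Returns the zero-based indices of pages that were never passed to a
    // Process... call and would therefore still lack document synthesis.
    static List<Integer> pagesMissingSynthesis(int pagesCount, Set<Integer> processed) {
        List<Integer> missing = new ArrayList<>();
        for (int i = 0; i < pagesCount; i++) {
            if (!processed.contains(i)) {
                missing.add(i);
            }
        }
        return missing;
    }
}
```

For a four-page document where only pages 0 and 3 were processed, the first missing index is 1, which matches the page index reported in the error message from the original question.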

Permanently deleted user

July 30, 2018 13:47
How can we do the same in Java? I have a multi-page PDF document that I need to run through FineReader to convert it into an editable format.


Permanently deleted user

July 31, 2018 09:46

Hi, we are trying to limit processing to a page range with the code below. We pass the start and end of the page range to digitize, but we are facing a problem: for some PDFs a range of 1-2 digitizes the whole document, and for others a range of 2-3 digitizes pages 1 to 3 when it should only digitize pages 2 to 3. I don't know what is going wrong; please review the code below.

document.AddImageFile(inPutFilePath, null, null);

// Get page-count
int pagesCount = document.Pages.Count;
FREngine.DocumentProcessingParams docProcessingParams = engine.CreateDocumentProcessingParams();

// Configure/Setup processing parameters for accuracy 
docProcessingParams.PageProcessingParams.ObjectsExtractionParams.EnableAggressiveTextExtraction = true;
docProcessingParams.PageProcessingParams.ObjectsExtractionParams.DetectTextOnPictures = true;
docProcessingParams.PageProcessingParams.PagePreprocessingParams.CorrectOrientation = true;
docProcessingParams.PageProcessingParams.PageAnalysisParams.EnableExhaustiveAnalysisMode = true;
docProcessingParams.PageProcessingParams.RecognizerParams.TextTypes = (int)TextTypeEnum.TT_Normal | (int)TextTypeEnum.TT_Matrix | (int)TextTypeEnum.TT_Typewriter; 

// Process pages based on Page range.
if (request.page_range != null) {
    FREngine.IIntsCollection pageIndices = engine.CreateIntsCollection();
    string[] arrayPageRange = request.page_range.Split('-');
    int[] digpages = new int[] { };
    int startRange = Int32.Parse(arrayPageRange[0]);
    int endRange = Int32.Parse(arrayPageRange[1]);
    for (int i = startRange; i <= endRange; i++) {
        int rangeValue = i - 1;
        digpages = digpages.Concat(new int[] { rangeValue }).ToArray();
        pageIndices.Add(rangeValue);
        document.ProcessPages(pageIndices, docProcessingParams);
    }
    FREngine.IIntsCollection indicesWithContent = engine.CreateIntsCollection();
    FREngine.DocumentProcessingParams dpp = engine.CreateDocumentProcessingParams();
    dpp.PageProcessingParams.ObjectsExtractionParams.SourceContentReuseMode = FREngine.SourceContentReuseModeEnum.CRM_ContentOnly;
    for (int i = 0; i < pagesCount; i++) {
        if (!digpages.Contains(i)) {
            indicesWithContent.Add(i);
            document.ProcessPages(indicesWithContent, dpp);
        }
    }
} else {
    // process full document
    document.Process(docProcessingParams);
}
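One thing visible in the code above is that ProcessPages is called inside each loop while the index collection keeps growing, so pages added earlier are submitted again on every iteration. Whether that fully explains the reported behaviour is not confirmed in this thread. Note also that pages outside the range are still processed in CRM_ContentOnly mode, so their existing text layer may still appear in the output. A minimal Java sketch of turning a one-based range such as "2-3" into zero-based indices once, so that each collection can be filled first and passed to ProcessPages a single time after the loop, could look like this (PageRange is a hypothetical helper, not an Engine class):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FineReader Engine.
final class PageRange {
    // Parses a one-based inclusive range such as "2-3" into zero-based
    // page indices, rejecting ranges that fall outside the document.
    static List<Integer> toZeroBasedIndices(String range, int pagesCount) {
        String[] parts = range.trim().split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected a range like 2-3: " + range);
        }
        int start = Integer.parseInt(parts[0].trim());
        int end = Integer.parseInt(parts[1].trim());
        if (start < 1 || end < start || end > pagesCount) {
            throw new IllegalArgumentException("Range out of bounds: " + range);
        }
        List<Integer> indices = new ArrayList<>();
        for (int page = start; page <= end; page++) {
            indices.add(page - 1); // page numbers are one-based, indices zero-based
        }
        return indices;
    }
}
```

Validating the range up front also catches inputs such as "4-9" on a five-page document before any page is processed.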

Permanently deleted user

September 05, 2018 07:57
Hi,

What is the use of SourceContentReuseMode?

Permanently deleted user

September 05, 2018 11:51
Hi Rama

SourceContentReuseMode is described in the ABBYY FineReader Engine documentation:
https://knowledgebase.abbyy.com/article/1581

SourceContentReuseModeEnum

SourceContentReuseModeEnum enumeration constants describe the available modes for reusing the contents of the source PDF file.

typedef enum { CRM_Auto, CRM_DoNotReuse, CRM_ContentOnly } SourceContentReuseModeEnum;

Elements

Name: Description

CRM_Auto: ABBYY FineReader Engine uses both the text and image layers of the source PDF file.

CRM_ContentOnly: Only the visible text layer of the source PDF file is used; the image layer is not used. Do not use this setting if the source file contains only raster information, for example, for image-only PDFs. To find out if the file contains any text layer, use the IsPdfWithTextualContent method. However, note that if the document contains only an invisible text layer detected by the IsPdfWithTextualContent method, this text layer will not be used in this mode.

CRM_DoNotReuse: The text layer of the source PDF file is not used; the image layer is recognized by ABBYY FineReader Engine.
Best regards

Koen de Leijer