Tables only from PDF

Helen Osetrova

August 10, 2018 14:43

Hello!

Please create the PageAnalysisParams object and tune its properties: set IPageAnalysisParams::DetectText = false and IPageAnalysisParams::DetectTables = true. Please learn more about the PageAnalysisParams object in Developer’s Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → PageAnalysisParams.

You can also create a user profile with the required settings, save it as an .ini file and then load it using the IEngine::LoadProfile method:

private IEngine engine = null;
...
engine = Engine.GetEngineObject( SamplesConfig.GetDllFolder(), SamplesConfig.GetDeveloperSN() );
...
engine.LoadProfile( "../profile.ini" );

“profile.ini” should contain the following lines:

[PageAnalysisParams]
DetectText = false
DetectTables = true
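If you would rather generate the profile from code than write it by hand, a minimal sketch could look like the one below. It uses only standard Java I/O. `ProfileWriter`, `renderSection` and `write` are hypothetical helper names and are not part of FineReader Engine; the UTF-8 encoding and the `profile.ini` output path are assumptions as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of FineReader Engine.
class ProfileWriter {

    // Renders one section in the "[Section]" / "Key = value" form used by the profile above.
    static String renderSection(String section, Map<String, String> settings) {
        StringBuilder sb = new StringBuilder();
        sb.append('[').append(section).append("]\n");
        for (Map.Entry<String, String> entry : settings.entrySet()) {
            sb.append(entry.getKey()).append(" = ").append(entry.getValue()).append('\n');
        }
        return sb.toString();
    }

    // Writes the text to the target path; I/O failures are rethrown unchecked.
    static void write(Path target, String text) {
        try {
            // Assumption: UTF-8 is acceptable because the content is plain ASCII.
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the keys in insertion order.
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("DetectText", "false");
        settings.put("DetectTables", "true");

        String text = renderSection("PageAnalysisParams", settings);
        write(Paths.get("profile.ini"), text); // assumed output location
        System.out.print(text);
    }
}
```

The resulting file can then be passed to IEngine::LoadProfile in the same way as a hand-written profile.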

You can also specify the recognition language and many other options in the profile. Please learn more about using profiles in Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.

You can find comprehensive Java samples for different scenarios in the %ABBYY FineReader Engine folder%/Samples/Java directory.

Permanently deleted user

August 13, 2018 12:58

how to use that profile.ini file to extract only tables from pdf?

Helen Osetrova

August 14, 2018 11:32

Hello,

Please note that the profiles usage is described in detail in Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.

When some new objects are created, the properties of newly created objects are usually set to reasonable defaults. But default values are not always optimal for all usage scenarios. You may need to change these properties in some cases. This can be done either via the API or with the help of a profile. A profile contains a list of new default values for object properties. The LoadProfile() method of the Engine object allows you to load a user profile file (profile.ini). After this file is loaded, newly created objects will have the new default values specified in the file.

So, to use the profile for processing, you should call the LoadProfile( String FileName ) method, which takes a single parameter, FileName. FileName contains the path to the profile file. You can specify either a full path or a path relative to the current directory.
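To make the idea of "new default values" more concrete, here is a small sketch in plain Java that models it. It is not how the Engine reads profiles internally. `ProfileDefaults`, `parse` and `effective` are hypothetical names, and the built-in values in `main` are placeholders invented for the demo, not the Engine's real defaults.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only; FineReader Engine does its own profile handling.
class ProfileDefaults {

    // Parses "[Section]" headers and "Key = value" lines into section -> (key -> value).
    static Map<String, Map<String, String>> parse(String iniText) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String raw : iniText.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                String name = line.substring(1, line.length() - 1);
                current = sections.computeIfAbsent(name, k -> new LinkedHashMap<>());
            } else if (current != null && line.contains("=")) {
                int eq = line.indexOf('=');
                current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return sections;
    }

    // Returns the built-in values with the profile values layered on top of them.
    static Map<String, String> effective(Map<String, String> builtIn, Map<String, String> fromProfile) {
        Map<String, String> result = new LinkedHashMap<>(builtIn);
        if (fromProfile != null) {
            result.putAll(fromProfile);
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder values for the demo; see Developer's Help for the real defaults.
        Map<String, String> builtIn = new LinkedHashMap<>();
        builtIn.put("DetectText", "true");
        builtIn.put("DetectTables", "true");

        String profile = "[PageAnalysisParams]\nDetectText = false\nDetectTables = true\n";
        Map<String, String> fromProfile = parse(profile).get("PageAnalysisParams");

        System.out.println(effective(builtIn, fromProfile));
    }
}
```

In this model, any key that the profile does not mention keeps its built-in value. This matches the description above, where only the listed properties receive new defaults.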

Please find below a Java code snippet based on the Java sample included in the FineReader Engine distribution pack:

    private void processImage() {

        String imagePath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\Demo.tif";
        String profilePath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\profile.ini";  // you should put profile.ini to the specified directory

        try {

            // Load Engine
            engine = Engine.GetEngineObject( SamplesConfig.GetDllFolder(), SamplesConfig.GetDeveloperSN() ); // you should specify a valid Developer Serial Number in SamplesConfig.java        

            // Load profile.ini
            engine.LoadProfile(profilePath);

            // Create document
            IFRDocument document = engine.CreateFRDocument();            

            try {

                // Add image file to document
                displayMessage( "Loading image...");
                document.AddImageFile( imagePath, null, null );

                // Process document
                displayMessage( "Process...");
                document.Process();
             
                // Save results
                displayMessage( "Saving results...");

                // Save results to DOCX with default parameters
                String docxExportPath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\Demo.docx";
                document.Export( docxExportPath, FileExportFormatEnum.FEF_DOCX, null );

            } finally {

                // Close document
                document.Close();

                displayMessage("Done ...");

                // Unload Engine
                engine = null;
                Engine.DeinitializeEngine();

           }

        } catch( Exception ex ) {

            displayMessage( ex.getMessage() );

        }
    }

Permanently deleted user

August 16, 2018 07:12

can we do this using Layout and Blocks?

and how to implement that?

Permanently deleted user

August 17, 2018 07:57

How to extract this table in proper way? The process we discussed above is extracting this table but it is dividing 'Firefox 1.0' as two cells and giving as two columns. How can I avoid that and get proper table?

Permanently deleted user

August 17, 2018 22:56

Hi Rama,

Please try adding the following lines to the profile.ini file:

[RTFExportParams]
KeepLines = true
PageSynthesisMode = PSM_RTFEditableCopy

Permanently deleted user

August 20, 2018 13:58

Rama Reddy posted this 1 minute ago

If I am using Blocks, I am able to identify blocks of type Table. But how can I collect all the blocks, and how can I export them?

Permanently deleted user

August 21, 2018 10:35

Why am I not getting the blocks? It is showing zero blocks.

IFRDocument document = engine.CreateFRDocument();
try {
    // Add image file to document
    displayMessage( "Loading image..." );
    IRegionsCollection reg=engine.CreateRegionsCollection();
    document.AddImageFile( imagePath, null, null );
    IFRPages pages=document.getPages();
    IRegion region=engine.CreateRegion();
    System.out.println(pages.getCount());
    if (pages != null && pages.getCount() > 0)
    {
        for(int i=0; i<pages.getCount();i++)
        {
            IFRPage page=pages.Item(i);
            ILayout lay_out= page.getLayout();
            System.out.println(page.getLayout());
            ILayoutBlocks blocks=lay_out.getBlocks();
            System.out.println(lay_out.getBlocks());
            System.out.println(document.getPages().Item(i).getLayout().getBlocks().getCount());
            document.getPages().Item(i).getLayout().getBlocks().DeleteAll();
            int c=0;
            System.out.println(blocks.getCount());
            if(blocks != null && blocks.getCount()>0)
            {
                for(int j=0;i<blocks.getCount();j++)
                {
                    IBlock block=blocks.Item(j);
                    System.out.println(block.getType());
                    if(block.getType()==BlockTypeEnum.BT_Table)
                    {
                        System.out.println(c);
                        ITableBlock tblock=block.GetAsTableBlock();
                        region=block.getRegion();
                        document.getPages().Item(i).getLayout().getBlocks().AddNew(block.getType(),region,c);
                        c++;
                    }
                    //document.ProcessPages(null,null,reg);
                    document.Recognize(null,null);
                    document.Synthesize(null);
                    String texExportPath = SamplesConfig.GetSamplesFolder() + "images/Emely_11111.xls";
                    document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);
                }

Helen Osetrova

August 21, 2018 14:21

Hi Rama!

Document processing in ABBYY FineReader Engine consists of several steps: page preprocessing, analysis, recognition, page synthesis, document synthesis, and export. Getting access to the document layout is possible after the analysis stage.

Please learn more about processing steps on our Technology Portal and in the Developer’s Help → Guided Tour → Advanced Techniques → Tuning Parameters of Page Preprocessing, Analysis, Recognition, and Synthesis. Please note that the IFRDocument::Process() method includes all stages of processing except the export.

So, before working with the page blocks you should apply the IFRDocument::Analyze() method for the whole document or the IFRPage::Analyze() method for each page. Otherwise, the Blocks collection will be empty.

After this, please do the following for each document page:

1. Create a new Layout instance and add to it every TableBlock that you are interested in.

2. Set the newly created layout as an actual layout for the page and perform page synthesis.

Please see the Java code sample below:

try {

                // Add image file to document
                displayMessage( "Loading image..." );
                document.AddImageFile( imagePath, null, null );
                document.Preprocess( null, null, null, null);
                document.Analyze( null, null, null); 

                IFRPages frPages = document.getPages();
                int pagesCount = frPages.getCount();

                for (int j =0; j < pagesCount; j++) {
                    ILayout layout = engine.CreateLayout();
                    ILayoutBlocks layBlocks = layout.getBlocks();

                    IFRPage page = frPages.getElement(j);
                    ILayout pageLayout = page.getLayout();
                    ILayoutBlocks blocks = pageLayout.getBlocks();
                    int blocksCount = blocks.getCount();

                    for (int i = 0; i < blocksCount; i++) {
                        IBlock block = blocks.getElement(i);
                        BlockTypeEnum blockType = block.getType();

                        if (blockType == BlockTypeEnum.BT_Table) {
                            IRegion region = block.getRegion();
                            layBlocks.AddNew(BlockTypeEnum.BT_Table, region, 0);

                        }

                        page.setLayout(layout);
                        page.Recognize(null, null);                      

                    }
                }

                document.Synthesize(null);
...
                document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);

}

You can learn more about working with the document layout in the Developer’s Help → Guided Tour → Advanced Techniques → Working with Layout and Blocks section.

Permanently deleted user

August 22, 2018 08:14

and still it is giving blocks as zero.

It is giving error at engine.CreateLayout();

and still i am getting zero blocks.

Helen Osetrova

August 22, 2018 12:21

Hello Rama!

Could you give us some more details about the issue that you face? What kind of error is it? What form does it take?

In addition, I would like to apologize for a mistake that slipped into the code sample. Please move the following lines outside of the inner for (int i = 0; i < blocksCount; i++) { ... } block:

page.setLayout(layout);
page.Recognize(null, null);

So, the corrected code should look like this:

...
for (int j =0; j < pagesCount; j++) {
...
  for (int i = 0; i < blocksCount; i++) {
       ...
       if (blockType == BlockTypeEnum.BT_Table) {
           ...
       }
  } // end of the inner for block

  page.setLayout(layout);
  page.Recognize(null, null);

} // end of the outer for block

document.Synthesize(null);
...
document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);
...

If you still have difficulties after the source code modification, please describe the issue as fully as possible to help us to assist you better.
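To show the corrected loop structure in a form that runs without the Engine, here is a sketch built on stand-in types. `BlockType` and `Page` are hypothetical stand-ins for the Engine's block type enumeration and page object; they are not the real API. The sketch only demonstrates that the new layout is assigned and recognition is triggered once per page, after the inner loop has collected all table blocks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in model of the loop structure; none of these types belong to FineReader Engine.
class TableOnlyLayoutSketch {

    // Hypothetical stand-in for the block types found by analysis.
    enum BlockType { TEXT, TABLE, PICTURE }

    // Hypothetical stand-in for a page: its current layout plus a counter of recognition passes.
    static final class Page {
        List<BlockType> layout;
        int recognizeCalls = 0;

        Page(List<BlockType> analyzedBlocks) {
            this.layout = new ArrayList<>(analyzedBlocks);
        }

        void setLayout(List<BlockType> newLayout) {
            this.layout = newLayout;
        }

        void recognize() {
            recognizeCalls++;
        }
    }

    static void keepTablesOnly(List<Page> pages) {
        for (Page page : pages) {
            // Build a fresh layout that receives only the table blocks.
            List<BlockType> tableLayout = new ArrayList<>();
            for (BlockType type : page.layout) {
                if (type == BlockType.TABLE) {
                    tableLayout.add(type);
                }
            } // end of the inner loop

            // Done once per page, outside the inner loop.
            page.setLayout(tableLayout);
            page.recognize();
        }
    }

    public static void main(String[] args) {
        Page page = new Page(Arrays.asList(
                BlockType.TEXT, BlockType.TABLE, BlockType.PICTURE, BlockType.TABLE));
        keepTablesOnly(Arrays.asList(page));
        System.out.println(page.layout + " recognized " + page.recognizeCalls + " time(s)");
    }
}
```

If the two calls stayed inside the inner loop, the page layout would be replaced and recognition would run again for every block visited, which is the mistake corrected above.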

Permanently deleted user

August 22, 2018 13:31

I am able to check the blocks, but how can we leave out all the other blocks, keep the text blocks in the document, and export them?

private void processImage() {
    String imagePath = SamplesConfig.GetSamplesFolder() + "images/interest-notice.jpg";
    try {
        // Don't recognize PDF file with a textual content, just copy it
        if( engine.IsPdfWithTextualContent( imagePath, null ) ) {
            displayMessage( "Copy results..." );
            String resultPath = SamplesConfig.GetSamplesFolder() + "interest-notice.pdf";
            Files.copy( Paths.get( imagePath ), Paths.get( resultPath ), StandardCopyOption.REPLACE_EXISTING );
        }

        // Create document
        IFRDocument document = engine.CreateFRDocument();
        try {
            // Add image file to document
            displayMessage( "Loading image..." );
            //IRegionsCollection reg=engine.CreateRegionsCollection();
            document.AddImageFile( imagePath,null,null);
            document.Preprocess(null,null,null,null);
            document.Analyze(null,null,null);
            IFRPages pages=document.getPages();
            int page_cnt=pages.getCount();
            //IRegion region=engine.CreateRegion();
            //System.out.println(pages.getCount());
            if (pages != null &&  page_cnt > 0)
            {
                for(int i=0; i< page_cnt;i++)
                {
                    ILayout layout=engine.CreateLayout();
                    ILayoutBlocks layblocks=layout.getBlocks();
                    IFRPage page=pages.getElement(i);
                    page.Analyze(null,null,null);
                    page.Recognize(null,null);
                    page.Synthesize(null);

                    ILayout lay_out= page.getLayout();
                    System.out.println(page.getLayout());
                    ILayoutBlocks blocks=lay_out.getBlocks();
                    System.out.println(lay_out.getBlocks());
                    layblocks.DeleteAll();
                    int blocks_cnt=blocks.getCount();
                    System.out.println(blocks.getCount());
                    if(blocks != null && blocks_cnt>0)
                    {
                        for(int j=0;i<blocks_cnt;j++)
                        {
                            IBlock block=blocks.getElement(j);
                            System.out.println(block.getType());
                            if(block.getType()==BlockTypeEnum.BT_Table)
                            {
                                IRegion region =block.getRegion();
                                layblocks.AddNew(BlockTypeEnum.BT_Table,region,0);
                            }
                            page.setLayout(layout);
                            page.Recognize(null,null);
                        }
                    }
                }
            }

            document.Synthesize(null);
            String texExportPath = SamplesConfig.GetSamplesFolder() + "images/interest-notice34.xls";
            document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);
        }
        finally {
            // Close document
            document.Close();
        }
    } catch( Exception ex ) {
        displayMessage( ex.getMessage());
    }
}

Helen Osetrova

August 23, 2018 17:49

Hi Rama,

To help us assist you better, could you please clarify which version of the ABBYY product you use?

If creating a new ILayout instance does not work for you, please try the following:

1. Obtain the actual page layout.

2. Check its blocks one by one and remove the ones whose type is not BT_Table.

Please find below a code snippet that illustrates the suggested approach:

...

IFRPages frPages = document.getPages();
int pagesCount = frPages.getCount();

for (int j =0; j < pagesCount; j++) {

     IFRPage page = frPages.getElement(j);
     ILayout pageLayout = page.getLayout();
     ILayoutBlocks blocks = pageLayout.getBlocks();
     int blocksCount = blocks.getCount();
     int i =0;

      while (i < blocksCount) {

          IBlock block = blocks.getElement(i);
          displayMessage( "Checking blocks");
          BlockTypeEnum blockType = block.getType();

          if (blockType != BlockTypeEnum.BT_Table) {

             displayMessage( "Delete block");
             blocks.DeleteAt(i);
             blocksCount = blocks.getCount();
             continue;

          }

          i++;
      } // iterating blocks

     displayMessage( "Recognize page...");
     page.Recognize(null, null);

} // iterating pages

...
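The deletion loop above advances the index only when a block is kept. After a deletion, the next block moves into the current position, so the same index has to be checked again. The sketch below reproduces that pattern on a plain Java list so that it can be run without the Engine. `InPlaceBlockFilter` is a hypothetical name, and the strings are stand-ins for block types, not the Engine's enumeration values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the delete-while-iterating pattern; not Engine code.
class InPlaceBlockFilter {

    // Removes every entry that is not "TABLE" and returns how many entries were removed.
    static int removeNonTables(List<String> blockTypes) {
        int removed = 0;
        int i = 0;
        while (i < blockTypes.size()) {
            if (!"TABLE".equals(blockTypes.get(i))) {
                // Removing by index shifts the following entries left,
                // so the index is deliberately not advanced here.
                blockTypes.remove(i);
                removed++;
                continue;
            }
            i++; // keep this entry and move on
        }
        return removed;
    }

    public static void main(String[] args) {
        List<String> blockTypes = new ArrayList<>(
                Arrays.asList("TEXT", "TABLE", "TEXT", "PICTURE", "TABLE"));
        int removed = removeNonTables(blockTypes);
        System.out.println("Removed " + removed + ", remaining " + blockTypes);
    }
}
```

Incrementing the index after every iteration, including after a deletion, would skip the block that follows each removed one. This is why the snippet above uses a while loop with continue rather than a plain counting for loop.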

Permanently deleted user

August 25, 2018 19:42

Hi - In FineReader 14 windows (corporate) 2 questions:

1) When FineReader converts a HTML document to PDF, is there a way to avoid having any page breaks in the PDF document? My documents have both text and tables and I want to avoid tables being split between 2 pages in the PDF document.

2) can I associate pre-formatted table templates for a specific document in Hot Folder so when FineReader scans/OCR that document it automatically finds the table in the document associated with the template and applies the template to it?

Permanently deleted user

August 27, 2018 04:03

Hi Helen,

Thank you, it is working. But the issue is that some tables are not analyzed properly, and some parts of a table are not identified as a table. Which parameters do I have to set to improve this and analyze the document well?

Helen Osetrova

August 27, 2018 16:19

Hi Rama,

Please try tuning the parameters of the IFRDocument::Analyze() method. For example, create an IPageAnalysisParams object and set its AggressiveTableDetection property to true. If part of the table appears as a picture in the result file, also try setting the DetectPictures property of IPageAnalysisParams to false. Then pass the newly created object to the IFRDocument::Analyze() method:

...
IPageAnalysisParams pageAnalysisParams = engine.CreatePageAnalysisParams();
pageAnalysisParams.setAggressiveTableDetection(true);
pageAnalysisParams.setDetectPictures(false); 
document.Analyze(pageAnalysisParams,null,null);
...

You can also specify the particular block on a page as a table block and analyze its structure with the help of the IFRPage::AnalyzeTable() method. Please learn more about this method from the Developer's Help.
