Tables only from PDF

 Hi,

 

I have a PDF with two tables and some text. I want to extract only the tables and leave out the text using the ABBYY FineReader Java API. How can we do that? Can you suggest some Java code?

Comments

16 comments

  • Helen Osetrova

    Hello!

     

    Please create a PageAnalysisParams object and tune its properties: set IPageAnalysisParams::DetectText = false and IPageAnalysisParams::DetectTables = true. You can learn more about the PageAnalysisParams object in Developer’s Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → PageAnalysisParams.

     

    You can also create a user profile with the required settings, save it as an .ini file, and then load it using the IEngine::LoadProfile method:

    private IEngine engine = null;
    ...
    engine = Engine.GetEngineObject( SamplesConfig.GetDllFolder(), SamplesConfig.GetDeveloperSN() );
    ...
    engine.LoadProfile( "../profile.ini" );

     

    “profile.ini” should contain the following lines:

    [PageAnalysisParams]
    DetectText = false
    DetectTables = true
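
    If you prefer to generate the profile from code rather than by hand, a minimal sketch could look like the one below. ProfileWriter and writeProfile are hypothetical names, not part of the FineReader Engine API, and the sketch assumes that a plain ASCII text file is acceptable as a profile.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of the FineReader Engine API): writes the
// tables-only profile so that it can later be passed to engine.LoadProfile().
class ProfileWriter {

    // Profile lines copied from the reply above.
    static final List<String> TABLES_ONLY_PROFILE = Arrays.asList(
            "[PageAnalysisParams]",
            "DetectText = false",
            "DetectTables = true");

    // Writes the profile to the given path and returns its absolute path.
    static Path writeProfile(Path target) throws IOException {
        // The content is ASCII only, so UTF-8 output contains no special bytes.
        Files.write(target, TABLES_ONLY_PROFILE, StandardCharsets.UTF_8);
        return target.toAbsolutePath();
    }

    public static void main(String[] args) throws IOException {
        Path profile = writeProfile(Paths.get("profile.ini"));
        System.out.println("Profile written to " + profile);
        // Next step (requires the Engine, not shown here):
        // engine.LoadProfile( profile.toString() );
    }
}
```

    The returned absolute path can be handed to LoadProfile as a string, which avoids any doubt about the current directory.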

     

    You can also specify the recognition language and many other options in the profile. Please learn more about using profiles in Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.

     

    You can find comprehensive Java samples for different scenarios under the %ABBYY FineReader Engine folder%/Samples/Java directory.

     

  • Rama Reddy

    How do I use that profile.ini file to extract only the tables from a PDF?

  • Helen Osetrova

    Hello,

     

    Please note that the profiles usage is described in detail in Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.

     

    When some new objects are created, the properties of newly created objects are usually set to reasonable defaults. But default values are not always optimal for all usage scenarios. You may need to change these properties in some cases. This can be done either via the API or with the help of a profile. A profile contains a list of new default values for object properties. The LoadProfile() method of the Engine object allows you to load a user profile file (profile.ini). After this file is loaded, newly created objects will have the new default values specified in the file.
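
    To make the structure of a profile concrete (object name as the section, property name as the key, new default as the value), here is a small sketch that reads profile lines into a nested map so you can inspect them before loading. ProfileInspector is a hypothetical helper and its parser is a simplification that only understands "[Section]" and "Key = Value" lines; it is not how the Engine itself reads profiles.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the FineReader Engine API).
class ProfileInspector {

    // Parses "[Section]" headers and "Key = Value" lines into
    // section name -> (property name -> new default value).
    static Map<String, Map<String, String>> parse(List<String> lines) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        Map<String, String> current = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith("[") && line.endsWith("]")) {
                String name = line.substring(1, line.length() - 1).trim();
                current = sections.computeIfAbsent(name, k -> new LinkedHashMap<>());
            } else if (current != null && line.contains("=")) {
                int eq = line.indexOf('=');
                current.put(line.substring(0, eq).trim(), line.substring(eq + 1).trim());
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<String> profile = Arrays.asList(
                "[PageAnalysisParams]",
                "DetectText = false",
                "DetectTables = true");
        // prints {PageAnalysisParams={DetectText=false, DetectTables=true}}
        System.out.println(parse(profile));
    }
}
```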

     

    So, to use the profile for processing, you should call the LoadProfile( String FileName ) method. Its only parameter, FileName, contains the path to the profile file. You can specify either a full path or a path relative to the current directory.
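
    Because a relative path depends on the current directory, it can help to resolve and check the path before passing it to LoadProfile. The sketch below is only an illustration: ProfilePathCheck and resolveProfile are hypothetical names, and it assumes that the Engine resolves relative paths against the same working directory as the JVM, which this thread does not confirm.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of the FineReader Engine API).
class ProfilePathCheck {

    // Resolves a full or relative profile path against the JVM's current
    // working directory and fails early if no such file exists.
    static Path resolveProfile(String fileName) {
        Path resolved = Paths.get(fileName).toAbsolutePath().normalize();
        if (!Files.isRegularFile(resolved)) {
            throw new IllegalArgumentException("Profile not found: " + resolved);
        }
        return resolved;
    }

    public static void main(String[] args) {
        String fileName = args.length > 0 ? args[0] : "../profile.ini";
        try {
            Path profile = resolveProfile(fileName);
            System.out.println("Profile resolved to " + profile);
            // Next step (requires the Engine, not shown here):
            // engine.LoadProfile( profile.toString() );
        } catch (IllegalArgumentException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
```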

     

    Please find below a Java code snippet based on the Java sample included in the FineReader Engine distribution pack:

     

        private void processImage() {

            String imagePath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\Demo.tif";
            String profilePath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\profile.ini";  // you should put profile.ini to the specified directory

            try {

                // Load Engine
                engine = Engine.GetEngineObject( SamplesConfig.GetDllFolder(), SamplesConfig.GetDeveloperSN() ); // you should specify a valid Developer Serial Number in SamplesConfig.java        

                // Load profile.ini
                engine.LoadProfile( profilePath ); // call LoadProfile on the loaded engine object

                // Create document
                IFRDocument document = engine.CreateFRDocument();            

                try {

                    // Add image file to document
                    displayMessage( "Loading image...");
                    document.AddImageFile( imagePath, null, null );

                    // Process document
                    displayMessage( "Process...");
                    document.Process();
                
                    // Save results
                    displayMessage( "Saving results...");

                    // Save results to DOCX with default parameters
                    String docxExportPath = SamplesConfig.GetSamplesFolder() + "\\SampleImages\\Demo.docx";
                    document.Export( docxExportPath, FileExportFormatEnum.FEF_DOCX, null );

                } finally {

                    // Close document
                    document.Close();

                    displayMessage("Done ...");

                    // Unload Engine
                    engine = null;
                    Engine.DeinitializeEngine();

               }

            } catch( Exception ex ) {

                displayMessage( ex.getMessage() );

            }
        }

     

  • Rama Reddy

    Can we do this using Layout and Blocks? If so, how would we implement that?

  • Rama Reddy

    How can I extract this table properly? The process we discussed above extracts the table, but it splits 'Firefox 1.0' into two cells and outputs them as two columns. How can I avoid that and get a proper table?

  • Oksana Serdyuk

    Hi Rama,

    Please try adding the following lines to the profile.ini file:

    [RTFExportParams]
    KeepLines = true
    PageSynthesisMode = PSM_RTFEditableCopy

  • Rama Reddy

     

    If I use Blocks, I am able to identify the blocks of type Table. But how can I collect all of those blocks, and how can I export them?

  • Rama Reddy

    Why am I not getting any blocks? It reports zero blocks.

     

    IFRDocument document = engine.CreateFRDocument();

    try {
        // Add image file to document
        displayMessage( "Loading image..." );
        IRegionsCollection reg = engine.CreateRegionsCollection();

        document.AddImageFile( imagePath, null, null );
        IFRPages pages = document.getPages();
        IRegion region = engine.CreateRegion();
        System.out.println(pages.getCount());

        if (pages != null && pages.getCount() > 0) {
            for (int i = 0; i < pages.getCount(); i++) {
                IFRPage page = pages.Item(i);
                ILayout lay_out = page.getLayout();
                System.out.println(page.getLayout());
                ILayoutBlocks blocks = lay_out.getBlocks();
                System.out.println(lay_out.getBlocks());

                System.out.println(document.getPages().Item(i).getLayout().getBlocks().getCount());
                document.getPages().Item(i).getLayout().getBlocks().DeleteAll();
                int c = 0;
                System.out.println(blocks.getCount());

                if (blocks != null && blocks.getCount() > 0) {
                    for (int j = 0; i < blocks.getCount(); j++) {
                        IBlock block = blocks.Item(j);
                        System.out.println(block.getType());
                        if (block.getType() == BlockTypeEnum.BT_Table) {
                            System.out.println(c);
                            ITableBlock tblock = block.GetAsTableBlock();
                            region = block.getRegion();
                            document.getPages().Item(i).getLayout().getBlocks().AddNew(block.getType(), region, c);
                            c++;
                        }
                    }
                }
            }
        }

        //document.ProcessPages(null,null,reg);

        document.Recognize(null,null);
        document.Synthesize(null);

        String texExportPath = SamplesConfig.GetSamplesFolder() + "images/Emely_11111.xls";
        document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);

     

  • Helen Osetrova

    Hi Rama!

     

    Document processing in ABBYY FineReader Engine consists of several steps: page preprocessing, analysis, recognition, page synthesis, document synthesis, and export. Getting access to the document layout is possible after the analysis stage. 

    Please learn more about processing steps on our Technology Portal and in the Developer’s Help → Guided Tour → Advanced Techniques → Tuning Parameters of Page Preprocessing, Analysis, Recognition, and Synthesis. Please note that the IFRDocument::Process() method includes all stages of processing except the export.

     

    So, before working with the page blocks you should apply the IFRDocument::Analyze() method for the whole document or the IFRPage::Analyze() method for each page. Otherwise, the Blocks collection will be empty.

     

     

    After this, please do the following for each document page:

    1. Create a new Layout instance and add to it every table block that you are interested in.

    2. Set the newly created layout as the actual layout for the page and perform page synthesis.

     

    Please see the Java code sample below:

    try {

                    // Add image file to document
                    displayMessage( "Loading image..." );
                    document.AddImageFile( imagePath, null, null );
                    document.Preprocess( null, null, null, null);
                    document.Analyze( null, null, null);

                    IFRPages frPages = document.getPages();
                    int pagesCount = frPages.getCount();

                    for (int j =0; j < pagesCount; j++) {
                        ILayout layout = engine.CreateLayout();
                        ILayoutBlocks layBlocks = layout.getBlocks();

                        IFRPage page = frPages.getElement(j);
                        ILayout pageLayout = page.getLayout();
                        ILayoutBlocks blocks = pageLayout.getBlocks();
                        int blocksCount = blocks.getCount();

                        for (int i = 0; i < blocksCount; i++) {
                            IBlock block = blocks.getElement(i);
                            BlockTypeEnum blockType = block.getType();

                            if (blockType == BlockTypeEnum.BT_Table) {
                                IRegion region = block.getRegion();
                                layBlocks.AddNew(BlockTypeEnum.BT_Table, region, 0);

                            }

                            page.setLayout(layout);
                            page.Recognize(null, null);
                         

                        }
                    }

                    document.Synthesize(null);
    ...
                    document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);

    }

     

    You can learn more about working with the document layout in the Developer’s Help → Guided Tour → Advanced Techniques → Working with Layout and Blocks section.

     

  • Rama Reddy

    It is still reporting zero blocks.

    It also gives an error at engine.CreateLayout();

  • Helen Osetrova

    Hello Rama!

     

    Could you give us some more details about the issue that you face? What kind of error is it? What form does it take? 

     

    In addition, I would like to apologize for a mistake that slipped into the code sample. Please place the following lines outside the inner for (int i = 0; i < blocksCount; i++) { ... } block:

    page.setLayout(layout);
    page.Recognize(null, null);

     

    So, the changed code should look like this:

    ...
    for (int j = 0; j < pagesCount; j++) {
        ...
        for (int i = 0; i < blocksCount; i++) {
            ...
            if (blockType == BlockTypeEnum.BT_Table) {
                ...
            }
        } // end of the inner for block
        page.setLayout(layout);
        page.Recognize(null, null);
    } // end of the outer for block
    document.Synthesize(null);
    ...
    document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);
    ...

     

    If you still have difficulties after the source code modification, please describe the issue as fully as possible to help us to assist you better. 

     

  • Rama Reddy

    I am able to check the blocks, but how can we drop all the other blocks, keep only the text blocks in the document, and export them?

    private void processImage() {
        String imagePath = SamplesConfig.GetSamplesFolder() + "images/interest-notice.jpg";

        try {
            // Don't recognize PDF file with a textual content, just copy it
            if( engine.IsPdfWithTextualContent( imagePath, null ) ) {
                displayMessage( "Copy results..." );
                String resultPath = SamplesConfig.GetSamplesFolder() + "interest-notice.pdf";
                Files.copy( Paths.get( imagePath ), Paths.get( resultPath ), StandardCopyOption.REPLACE_EXISTING );
            }

            // Create document
            IFRDocument document = engine.CreateFRDocument();

            try {
                // Add image file to document
                displayMessage( "Loading image..." );
                //IRegionsCollection reg=engine.CreateRegionsCollection();

                document.AddImageFile( imagePath,null,null);
                document.Preprocess(null,null,null,null);
                document.Analyze(null,null,null);

                IFRPages pages = document.getPages();
                int page_cnt = pages.getCount();
                //IRegion region=engine.CreateRegion();
                //System.out.println(pages.getCount());

                if (pages != null && page_cnt > 0) {
                    for (int i = 0; i < page_cnt; i++) {
                        ILayout layout = engine.CreateLayout();
                        ILayoutBlocks layblocks = layout.getBlocks();
                        IFRPage page = pages.getElement(i);
                        page.Analyze(null,null,null);
                        page.Recognize(null,null);
                        page.Synthesize(null);

                        ILayout lay_out = page.getLayout();
                        System.out.println(page.getLayout());
                        ILayoutBlocks blocks = lay_out.getBlocks();
                        System.out.println(lay_out.getBlocks());
                        layblocks.DeleteAll();
                        int blocks_cnt = blocks.getCount();
                        System.out.println(blocks.getCount());

                        if (blocks != null && blocks_cnt > 0) {
                            for (int j = 0; i < blocks_cnt; j++) {
                                IBlock block = blocks.getElement(j);
                                System.out.println(block.getType());
                                if (block.getType() == BlockTypeEnum.BT_Table) {
                                    IRegion region = block.getRegion();
                                    layblocks.AddNew(BlockTypeEnum.BT_Table,region,0);
                                }
                                page.setLayout(layout);
                                page.Recognize(null,null);
                            }
                        }
                    }
                }

                document.Synthesize(null);
                String texExportPath = SamplesConfig.GetSamplesFolder() + "images/interest-notice34.xls";
                document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null);
            }
            finally {
                // Close document
                document.Close();
            }
        } catch( Exception ex ) {
            displayMessage( ex.getMessage());
        }
    }

     

     

     
  • Helen Osetrova

    Hi Rama,

    So that we can assist you better, could you please clarify which version of the ABBYY product you are using?

    If creating a new ILayout instance does not work for you, please try the following:

    • obtain the actual page layout;
    • check its blocks one by one and remove those whose type is not BT_Table.

     

    Please find below the code snippet that illustrates the suggested approach: 

    ...

    IFRPages frPages = document.getPages();
    int pagesCount = frPages.getCount();

    for (int j = 0; j < pagesCount; j++) {

        IFRPage page = frPages.getElement(j);
        ILayout pageLayout = page.getLayout();
        ILayoutBlocks blocks = pageLayout.getBlocks();
        int blocksCount = blocks.getCount();
        int i = 0;

        while (i < blocksCount) {

            IBlock block = blocks.getElement(i);
            displayMessage( "Checking blocks");
            BlockTypeEnum blockType = block.getType();

            if (blockType != BlockTypeEnum.BT_Table) {

                displayMessage( "Delete block");
                blocks.DeleteAt(i);
                blocksCount = blocks.getCount();
                continue;

            }

            i++;
        } // iterating blocks

        displayMessage( "Recognize page...");
        page.Recognize(null, null);

    } // iterating pages

    ...
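
    The index handling in the loop above (advance the index only when nothing was deleted) can be tried out without the Engine on a plain java.util.List. In this sketch strings stand in for block types; KeepOnlyTables and keepOnlyTables are hypothetical names and no FineReader types are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-in illustration only: strings play the role of block types.
class KeepOnlyTables {

    // Removes, in place, every entry that is not "BT_Table".
    static void keepOnlyTables(List<String> blockTypes) {
        int i = 0;
        while (i < blockTypes.size()) {
            if (!"BT_Table".equals(blockTypes.get(i))) {
                // After a removal the next element shifts into index i,
                // so the index must not advance in this iteration.
                blockTypes.remove(i);
                continue;
            }
            i++;
        }
    }

    public static void main(String[] args) {
        List<String> blockTypes = new ArrayList<>(Arrays.asList(
                "BT_Text", "BT_Text", "BT_Table", "BT_Text", "BT_Table"));
        keepOnlyTables(blockTypes);
        System.out.println(blockTypes); // prints [BT_Table, BT_Table]
    }
}
```

    Advancing the index after every iteration, as a plain for loop would, skips the element that slides into the freed position, so two adjacent non-table entries would leave one behind.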

     

     

  • Christopher Nolan

    Hi - In FineReader 14 windows (corporate) 2 questions: 

    1) When FineReader converts an HTML document to PDF, is there a way to avoid having any page breaks in the PDF document? My documents have both text and tables, and I want to avoid tables being split between two pages in the PDF document.

    2) Can I associate pre-formatted table templates with a specific document in Hot Folder, so that when FineReader scans/OCRs that document it automatically finds the table associated with the template and applies the template to it?

  • Rama Reddy

    Hi Helen,

    Thank you, it's working. But the issue is that some tables are not analyzed properly, and some parts of a table are not identified as a table. Which parameters do I have to set to improve this and get the document analyzed well?

  • Helen Osetrova

    Hi Rama,

     

    Please try tuning the parameters of the IFRDocument::Analyze() method. For example, create an IPageAnalysisParams object and set its AggressiveTableDetection property to true. If part of the table appears as a picture in the result file, also try setting the DetectPictures property of IPageAnalysisParams to false. Then pass the newly created object to the IFRDocument::Analyze() method:

    ...
    IPageAnalysisParams pageAnalysisParams = engine.CreatePageAnalysisParams();
    pageAnalysisParams.setAggressiveTableDetection(true);
    pageAnalysisParams.setDetectPictures(false); 
    document.Analyze(pageAnalysisParams,null,null);
    ...

     

    You can also specify the particular block on a page as a table block and analyze its structure with the help of the IFRPage::AnalyzeTable() method. Please learn more about this method from the Developer's Help. 
