Hi,
I have a PDF with two tables and some text. I want to extract only the tables and leave out the text, using the ABBYY FineReader Java API. How can we do that? Can you suggest any Java code?
Comments
Hello!
Please create the PageAnalysisParams object and tune its properties: set IPageAnalysisParams::DetectText = false and IPageAnalysisParams::DetectTables = true. Please learn more about the PageAnalysisParams object in Developer’s Help → API Reference → Parameter Objects → Preprocessing, Analysis, Recognition, and Synthesis Parameters → PageAnalysisParams.
You can also create a user profile with the required settings, save it as an .ini file and then load it using the IEngine::LoadProfile method:
“profile.ini” should contain the following strings:
You can also specify the recognition language and many other options in the profile. Please learn more about using profiles in Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.
You can find comprehensive Java samples for different scenarios under the %ABBYY FineReader Engine folder%/Samples/Java directory.
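As a rough sketch of the profile approach: a profile is a plain-text file, so it can be generated from Java before the Engine loads it. The layout used below (a [PageAnalysisParams] section with one "Property = VALUE" line per setting) is an assumption built from the property names mentioned above, so please check the exact syntax in the Working with Profiles help section. The class name ProfileWriter and its methods are illustrative and not part of the FineReader Engine API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class ProfileWriter {

    // Profile lines for "tables only" analysis.
    // ASSUMPTION: section name = object name, key = property name.
    public static List<String> tableOnlyProfileLines() {
        return Arrays.asList(
            "[PageAnalysisParams]",
            "DetectText = FALSE",
            "DetectTables = TRUE");
    }

    // Writes the profile to the given path (overwriting it) and returns the path.
    public static Path writeProfile(Path target) throws IOException {
        return Files.write(target, tableOnlyProfileLines(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path profile = writeProfile(Paths.get("profile.ini"));
        System.out.println("Profile written to " + profile.toAbsolutePath());
        // The resulting path would then be given to the Engine's LoadProfile method.
    }
}
```

Generating the file from code keeps the settings next to the application, but a hand-written profile.ini works just as well.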
How do I use that profile.ini file to extract only the tables from a PDF?
Hello,
Please note that the profiles usage is described in detail in Developer’s Help → Guided Tour → Advanced Techniques → Working with Profiles.
When some new objects are created, the properties of newly created objects are usually set to reasonable defaults. But default values are not always optimal for all usage scenarios. You may need to change these properties in some cases. This can be done either via the API or with the help of a profile. A profile contains a list of new default values for object properties. The LoadProfile() method of the Engine object allows you to load a user profile file (profile.ini). After this file is loaded, newly created objects will have the new default values specified in the file.
So, to use the profile for processing, you should call the LoadProfile( String FileName ) method. Its only parameter, FileName, contains the path to the profile file. You can specify either a full path or a path relative to the current directory.
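Because FileName may be either a full path or a path relative to the current directory, it can help to resolve and check the path before handing it to the Engine. The following is a minimal sketch using only the standard library. The class ProfilePathResolver is a hypothetical helper, and the LoadProfile call appears only as a comment because creating the Engine object is outside the scope of this sketch.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ProfilePathResolver {

    // Resolves a profile name against a base directory; absolute paths are kept as they are.
    public static Path resolveProfile(String fileName, Path baseDir) {
        Path p = Paths.get(fileName);
        return (p.isAbsolute() ? p : baseDir.resolve(p)).normalize();
    }

    public static void main(String[] args) {
        Path currentDir = Paths.get("").toAbsolutePath();
        Path profile = resolveProfile("profile.ini", currentDir);
        if (!Files.isRegularFile(profile)) {
            System.out.println("Profile not found: " + profile);
            return;
        }
        System.out.println("Profile to load: " + profile);
        // At this point the resolved path would be passed to the Engine, e.g.
        // engine.LoadProfile( profile.toString() );
        // Objects created after that call pick up the defaults from the profile.
    }
}
```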
Please find below a Java code snippet based on the Java sample included in the FineReader Engine distribution pack:
Can we do this using Layout and Blocks?
And how do we implement that?
How can I extract this table properly? The process we discussed above extracts the table, but it splits 'Firefox 1.0' into two cells and outputs them as two columns. How can I avoid that and get a proper table?
Hi Rama,
Please try to add the following strings to the profile.ini file:
I am using Blocks and I am able to identify blocks of type Table. But how can I collect all of these blocks, and how can I export them?
Why am I not getting any blocks? It shows zero blocks.
IFRDocument document = engine.CreateFRDocument();
try {
    // Add image file to document
    displayMessage( "Loading image..." );
    IRegionsCollection reg = engine.CreateRegionsCollection();
    document.AddImageFile( imagePath, null, null );
    IFRPages pages = document.getPages();
    IRegion region = engine.CreateRegion();
    System.out.println(pages.getCount());
    if (pages != null && pages.getCount() > 0)
    {
        for (int i = 0; i < pages.getCount(); i++)
        {
            IFRPage page = pages.Item(i);
            ILayout lay_out = page.getLayout();
            System.out.println(page.getLayout());
            ILayoutBlocks blocks = lay_out.getBlocks();
            System.out.println(lay_out.getBlocks());
            System.out.println(document.getPages().Item(i).getLayout().getBlocks().getCount());
            document.getPages().Item(i).getLayout().getBlocks().DeleteAll();
            int c = 0;
            System.out.println(blocks.getCount());
            if (blocks != null && blocks.getCount() > 0)
            {
                for (int j = 0; j < blocks.getCount(); j++)
                {
                    IBlock block = blocks.Item(j);
                    System.out.println(block.getType());
                    if (block.getType() == BlockTypeEnum.BT_Table)
                    {
                        System.out.println(c);
                        ITableBlock tblock = block.GetAsTableBlock();
                        region = block.getRegion();
                        document.getPages().Item(i).getLayout().getBlocks().AddNew(block.getType(), region, c);
                        c++;
                    }
                }
            }
        }
    }
    //document.ProcessPages(null,null,reg);
    document.Recognize(null, null);
    document.Synthesize(null);
    String texExportPath = SamplesConfig.GetSamplesFolder() + "images/Emely_11111.xls";
    document.Export( texExportPath, FileExportFormatEnum.FEF_XLSX, null );
}
Hi Rama!
Document processing in ABBYY FineReader Engine consists of several steps: page preprocessing, analysis, recognition, page synthesis, document synthesis, and export. Getting access to the document layout is possible after the analysis stage.
Please learn more about processing steps on our Technology Portal and in the Developer’s Help → Guided Tour → Advanced Techniques → Tuning Parameters of Page Preprocessing, Analysis, Recognition, and Synthesis. Please note that the IFRDocument::Process() method includes all stages of processing except the export.
So, before working with the page blocks you should apply the IFRDocument::Analyze() method for the whole document or the IFRPage::Analyze() method for each page. Otherwise, the Blocks collection will be empty.
After this, please do the following for each document page:
1. Create a new Layout instance and add to it every TableBlock that you are interested in.
2. Set the newly created layout as an actual layout for the page and perform page synthesis.
Please see the Java code sample below:
You can learn more about working with the document layout in the Developer’s Help → Guided Tour → Advanced Techniques → Working with Layout and Blocks section.
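The selection logic in step 1 can be illustrated in isolation. The sketch below does not use the FineReader Engine API: BlockType and Block are hypothetical stand-in types (not the SDK's BlockTypeEnum or IBlock), and a plain list plays the role of the layout. It only shows the idea of leaving the analyzed blocks untouched and copying the table blocks, in their original order, into a new collection.

```java
import java.util.ArrayList;
import java.util.List;

public class TableBlockFilter {

    // Stand-in for the block type enumeration (hypothetical, not the SDK's BlockTypeEnum).
    public enum BlockType { TEXT, TABLE, PICTURE }

    // Stand-in for a layout block: only a type and a label.
    public static final class Block {
        public final BlockType type;
        public final String label;

        public Block(BlockType type, String label) {
            this.type = type;
            this.label = label;
        }
    }

    // Builds a new "layout" that holds only the table blocks, preserving their order.
    // The input list is not modified.
    public static List<Block> keepOnlyTables(List<Block> analyzedBlocks) {
        List<Block> newLayout = new ArrayList<>();
        for (Block block : analyzedBlocks) {
            if (block.type == BlockType.TABLE) {
                newLayout.add(block);
            }
        }
        return newLayout;
    }

    public static void main(String[] args) {
        List<Block> analyzed = new ArrayList<>();
        analyzed.add(new Block(BlockType.TEXT, "intro paragraph"));
        analyzed.add(new Block(BlockType.TABLE, "first table"));
        analyzed.add(new Block(BlockType.TABLE, "second table"));

        for (Block block : keepOnlyTables(analyzed)) {
            System.out.println("kept: " + block.label);
        }
    }
}
```

Copying into a new collection, rather than deleting from the analyzed one while iterating over it, avoids losing the blocks before they have been inspected.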
It is still giving zero blocks.
It gives an error at engine.CreateLayout();
And I am still getting zero blocks.
Hello Rama!
Could you give us some more details about the issue that you face? What kind of error is it? What form does it take?
In addition, I would like to apologize for a mistake that slipped into the code sample. Please place the following lines outside of the inner for (int i = 0; i < blocksCount; i++) { ... } block:
So, the changed code should look in the following way:
If you still have difficulties after the source code modification, please describe the issue as fully as possible to help us to assist you better.
I am able to check the blocks, but how can we drop all the other blocks, keep the text blocks in the document, and export them?
Hi Rama,
So that we can assist you better, could you please clarify which version of the ABBYY product you are using?
If creating a new ILayout instance does not work for you, please try the following:
Please find below the code snippet that illustrates the suggested approach:
Hi, I have two questions about FineReader 14 for Windows (Corporate):
1) When FineReader converts an HTML document to PDF, is there a way to avoid having any page breaks in the PDF document? My documents have both text and tables, and I want to avoid tables being split across two pages in the PDF document.
2) Can I associate pre-formatted table templates with a specific document in Hot Folder, so that when FineReader scans/OCRs that document it automatically finds the table associated with the template and applies the template to it?
Hi Helen,
Thank you, it is working. But the issue is that some tables are not analyzed properly, and some parts of a table are not identified as a table. Which parameters do I have to set to improve this and have the document analyzed well?
Hi Rama,
Please try tuning the parameters of the IFRDocument::Analyze() method. For example, create an IPageAnalysisParams object and set its AggressiveTableDetection property to true. If part of the table appears as a picture in the result file, also try setting the DetectPictures property of IPageAnalysisParams to false. Then pass the newly created object to the IFRDocument::Analyze() method:
You can also specify the particular block on a page as a table block and analyze its structure with the help of the IFRPage::AnalyzeTable() method. Please learn more about this method from the Developer's Help.