When a single table has many rows and spans perhaps 10 pages in a PDF file, with the same column structure on every page, is it possible to get a single table as output?
Example: monthly bank or brokerage statements that list all the transactions in an account for the month, perhaps over a span of 5-10 pages. The columns are consistent across all pages, BUT the bank/brokerage firm probably used a "Report Writer" off a mainframe, where things like repeating dates are suppressed after the first occurrence to clean up and simplify the printout. In the worst case the statements may have odd- and even-page layout differences to accommodate a three-hole punch. Column headings repeat on each page, and outside the table area are titles, page numbers, "continued on the next page", etc. I just want one clean table as output.
ADRT (Adaptive Document Recognition Technology): A technology that increases the quality of conversion of multi-page documents. For example, it can recognize such elements as headings, headers and footers, footnotes, page numbering, and signatures.
The ADRT feature seems spot on. I am very new to the OCR Editor and have only spent a few hours trying to do the above. I wish the documentation covered best practices for recognizing multi-page tables (it strikes me as a very frequent use case). Everything seems very PAGE by PAGE oriented (one table per page).
Of course, I can clean things up after OCR in Excel using VBA scripts, e.g. fill in blanks from the row above where the Report Writer suppressed them, remove the column headings embedded on every page, etc. But realigning columns and dealing with radically different column widths/positions on each page, each as its own little "island", is a pain.
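A minimal sketch of that kind of post-OCR cleanup, written in Python rather than VBA. It assumes each page's table has already been exported as a list of rows (for example, from per-page CSV files) and that the columns are already aligned across pages, so it does not address the realignment problem. `merge_pages` and its parameters are hypothetical names used for illustration only; they are not part of FineReader.

```python
from typing import Dict, List, Sequence

Row = List[str]


def merge_pages(pages: Sequence[Sequence[Row]],
                header: Row,
                fill_columns: Sequence[int]) -> List[Row]:
    """Merge per-page tables into one table (hypothetical helper).

    - Drops the column-heading row repeated on each page.
    - Drops rows that are completely blank.
    - Fills blank cells in `fill_columns` with the last non-blank value
      seen above, carrying the value across page boundaries.
    """
    def norm(row: Row) -> List[str]:
        return [cell.strip().lower() for cell in row]

    header_key = norm(header)
    merged: List[Row] = [list(header)]
    last_seen: Dict[int, str] = {}

    for page in pages:
        for raw in page:
            if norm(raw) == header_key:
                continue  # repeated column headings
            if not any(cell.strip() for cell in raw):
                continue  # fully blank row
            # Pad short rows so every column index in the header exists.
            row = list(raw) + [""] * (len(header) - len(raw))
            for col in fill_columns:
                if row[col].strip():
                    last_seen[col] = row[col]
                elif col in last_seen:
                    row[col] = last_seen[col]  # value suppressed by the report writer
            merged.append(row)
    return merged


# Made-up sample data: two pages of one statement, dates suppressed after the first.
pages = [
    [["Date", "Description", "Amount"],
     ["01/03", "Deposit", "500.00"],
     ["", "Check 101", "-42.10"]],
    [["Date", "Description", "Amount"],
     ["", "ATM withdrawal", "-60.00"],
     ["01/05", "Interest", "0.12"]],
]
table = merge_pages(pages, header=["Date", "Description", "Amount"], fill_columns=[0])
for row in table:
    print(row)
```

Filling forward across the page boundary is a deliberate choice here: a date suppressed at the top of page 2 is taken from the last row of page 1. If a statement restarts the date at the top of every page, that carry-over is harmless.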
I suspect the ADRT "under the hood" can easily do this - perhaps I am confused by the GUI or the early documentation on all the table features (especially multi page).
Any tips or techniques most appreciated.
Comments
Hello,
Unfortunately, detecting and recognizing a multi-page table as one continuous structure, with its own repeated header, common column widths, and omitted repeated values is not supported by FineReader PDF. Anyway, thank you for the suggestion, it will be forwarded to the product team.
Best regards,
Yuriy
After much experimentation I am getting better results. I got pulled off onto another project for a week but wanted to post this, which worked to solve most issues:
The software is very impressive. The documentation is of good quality but could be greatly expanded in the table area. For example, can you add vertical separators on page 1, and will they be saved and used when applying that area template to pages 2-N? (It seems like ONLY the rectangle plus the type (table) are saved to the template.) Skip adding the separators on page 1 if they are not saved into the template.
Thankfully I have 600 dpi scans from originals. An automatic document feeder (ADF) was used. I suspect that slight left/right shifting of the page in the ADF (or left/right shifting between the front and back sides when the originals were printed) could upset the template area recognition, but so far that has not been an issue.
It's a mystery why you can't load a template (consisting of a rectangle typed as a table) on each page and have the option to run the "Analyze table structure" module right after the template is applied to each page. It would be great to have manual vertical separators and/or the verticals in the page 1 area saved into the template and then favored when "Analyze table structure" is later run on pages 2-N.
Really, the only need for "Analyze table structure" on pages 2-N is to slice rows.
I wish area templates were XML and deeply documented, along with more documentation on how "Analyze table structure" works to slice and dice. As mentioned above, it would be great to have it favor manual separators, plus perhaps have an (information|warning|error) log.
The OCR subsystem seems to have the ability to "learn new things". It would be great for the table slice/dice subsystem to have a similar training capability. Yes, it's fuzzy, but there are things to "favor".
Overall very impressive capabilities!
Hi RJL,
Thank you for your appreciation of FineReader PDF's capabilities and for putting down such a detailed summary of your findings! I'm sure it will be really useful for other people who work with this kind of complex multi-page table.
An area template is definitely a tool that may help in dealing with this kind of table. One just needs to understand its limits when using it. As you rightly mentioned, if the tables on different pages are positioned differently or the scans are somewhat shifted, the area template will not match the tables. There are no tools to automatically align a template to each page in FineReader PDF (there are other, enterprise-grade ABBYY solutions, such as FlexiCapture, that can do that sort of task), and one would need to manually inspect the pages in the project to check for area-table mismatches and manually adjust the areas when needed.
I'd like to comment on just a couple of things you've mentioned.
1. "can you add vertical separators on page 1 and will they be saved and used when applying that area template to pages 2-N?" - Actually, it's possible to add a vertical separator to a table area manually (the button for that is right next to the one for adding a horizontal separator), and if you save the template again after that, the separator will stay in the template.
2. "It's a mystery why you can't load a template (consisting of a rectangle typed table) on each page and have the option to have the "Analyze table structure" module run right after the template is applied to each page." - You can do that: right-click on the loaded template area and choose "Analyze table structure". It must be done separately for each page, though, that's correct, and separators that were manually added or otherwise edited in the template will not be preferred (i.e. kept).
Best regards,
Yuriy