Multi-page table (fixed column structure with many rows, spread over many PDF pages)

When a single table has many rows and spans perhaps 10 pages of a PDF file, with the same column structure on all 10 pages, is it possible to get a single table as output?

Example: monthly bank or brokerage statements that list all the transactions in an account for the month, over a span of perhaps 5-10 pages. Columns are consistent across all pages, BUT the bank or brokerage firm probably used a "Report Writer" off a mainframe, where things like repeating dates are suppressed after the first occurrence to clean up and simplify the printout. In the worst case the statements may have odd and even page layout differences to accommodate a three-hole punch. Column headings repeat on each page, and outside the table area are titles, page numbers, "continued on the next page", etc. I just want one clean table as output.

ADRT (Adaptive Document Recognition Technology) A technology that increases the quality of
conversion of multi-page documents. For example, it can recognize such elements as headings,
headers and footers, footnotes, page numbering, and signatures.

The ADRT feature seems spot on. I am very new to OCR Editor and have only spent a few hours trying to do the above. I wish the documentation covered best practices for recognizing multi-page tables (it strikes me as a very frequent use case). Everything seems very page-by-page oriented (one table per page).

Of course I can clean things up after OCR in Excel using VBA scripts, e.g. fill in blanks from the row above where the Report Writer suppressed them, remove the column headings embedded on every page, etc. But realigning columns and dealing with radically different column widths and positions on each page, each one a little "island", is a pain.

I suspect ADRT "under the hood" can easily do this. Perhaps I am confused by the GUI or by the early documentation on all the table features (especially multi-page).
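For anyone who would rather script the cleanup than do it by hand, here is a minimal sketch of the two fixes described above (dropping repeated header rows and filling suppressed values from the row above). It is written in Python rather than VBA, and it assumes the per-page tables have already been exported as lists of text cells with the same column count. The function name, the column layout, and the "continued on next page" phrase are illustrative assumptions, not anything FineReader produces or documents.

```python
def clean_statement_rows(rows, header, fill_columns=(0,),
                         noise_phrases=("continued on next page",)):
    """Merge per-page rows into one clean table body (header not included).

    Assumes every row has the same number of columns as the header.
    """
    header = [cell.strip() for cell in header]
    cleaned = []
    last_seen = {}  # column index -> most recent non-blank value
    for row in rows:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # completely blank row
        if cells == header:
            continue  # column headings repeated at the top of a page
        if any(p in cell.lower() for cell in cells for p in noise_phrases):
            continue  # stray text that fell inside the table rectangle
        for col in fill_columns:
            if cells[col]:
                last_seen[col] = cells[col]
            elif col in last_seen:
                cells[col] = last_seen[col]  # value the report writer suppressed
        cleaned.append(cells)
    return cleaned


if __name__ == "__main__":
    # Hypothetical two-page statement; real exports will differ.
    HEADER = ["Date", "Description", "Amount"]
    page_1 = [HEADER,
              ["03/01", "Deposit", "500.00"],
              ["", "Check 101", "-120.00"],
              ["", "Continued on next page", ""]]
    page_2 = [HEADER,
              ["", "ATM withdrawal", "-60.00"],
              ["03/02", "Card purchase", "-15.25"]]
    for row in clean_statement_rows(page_1 + page_2, HEADER):
        print(row)
```

This only handles the easy part. It does nothing about columns that land in different positions on different pages, which is the real pain point.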

Any tips or techniques most appreciated.

 

 

Comments

3 comments

  • Yuriy Korotkevych

    Hello,

    Unfortunately, detecting and recognizing a multi-page table as one continuous structure, with its repeated header, common column widths, and omitted repeated values, is not supported by FineReader PDF. Thank you for the suggestion anyway; it will be forwarded to the product team.

    Best regards,

    Yuriy

  • RJL

    With much experimentation I am getting better results. I got pulled off onto another project for a week, but wanted to post this procedure, which solved most of the issues:

    • Turn off automatic.
    • Draw a recognition area on page 1 for the table.
    • Be sure the rectangle goes to the left and right edges of the paper (to start).
    • Include the table column headers. Include text such as "Continued on next page" if the text is inside the basic table rectangle (deal with that later).
    • Set the area type to table.
    • Analyze table structure. Don't worry if content appears centered in cells. Don't worry if it pulls the left and right margins in from the page edges.
    • Area->Save area template
    • Area->Load area template with the option "Apply to All pages"
    • Set up a hot key for "Analyze table structure".
    • Page through the document manually and run "Analyze table structure" on each page. On many pages it will find all the cells perfectly. A few may require adding or adjusting table separators to get it perfect. With practice it can go pretty fast.
    • Save as Excel with the following options: "Ignore text outside tables" and "Convert numeric values to numbers".
    • In Excel, get rid of duplicate header rows and any unwanted repetitive text that falls inside the table rectangle. If things like dates were only printed once and you need them on every row, use Excel to fill in the repeats. Use macros / VBA if needed.
    • If you have the basic opening balance, transactions, and closing balance, be sure the sum of the transaction detail rows matches the change in the balance.
    • I've had as much as 100% OCR accuracy! OK, that was IMPRESSIVE, ABBYY. I have done 10-15 pages with only a single row division error (two rows recognized as one). Very impressive, but an error like that takes time to find.
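The balance check in the list above can be scripted once the amounts are out of Excel. The following is a minimal Python sketch, assuming the amounts arrive as text and that negatives may be written either with a minus sign or in parentheses (an assumption about the statement format, not something the export guarantees). The function names are hypothetical.

```python
from decimal import Decimal


def parse_amount(text):
    """Parse amounts such as '1,234.56', '-60.00' or '(60.00)' into a Decimal."""
    s = text.strip().replace(",", "").replace("$", "")
    if s.startswith("(") and s.endswith(")"):
        s = "-" + s[1:-1]  # assumed accounting-style negative
    return Decimal(s)


def reconcile(opening, closing, amounts):
    """Return (sum of transactions) minus (closing - opening).

    Zero means the detail rows account for the whole change in balance.
    A non-zero result often equals the amount of a dropped or merged row.
    """
    total = sum((parse_amount(a) for a in amounts), Decimal("0"))
    expected = parse_amount(closing) - parse_amount(opening)
    return total - expected


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    amounts = ["500.00", "-120.00", "(60.00)", "-15.25"]
    print(reconcile("1,000.00", "1,304.75", amounts))
```

Decimal is used instead of float so that cents add up exactly. When the result is not zero, searching the original scan for that exact amount is a quick way to locate a row division error like the one mentioned above.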

    The software is very impressive. The documentation is of nice quality but could be greatly expanded in the table area. For example, can you add vertical separators on page 1 and will they be saved and used when applying that area template to pages 2-N? (It seems like ONLY the rectangle plus the type (table) are saved to the template.) Skip adding the separators on page 1 if they are not saved into the template.

    Thankfully I have 600 dpi scans from originals. An automatic document feeder (ADF) was used. I suspect that slight left-right shifting of the page in the ADF (or left/right shifting between front and back in the original printing of the documents) could upset the template area recognition, but so far that has not been an issue.

    It's a mystery why you can't load a template (consisting of a rectangle typed as table) on each page and have the option to run the "Analyze table structure" module right after the template is applied to each page. It would be great to have manual vertical separators and/or the verticals in the page 1 area saved into the template and then favored when "Analyze table structure" is later run on pages 2-N.

    Really, the only need for "Analyze table structure" on pages 2-N is to slice rows.

    I wish area templates were XML and deeply documented, along with more documentation on how "Analyze table structure" works to slice and dice. As mentioned above, it would be great to have it favor manual separators, plus perhaps produce an (information|warning|error) log.

    The OCR subsystem seems to have the ability to "learn new things". It would be great for the table slice/dice subsystem to have a similar training capability. Yes, it's fuzzy, but there are things to "favor".

    Overall very impressive capabilities!

  • Yuriy Korotkevych

    Hi RJL,

    Thank you for your appreciation of FineReader PDF's capabilities and for putting down such a detailed summary of your findings! I'm sure it will be really useful for other people who work with this kind of complex multi-page table.

    An area template is definitely a tool that may help in dealing with such tables. One just needs to understand its limits when using it. As you rightly mentioned, if the tables on different pages are positioned differently or the scans are somewhat shifted, the area template will not match the tables. There are no tools in FineReader PDF to automatically align a template to each page (other, enterprise-grade ABBYY solutions such as FlexiCapture can do that sort of task), so one needs to manually inspect the pages in the project to check for area-table mismatches and adjust the areas when needed.

    I'd like to comment on just a couple of things you've mentioned.

    1. "can you add vertical separators on page 1 and will they be saved and used when applying that area template to pages 2-N?" - Actually, it's possible to add a vertical separator to a table area manually (the button for that is right next to the one for adding a horizontal separator), and if you save the template again after that, the separator will stay in the template.

    2. "Its a mystery why you can't load a template (consisting of a rectangle typed table) on each page and have the option to have the "Analyze table structure" module run right after the template is applied to each page." - You can do that: right-click the loaded template area and choose "Analyze table structure". It must be done separately for each page though, that's correct, and separators that were manually added or otherwise edited in the template will not be preferred (i.e. kept).

    Best regards,

    Yuriy

