When a single table has many rows and spans perhaps 10 pages in a PDF file, with the same column structure on all 10 pages, is it possible to get a single table as output?
Example: monthly bank or brokerage statements that list all the transactions in an account for the month, perhaps over a span of 5-10 pages. Columns are consistent across all pages, BUT the bank/brokerage firm probably used a "Report Writer" off a mainframe, where things like repeating dates are suppressed after the first occurrence to clean up / simplify the printout. In the worst case the statements may have odd and even page layout differences to accommodate a three-hole punch. Column headings repeat on each page, and outside the table area are titles, page numbers, "continued on the next page", etc. I just want one clean table as output.
ADRT (Adaptive Document Recognition Technology): A technology that increases the quality of conversion of multi-page documents. For example, it can recognize such elements as headings, headers and footers, footnotes, page numbering, and signatures.
The ADRT feature seems spot on. I am very new to the OCR Editor and have only spent a few hours trying to do the above. I wish the documentation covered best practices for recognizing multi-page tables (it strikes me as a very frequent use case). Everything seems very PAGE by PAGE oriented (one table per page).
Of course I can clean things up after OCR in Excel using VBA scripts, e.g. fill in blanks from above where the Report Writer suppressed them, remove the column headings embedded on every page, etc. But realigning columns and dealing with radically different column widths/positions on each page, each one a little "island", is a pain.
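To make that post-OCR cleanup concrete, here is a minimal sketch in Python rather than VBA. It assumes each page's table has already been exported as a list of rows with the same number of columns, and that the first row of the first page is the column heading. The function name `merge_page_tables` and the sample column names are hypothetical, not part of any OCR product. It only covers the two easy steps (dropping the repeated headings and filling suppressed values from above); it does not attempt the column realignment problem.

```python
def _normalize(row):
    """Compare rows ignoring case and surrounding whitespace."""
    return [cell.strip().casefold() for cell in row]


def merge_page_tables(pages, fill_columns=()):
    """Merge per-page tables into one table (hypothetical helper).

    pages        -- list of pages; each page is a list of rows (lists of str)
    fill_columns -- heading names whose blank cells are filled from the row above
    """
    pages = [page for page in pages if page]  # skip pages with no rows
    if not pages:
        return []

    # Assumption: the first row of the first page is the column heading.
    header = [cell.strip() for cell in pages[0][0]]
    header_key = _normalize(header)
    fill_idx = [header.index(name) for name in fill_columns]

    merged = [header]
    last_seen = {}  # column index -> last non-blank value seen so far
    for page in pages:
        for row in page:
            cells = [cell.strip() for cell in row]
            if _normalize(cells) == header_key:
                continue  # heading repeated at the top of a page
            if not any(cells):
                continue  # completely empty row
            for i in fill_idx:
                if i >= len(cells):
                    continue
                if cells[i]:
                    last_seen[i] = cells[i]
                elif i in last_seen:
                    cells[i] = last_seen[i]  # value suppressed by the report writer
            merged.append(cells)
    return merged


if __name__ == "__main__":
    page1 = [["Date", "Description", "Amount"],
             ["01/03", "Coffee", "-4.50"],
             ["", "Groceries", "-62.10"]]
    page2 = [["Date", "Description", "Amount"],
             ["", "Fuel", "-40.00"],
             ["01/05", "Deposit", "500.00"]]
    for row in merge_page_tables([page1, page2], fill_columns=["Date"]):
        print(row)
```

The fill carries over page boundaries on purpose, since a suppressed date at the top of a page refers to the last date printed on the previous page.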
I suspect the ADRT "under the hood" can easily do this - perhaps I am confused by the GUI or the early documentation on all the table features (especially multi page).
Any tips or techniques most appreciated.