Reorder recurring PDF tables & lines

I have a 1,000-page PDF in which every page has the same layout: 10 small tables, most of them 2 or 3 lines long and a few cells wide. Some tables have a merged cell or two. Two of the tables may vary in length, but I don't need those.

For it to be useful, I need the OCR Editor to convert the PDF to Excel format, with all the information on each page rearranged onto a single row.

Further, I need FineReader to do this automatically for each new page, so that instead of 1,000 pages of tables I end up with 1,000 rows.

Is this possible?



1 comment

  • Nadezhda A. Solovyeva

    FineReader Engine can convert a PDF with tables to Excel or CSV format. For your scenario, I would suggest the following:

    1) Read all pages with DocumentProcessingParams.PageProcessingParams.PageAnalysisParams.AggressiveTableDetection = true, so that FREngine detects as many tables as possible.

    2) If your table structure allows it, set TableAnalysisParams.SplitOnlyBySeparators = true and TableAnalysisParams.SingleLinePerCell = true to make cell detection more accurate.

    If these steps extract your tables correctly, you can continue. Otherwise, please use a more advanced structured-document OCR tool, such as FlexiCapture Engine.

    3) Export the result to CSV.

    4) Post-process the CSV with text manipulation functions to merge each page's tables into a single row.
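
    As a rough illustration of step 4, here is a minimal Python sketch using only the standard csv module. It is not part of FineReader Engine, and it rests on assumptions about the export that you would need to check against your real output: that you have the CSV text of each page separately, and that the tables on a page are separated by blank lines. The function names (flatten_page, pages_to_rows, write_rows) and the skip_tables parameter are hypothetical helpers made up for this example.

```python
import csv
import io


def flatten_page(csv_text, skip_tables=()):
    """Collapse one page's CSV text into a single flat list of cells.

    ASSUMPTION: tables on a page are separated by blank lines. Adjust the
    splitting rule if your exported CSV marks table boundaries differently.
    """
    tables, current = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        if not any(cell.strip() for cell in row):
            # A blank (or all-empty) line closes the current table.
            if current:
                tables.append(current)
                current = []
        else:
            current.append(row)
    if current:
        tables.append(current)

    flat = []
    for index, table in enumerate(tables):
        if index in skip_tables:
            continue  # e.g. the variable-length tables that are not needed
        for row in table:
            flat.extend(cell.strip() for cell in row)
    return flat


def pages_to_rows(page_texts, skip_tables=()):
    """Turn a sequence of per-page CSV texts into one output row per page."""
    return [flatten_page(text, skip_tables) for text in page_texts]


def write_rows(rows, out_file):
    """Write the combined rows to an open text file as CSV."""
    csv.writer(out_file).writerows(rows)
```

    With 1,000 page texts as input, pages_to_rows returns 1,000 rows. Passing the zero-based positions of the two variable-length tables as skip_tables keeps every row the same width, so the cells line up in columns when the file is opened in Excel.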

