Processable output format preserving text formatting, layout, and images

I'm looking for an output format that preserves

  • text formatting,
  • layout (rudimentary), and
  • images

and that can also be processed afterwards without tremendous effort.

As far as I can judge right now, the options are as follows:

  • XML - nicely provides processable layout, but omits text formatting and images (if any)
  • ALTO XML - same as above (the export does not make use of the FILEID attribute of type IllustrationType)
  • docx, xlsx, pptx - proprietary formats hard to process
  • txt - preserves neither layout, text formatting, nor images
  • rtf - does not preserve any images
  • PDF (pdfSearchable or pdfa) - does not provide any layout information
  • PDF (pdfTextAndImages) - preserves layout, text formatting and images, but extracting any information (especially layout) from the resulting PDF is nearly impossible

Unfortunately, none of the formats mentioned satisfies my needs, for the reasons given.

Am I missing something here? Any help is highly appreciated.

Thanks, Nico

Comments

  • Permanently deleted user

    Hello, Nico,

    Have you tried using the XML export together with the searchable PDF export? You could get the layout information from the XML, then get the images and font information from the PDF.

    As mentioned here, "setting multiple export formats does not affect the cost of task processing".
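
To gauge how much post-processing the XML half of this suggestion would involve, here is a minimal Python sketch that pulls block types and bounding boxes out of a layout XML. The sample document, the element names (`page`, `block`) and the attribute names (`blockType`, `l`, `t`, `r`, `b`) are assumptions made for illustration and are not confirmed anywhere in this thread; check them against a real export before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real export may use other names and a namespace.
SAMPLE_XML = """<document>
  <page width="1000" height="1400">
    <block blockType="Text" l="100" t="50" r="900" b="200"/>
    <block blockType="Picture" l="100" t="250" r="500" b="700"/>
  </page>
</document>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_blocks(xml_text):
    """Return one dict per block: page index, block type, (l, t, r, b) box."""
    root = ET.fromstring(xml_text)
    pages = [e for e in root.iter() if local_name(e.tag) == "page"]
    blocks = []
    for page_index, page in enumerate(pages):
        for elem in page.iter():
            if local_name(elem.tag) != "block":
                continue
            box = tuple(int(elem.get(key)) for key in ("l", "t", "r", "b"))
            blocks.append(
                {"page": page_index, "type": elem.get("blockType"), "box": box}
            )
    return blocks


if __name__ == "__main__":
    for block in extract_blocks(SAMPLE_XML):
        print(block)
```

The boxes of the picture blocks could then serve as crop regions on the matching page of the PDF or of the original scan. That cropping step depends on the PDF tooling in use and is not shown here.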

  • Permanently deleted user

    That might work, provided I spent a lot of extra effort on a post-processing step that merges both output files. What's more, I suspect that merge would sometimes turn out to be quite fragile. Hence, at least for me, it's not the way to go. What I'm looking for is a single output file that meets all the requirements mentioned above.

  • Permanently deleted user

    Could you please specify why you can't use the RTF export format? When you use our RTF export format, pictures are embedded in the output file.

  • Permanently deleted user

    When I was evaluating ABBYY OCR SDK in April 2013, I observed the situation stated in my initial question above: the RTF output file did not preserve any of the images contained in the submitted input. Does that mean this has changed in the meantime? Do RTF output files now contain pictures as well? In all cases?
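
A quick way to check a given RTF output for embedded pictures is to count its `\pict` control words, which is how RTF marks a picture group. The Python sketch below is only a heuristic under that assumption: it does not parse RTF group nesting, it does not validate the image data, and the function name is made up for illustration.

```python
import re

# Match the control word \pict, but not an escaped backslash followed by
# the text "pict", and not longer control words that merely start with it.
PICT_RE = re.compile(rb"(?<!\\)\\pict(?![a-zA-Z])")


def count_rtf_pictures(data: bytes) -> int:
    """Count \\pict control words in raw RTF bytes (heuristic only)."""
    return len(PICT_RE.findall(data))


if __name__ == "__main__":
    import sys

    # Usage: python count_pict.py output.rtf
    with open(sys.argv[1], "rb") as handle:
        print(count_rtf_pictures(handle.read()))
```

A count of zero for an input page that clearly contains images would reproduce the behaviour described above. A non-zero count only shows that picture groups are present, not that every image survived.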

  • Eugenia Meshcheryakova

    Could you please provide us with the images you're processing? We tested our system's RTF export on sample documents containing images, and the output was OK. If it is more convenient, you can contact us at cloudocrsdk@abbyy.com.
