I'm looking for an output format preserving
- text formatting,
- layout (rudimentary), and
- images
which also allows for being processed afterwards without tremendous effort.
As far as I can judge right now, the options are as follows:
- XML - nicely provides processable layout, but omits text formatting and images (if any)
- Alto XML - same here (does not make use of the FILEID attribute of type IllustrationType)
- docx, xlsx, pptx - proprietary formats hard to process
- txt - does not preserve layout, text formatting and images
- rtf - does not preserve any images
- PDF (pdfSearchable or pdfa) - does not provide any layout information
- PDF (pdfTextAndImages) - preserves layout, text formatting and images, but extracting any information (especially layout) from the resulting PDF is nearly impossible
Unfortunately, none of the formats mentioned satisfies my needs, for the reasons given.
Am I missing something here? Any help is highly appreciated.
Thanks, Nico
Comments
Hello, Nico,
Have you tried to use the XML export together with the searchable PDF export? You could get layout info from xml, then get images and font info from pdf.
As mentioned here, "setting multiple export formats does not affect the cost of task processing".
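To illustrate the "layout info from xml" half of this suggestion, here is a minimal Python sketch that reads block rectangles from an XML export. The element name `block` and the attributes `blockType`, `l`, `t`, `r`, `b` are assumptions modelled on a FineReader-style schema, and the sample document is made up; please check them against the XML your tasks actually return.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumptions,
# not taken from an actual export.
SAMPLE_XML = """<document>
  <page width="2480" height="3508">
    <block blockType="Text" l="200" t="150" r="2200" b="600"/>
    <block blockType="Picture" l="300" t="700" r="1500" b="1900"/>
  </page>
</document>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_blocks(xml_text):
    """Return each block's type and its (left, top, right, bottom) box."""
    root = ET.fromstring(xml_text)
    blocks = []
    for elem in root.iter():
        if local_name(elem.tag) != "block":
            continue
        box = tuple(int(elem.get(key, 0)) for key in ("l", "t", "r", "b"))
        blocks.append({"type": elem.get("blockType", "Unknown"), "box": box})
    return blocks


if __name__ == "__main__":
    for block in extract_blocks(SAMPLE_XML):
        print(block["type"], block["box"])
```

If the export does mark picture regions this way, their rectangles could serve as a starting point for locating the matching images in the PDF output; that part is not shown here.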
That might work, provided I spent a lot of extra effort on a post-processing step that merges both output files. What's more, I suspect this would sometimes turn out to be quite shaky. Hence, at least for me, it's not the way to go. What I'm looking for is one output file that meets all the requirements mentioned above.
Could you please specify why you can’t use the RTF export format? When you use our RTF export format pictures are embedded in the output file.
When I was evaluating ABBYY OCR SDK in April 2013, I observed the situation stated in my initial question above: the RTF output file did not preserve any images contained in the submitted input. Does that mean it has changed in the meantime? Do RTF output files now contain pictures as well? In all cases?
Could you please provide us with the images you're processing? We tested our system's RTF export on the sample documents containing images and they were OK. If it is more convenient, you can contact us at cloudocrsdk@abbyy.com.