Hello,
I'm trying to use Cloud OCR SDK to convert a PDF file to text. I would like to get structural HTML rather than the XML output, which carries tag and position information for (almost) every character.
I'd like to ask:
- Do you know of an XSLT stylesheet that we could use to convert the XML to HTML?
- Does ABBYY intend to provide such a feature directly in the near future?
Thanks! Zaf.
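For readers facing the same problem, here is a minimal Python sketch of the kind of transformation being asked about. It uses the standard library rather than XSLT. It assumes the XML export nests `page`, `par`, `line`, and per-character `charParams` elements; those element names are an assumption and should be checked against your actual output. The function names are hypothetical and this is not an official ABBYY tool.

```python
import html
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://...}par' down to 'par'."""
    return tag.rsplit("}", 1)[-1]


def children_named(element, name):
    """Yield every descendant of element whose local tag name matches."""
    return (e for e in element.iter() if local_name(e.tag) == name)


def xml_to_html(xml_text):
    """Collapse per-character OCR XML into one <p> per paragraph.

    Assumed layout (verify against your export):
    page > ... > par > line > ... > charParams (one element per character).
    """
    root = ET.fromstring(xml_text)
    parts = []
    for page in children_named(root, "page"):
        parts.append('<div class="page">')
        for par in children_named(page, "par"):
            lines = []
            for line in children_named(par, "line"):
                # Each charParams element holds a single recognized character.
                chars = [c.text or "" for c in children_named(line, "charParams")]
                lines.append("".join(chars))
            # Join the lines of a paragraph with spaces; positions are discarded.
            text = " ".join(l for l in lines if l.strip())
            if text:
                parts.append("<p>" + html.escape(text) + "</p>")
        parts.append("</div>")
    return "\n".join(parts)
```

The sketch drops all coordinates and formatting and does not rejoin words hyphenated across lines, so it only suits text-based extraction, not layout-preserving output.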
Comments
It is possible to add support for an HTML export format without pictures.
To design a solution, our analyst has asked for the following information:
Hello Anastasia,
We are not interested in the image parts of the PDFs, such as backgrounds, logos, footers, and separators. The important part for us is the text, which we can use for text-based information extraction.
Thank you.
Hello,
Is there any progress or development on this subject that you can share?
Thank you.
The analyst said that an HTML export format should be added, but it will take some time, so he recommends using the following workaround:
Hello. Can you please tell me whether PDF to HTML conversion has been implemented yet? If so, can you point me to documentation, samples, or any other information that would help me do such a conversion?
Regards, Alexey.
"convert the pdf with the recognized text to HTML as it is described in this post." But that post says it is impossible.
Any progress on this issue?
You can convert PDF TextAndImages to HTML5 by means of PDF to HTML5 Converter.
Is this method appropriate for you?
Hello. Thank you for the response. So ABBYY itself does not have such a service, am I right?
Unfortunately at the moment we don't have such functionality.
Please create a feature request and describe your scenario there. Do you need to preserve formatting and pictures?
Hello. Yes. What I need is to send a PDF document to the service and get back HTML code for each page, formatted as in the PDF, with these constraints:
1. No JavaScript.
2. No global selectors: if there are CSS styles, they should apply only to the page's HTML and not affect any HTML elements outside the page (I will insert this HTML into my own HTML page rather than use it as a separate page).
3. Elements should have unique IDs, or no IDs at all.
4. I should then be able to insert the HTML of all pages into one final HTML page of my own.
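Until such an export exists, the constraints in this request could be approximated by post-processing per-page HTML from any converter. The following is a rough sketch using only Python's standard `html.parser`. The names `PageFragmentCleaner`, `clean_page`, `join_pages`, and the `ocr-page` class are hypothetical. It removes `<script>` and `<style>` blocks and inline `on...` handlers, prefixes every `id` with a page-specific string, and wraps each page in its own `<div>`.

```python
from html import escape
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style"}   # dropped together with their contents
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PageFragmentCleaner(HTMLParser):
    """Re-emit HTML without scripts/global styles and with prefixed IDs."""

    def __init__(self, id_prefix):
        super().__init__(convert_charrefs=True)
        self.id_prefix = id_prefix
        self.out = []
        self.skipping = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skipping = True
            return
        kept = []
        for name, value in attrs:
            if name.startswith("on"):       # inline handlers such as onclick
                continue
            if name == "id" and value:
                value = self.id_prefix + value
            kept.append(name if value is None else f'{name}="{escape(value)}"')
        self.out.append("<" + " ".join([tag] + kept) + ">")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skipping = False
        elif tag not in VOID_TAGS:          # void elements have no end tag
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def clean_page(fragment, page_number):
    """Clean one page of HTML and wrap it in a page-specific container."""
    cleaner = PageFragmentCleaner(f"p{page_number}-")
    cleaner.feed(fragment)
    cleaner.close()
    body = "".join(cleaner.out)
    return f'<div class="ocr-page" id="ocr-page-{page_number}">{body}</div>'


def join_pages(fragments):
    """Combine cleaned pages, numbered from 1, into one insertable block."""
    return "\n".join(clean_page(f, n) for n, f in enumerate(fragments, start=1))
```

Inline `style` attributes survive, but anything that was styled through a removed `<style>` block loses that formatting, and references to renamed IDs (for example `href="#..."`) are not rewritten. Treat this as a starting point, not a complete sanitizer.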
I have created a feature request for HTML export. Please vote there. I hope this functionality will be added in the future.
Thank you. I couldn't find how to vote there, so I just posted a new comment. I hope this helps.