PDF to HTML in any way.

Hello,

I'm trying to use Cloud OCR SDK to convert PDF files to text. I'd like to get structural HTML instead of the XML export, which carries a tag with position information for (almost) every character.

I'd like to ask:

- Do you know of an XSLT stylesheet that we could use to convert the XML to HTML?
- Does ABBYY intend to provide this as a direct feature in the near future?
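
For illustration, the kind of XML-to-HTML mapping asked about here can also be sketched in a few lines of Python rather than XSLT. The element names used below (`page`, `par`, `line`, `charParams`) are assumptions about the layout of the XML export, so check them against the file you actually receive. The helper name `ocr_xml_to_html` is hypothetical. This is a rough sketch, not an ABBYY-provided converter, and it only handles plain paragraphs (no tables).

```python
import html
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an XML namespace prefix, e.g. '{uri}par' -> 'par'."""
    return tag.rsplit("}", 1)[-1]


def line_text(line):
    """Join the per-character elements of one line into a string."""
    return "".join(el.text or "" for el in line.iter() if local(el.tag) == "charParams")


def ocr_xml_to_html(xml_string):
    """Emit one <div> per page and one <p> per paragraph, dropping positions."""
    root = ET.fromstring(xml_string)
    out = []
    for page in (el for el in root.iter() if local(el.tag) == "page"):
        out.append('<div class="page">')
        for par in (el for el in page.iter() if local(el.tag) == "par"):
            lines = [line_text(ln) for ln in par.iter() if local(ln.tag) == "line"]
            text = " ".join(s.strip() for s in lines if s.strip())
            if text:
                out.append("  <p>" + html.escape(text) + "</p>")
        out.append("</div>")
    return "\n".join(out)


# Tiny made-up input in the assumed layout (not real SDK output).
sample = """<document xmlns="http://example.invalid/ocr">
  <page><block><text><par>
    <line><formatting><charParams l="1">H</charParams><charParams l="2">i</charParams></formatting></line>
    <line><formatting><charParams>&amp;</charParams></formatting></line>
  </par></text></block></page>
</document>"""
print(ocr_xml_to_html(sample))
```

Matching on local tag names keeps the sketch independent of whichever namespace URI the export uses. Table blocks would need their own handling (rows and cells), which is left out here.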

Thanks! Zaf.

Comments

13 comments

  • Anastasia Galimova

    It is possible to add support for an HTML export format without pictures.

    To design a solution, our analyst has asked for the following information:

    1. Do we understand correctly that you do not need to export pictures (the XML format does not contain them either)?
    2. What is the purpose of converting PDF to HTML?
    3. Do you need to save the text formatting (font size etc.)?
    4. Do you need to save the document formatting (margins etc.)?
  • ZaferK

    Hello Anastasia,

    We are not interested in the image parts of the PDFs, such as backgrounds, logos, footers, and separators. What matters to us is the text, which we use for text-based information extraction.

    1. Yes, we don't need to export images. Structural HTML with tables and paragraphs is fine.
    2. The main purpose is to prepare the PDF data for information extraction and data mining.
    3. No, we mostly don't need to keep font sizes or any other CSS. (They could be useful separators for data extraction, but we can detect the same boundaries from structured, ordered HTML tags too.)
    4. Same answer as for the previous point.

    Thank you.

  • ZaferK

    Hello,

    Is there any progress or development on this subject that you can share?

    Thank you.

  • Anastasia Galimova

    The analyst said that an HTML export format should be added, but it will take some time, so he recommends the following workaround:

    1. Recognize your file and export it to PDF using ABBYY Cloud OCR SDK.
    2. Convert the PDF with the recognized text to HTML as described in this post.
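
Step 1 of the workaround could be prepared roughly as follows. The host name, the `processImage` method, the `language` and `exportFormat` parameters, and the shape of the task response are assumptions about the Cloud OCR SDK HTTP API, not something stated in this thread, so confirm them against the current documentation. The helper names are hypothetical, and the sketch only builds the request URL and reads a task response. It does not upload anything.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumption: host, method name and parameter names below reflect the
# Cloud OCR SDK HTTP API; confirm them in the current documentation.
BASE_URL = "https://cloud.ocrsdk.com"


def process_image_url(language="English", export_format="pdfSearchable", base_url=BASE_URL):
    """Build the URL a PDF would be POSTed to for recognition and PDF export."""
    query = urllib.parse.urlencode({"language": language, "exportFormat": export_format})
    return f"{base_url}/processImage?{query}"


def parse_task(response_xml):
    """Pull id, status and result URL (None if absent) from a task response."""
    root = ET.fromstring(response_xml)
    task = next(el for el in root.iter() if el.tag.rsplit("}", 1)[-1] == "task")
    return {
        "id": task.get("id"),
        "status": task.get("status"),
        "result_url": task.get("resultUrl"),
    }


# Hypothetical response body, shaped like the task XML assumed above.
example = '<response><task id="abc" status="Completed" resultUrl="https://example.invalid/out.pdf"/></response>'
print(process_image_url())
print(parse_task(example)["status"])
```

A real client would also authenticate, upload the file body, and poll the task until it finishes before downloading the result. Those parts are omitted because they depend on account details and on API behavior not covered in this thread.
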
  • ukrainecmk

    Hello. Can you please tell me whether PDF to HTML conversion has been implemented yet? If so, can you point me to documentation, samples, or any other information that would help me do such a conversion?

    Regards, Alexey.

  • ukrainecmk

    "convert the pdf with the recognized text to HTML as it is described in this post." But that post says it is impossible.

  • ukrainecmk

    Any progress on this issue?

  • Julia Anikushina

    You can convert PDF TextAndImages to HTML5 by means of PDF to HTML5 Converter.

    Is this method appropriate for you?

  • ukrainecmk

    Hello. Thank you for the response. So ABBYY itself does not offer such a service, am I right?

  • Julia Anikushina

    Unfortunately, at the moment we don't have such functionality.

    Please create a feature request and describe your scenario there. Do you need to preserve formatting and pictures?

  • ukrainecmk

    Hello. Yes. What I actually need is to send a PDF document to the service and get back HTML code for each page, formatted just as in the PDF, but with these constraints:

    1. No JavaScript.
    2. No global selectors. If there are CSS styles, they should apply only to the page's HTML and not touch any HTML elements outside it (I will insert this HTML into my own page rather than use it as a separate page).
    3. Elements should have unique IDs, or no IDs at all.
    4. I should then be able to insert the HTML of all pages into one final HTML page of my own.
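
Until such an export exists, points 1 and 3 of this list can be approximated by post-processing whatever per-page HTML a converter produces. The sketch below is only an illustration under stated assumptions: `embed_page` is a hypothetical helper, the `pdf-p<N>-` prefix is an arbitrary choice, and the regex approach assumes double-quoted attributes. A proper HTML parser would be more robust.

```python
import re


def embed_page(fragment, page_no):
    """Make one page's HTML fragment safer to paste into a host page."""
    prefix = f"pdf-p{page_no}-"
    # Point 1: drop <script> elements entirely.
    fragment = re.sub(r"<script\b.*?</script\s*>", "", fragment, flags=re.I | re.S)
    # Point 3: prefix every id (and same-page "#id" links) with the page
    # number so IDs stay unique across pages and within the host page.
    fragment = re.sub(
        r'(?<![\w-])id="([^"]+)"',
        lambda m: f'id="{prefix}{m.group(1)}"',
        fragment,
    )
    fragment = re.sub(
        r'(?<![\w-])href="#([^"]+)"',
        lambda m: f'href="#{prefix}{m.group(1)}"',
        fragment,
    )
    # A wrapper with its own id gives page-level CSS something to scope to.
    return f'<div class="pdf-page" id="{prefix}root">{fragment}</div>'


def embed_document(pages):
    """Concatenate all pages (point 4), numbering them from 1."""
    return "\n".join(embed_page(html, n) for n, html in enumerate(pages, start=1))


print(embed_page('<p id="a">x</p><script>alert(1)</script>', 2))
```

Point 2 (no global selectors) is not solved by this sketch. Any stylesheet shipped with a page would still have to be rewritten so that each selector is nested under that page's wrapper id.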

  • Julia Anikushina

    I have created a feature request for HTML export. Please vote there. I hope this functionality will be added in the future.

  • ukrainecmk

    Thank you. I couldn't find how to vote there, so I just posted a new comment. I hope this helps.

