How to detect bold/italic characters in the ABBYY OCR cloud service?

I am using the Cloud OCR SDK in my application for OCR. I want to know how to detect bold/italic characters in the XML returned by the cloud service. I am using the following code to get the XML:

String url = string.Format("http://cloud.ocrsdk.com/processImage?language={0}&exportFormat={1},{2}", language, exportFormat, "xml");

var request = CreateRequest(url, "POST", Credentials, Proxy);

In this link, a user posted the following answer:

The parameter xcf (--xmlWriteCharFormatting) is necessary to get the font size.

I want to know whether this parameter also works in the Cloud OCR service, or whether there is some other parameter or way to detect bold/italic characters.

Comments

  • Oksana Serdyuk

    The XML output format that ABBYY Cloud OCR SDK creates is the same as the one ABBYY FineReader Engine creates with the default options. It contains information on text and characters, but does not include character formatting information, such as bold/italic/underlined font styles. Unfortunately, at present the XML schema in Cloud OCR SDK can only be expanded by setting the parameter xml:writeRecognitionVariants to true (it specifies whether the character recognition variants should be written to the output file).

    If you want information about text font styles, please vote for the feature request: http://forum.ocrsdk.com/questions/3693/feature-request-font-info-in-xml. We hope it will be implemented in the future.

  • Oksana Serdyuk

    Hi,

    We are happy to inform you that the requested functionality has recently been implemented in ABBYY Cloud OCR SDK. It is now possible to get information about paragraph and character styles in the XML export format. To do so, set the xml:writeFormatting parameter of the processImage or processDocument method to true (it is false by default).
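
For readers who want to try this, here is a minimal Python sketch of a processImage URL with xml:writeFormatting switched on. The endpoint and parameter names are the ones quoted in this thread; the helper name `build_process_image_url` and the language value "English" are placeholders chosen for illustration, not part of the SDK.

```python
from urllib.parse import urlencode

# Endpoint as quoted in the question above.
BASE_URL = "http://cloud.ocrsdk.com/processImage"


def build_process_image_url(language, write_formatting=True):
    """Return a processImage URL that asks for XML output with formatting info."""
    params = {
        "language": language,
        "exportFormat": "xml",
        # Query values must be strings; the reply above says the default is false.
        "xml:writeFormatting": "true" if write_formatting else "false",
    }
    # safe=":" keeps the colon in "xml:writeFormatting" literal instead of "%3A".
    return BASE_URL + "?" + urlencode(params, safe=":")


print(build_process_image_url("English"))
```

Keeping the colon unescaped is only a readability choice here, so that the generated URL looks like the ones posted in this thread.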

  • Vishnu Vardhan

    Hi,

    I'm trying to use the xml:writeFormatting parameter to get paragraph or line format information, but I came across a style attribute in the par and formatting tags. What does it represent?

    Example:

     <par align="Justified" style="{FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF}">   

       <line>..

           <formatting lang="EnglishUnitedStates" ff="Arial" fs="10." underline="1" style="{99A6515E-DE65-4325-9F9D-99D8674C0010}">

       ..

    ..

       </line>

    </par>

     

     

    Thanks,

    Vishnu

  • Vishnu Vardhan

    I can see that some italic attributes were detected, but I am unable to find any bold attribute in the formatting/charParams tags of the XML response, even though the uploaded documents (clear ones) contain bold words.

  • Helen Osetrova

    Hi Vishnu,

     

    The information about the font attributes can be found under the fontStyle tag of the output XML document. If there are no bold attributes in the output XML, it means that the text of the source document has not been treated as bold.

     

    For more specific recommendations, could you post here the source document?
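
As an illustration of checking the response for styled text, the sketch below scans `<formatting>` elements and reports which style flags are set. The sample XML is hand-written and modeled on the fragment posted earlier in this thread; `underline="1"` comes from that fragment, while the `bold` and `italic` attribute names are assumed by analogy and should be checked against real output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample modeled on the fragment posted in this thread.
# bold="1" and italic="1" are assumed attribute names (by analogy with underline="1").
SAMPLE_XML = """
<document>
  <par align="Justified">
    <line>
      <formatting lang="EnglishUnitedStates" ff="Arial" fs="10." underline="1">Underlined</formatting>
      <formatting lang="EnglishUnitedStates" ff="Arial" fs="10." bold="1">Heavy</formatting>
      <formatting lang="EnglishUnitedStates" ff="Arial" fs="10." italic="1">Slanted</formatting>
    </line>
  </par>
</document>
"""

TRUE_VALUES = {"1", "true"}


def local_name(tag):
    """Drop a '{namespace}' prefix so lookups work with or without xmlns."""
    return tag.rsplit("}", 1)[-1]


def styled_runs(xml_text):
    """Yield (text, styles) for every <formatting> element in the document."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) != "formatting":
            continue
        styles = {
            name
            for name in ("bold", "italic", "underline")
            if element.get(name, "").lower() in TRUE_VALUES
        }
        # itertext() also collects text nested in child tags such as charParams.
        text = "".join(element.itertext()).strip()
        yield text, styles


for text, styles in styled_runs(SAMPLE_XML):
    print(text, sorted(styles))
```

If no run ever reports a bold flag for a document that visibly contains bold words, that matches the situation described above: the engine did not treat the text as bold.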

     

  • Vishnu Vardhan

    Hi Helen,

    Here is a sample image in which only italic is detected.

    However, I was able to retrieve bold words when I tried what Oksana suggested here.

    Is it possible to get formatting information (bold, etc.) using the "text detection" profile itself?

    Thanks,

    Vishnu

  • Helen Osetrova

    Hi Vishnu,

     

    It is possible to get the information about formatting attributes using the textExtraction profile. For your document, we also suggest using the imageSource=scanner option, so the request to the server will look as follows:

    string url = "http://cloud.ocrsdk.com/processImage?profile=textExtraction&imageSource=scanner" +
                 "&exportFormat=xml&xml:writeFormatting=true";

     

    Please find attached the XML file obtained using these settings.
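
For anyone working in Python rather than C#, a rough sketch of preparing the same request is shown below. It only builds the request object and does not send it. The use of HTTP Basic authentication, the Content-Type value, and the names `prepare_request`, `app_id` and `password` are assumptions made for illustration; check the SDK documentation for the exact authentication scheme.

```python
import base64
import urllib.request

# URL from the reply above, joined into a single string.
URL = (
    "http://cloud.ocrsdk.com/processImage"
    "?profile=textExtraction&imageSource=scanner"
    "&exportFormat=xml&xml:writeFormatting=true"
)


def prepare_request(image_bytes, app_id, password):
    """Build (but do not send) a POST request carrying the image bytes."""
    # Assumption: HTTP Basic authentication; app_id and password are placeholders.
    token = base64.b64encode(f"{app_id}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(URL, data=image_bytes, method="POST")
    request.add_header("Authorization", "Basic " + token)
    # Assumption: the image is sent as a raw binary body.
    request.add_header("Content-Type", "application/octet-stream")
    return request


request = prepare_request(b"fake image bytes", "my-app-id", "my-password")
print(request.get_method(), request.full_url)
```

Sending it would be a matter of passing the prepared object to `urllib.request.urlopen`, which is left out here so the sketch runs without network access.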

     

  • Vishnu Vardhan

    Hi Helen,

    The documentation says that the "auto" mode can detect the imageSource of the document automatically, but it sometimes treats a scanned image as a photo/captured image.

    Thanks,

    Vishnu

  • Helen Osetrova

    Hi Vishnu,

     

     

    Cloud OCR SDK is designed on the assumption that most users upload photographed documents. For this reason, with the imageSource=auto parameter Cloud OCR SDK sometimes treats scanned documents as photos. To avoid this behavior, please apply the imageSource=scanner setting.

     

     

    Hope this information will be helpful!

  • Vishnu Vardhan

    Hi Helen,

    I see a lineSpacing attribute in some paragraph tags, like


    <par lineSpacing="3600" style="{FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF}">

    What does 3600 represent? I would also like to know how it can be useful if it is not available for every par/line tag.

    Thanks,

    Vishnu

  • Helen Osetrova

    Hi Vishnu,

     

    The lineSpacing attribute represents the space between two lines in the paragraph. Please see the descriptions of the main XML tags used in Cloud OCR SDK in the Output XML document article.

     
