Output XML document

The processImage and processDocument methods can return recognized text in XML format when the exportFormat parameter is set to xml or xmlForCorrectedImage. The XML output contains the recognized text together with its structure and formatting parameters.

You can find the description of the main tags of this XML file in the table below. See also the XML schema of an XML document.

Name Description
document The root tag. Represents a recognized document. Contains a sequence of page elements and a documentData element. The tag has the following attributes:
  • version — XML version
  • producer — the producer of the XML file
  • languages — (optional) all languages of the document
page Recognized page. It is a sequence of block tags. The tag can have the following attributes:
  • width — the image width in pixels
  • height — the image height in pixels
  • resolution — the image resolution in pixels per inch
  • originalCoords — (optional) if the value is true, all coordinates are relative to the original image before opening (this is the case if you set exportFormat to xml); if it is false, they are relative to the opened (deskewed) image (this is the case if you set exportFormat to xmlForCorrectedImage)
  • rotation — (optional) the type of rotation applied to the original page image. Can have one of the following values: Normal, RotatedClockwise, RotatedUpsidedown, RotatedCounterclockwise (the default value is Normal)
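
The page attributes above can be read with any XML parser. The following is a minimal sketch in Python, assuming a hand-written sample document without an XML namespace (real output may declare one, in which case the tag names must be namespace-qualified). The sample values and the page_info helper are illustrative and not part of the service.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; attribute values are made up for illustration.
SAMPLE = """
<document version="1.0" producer="example" languages="English">
  <page width="2480" height="3508" resolution="300" originalCoords="true">
  </page>
  <page width="3508" height="2480" resolution="300" rotation="RotatedClockwise">
  </page>
</document>
"""

def page_info(page):
    """Collect page attributes, applying the documented default for rotation."""
    return {
        "width": int(page.get("width")),
        "height": int(page.get("height")),
        "resolution": int(page.get("resolution")),
        # originalCoords is optional and has no documented default,
        # so it stays None when the attribute is absent.
        "originalCoords": page.get("originalCoords"),
        # rotation defaults to Normal when omitted.
        "rotation": page.get("rotation", "Normal"),
    }

root = ET.fromstring(SAMPLE)
pages = [page_info(p) for p in root.iter("page")]
for info in pages:
    print(info)
```

Dividing width or height by resolution gives the page size in inches, since resolution is expressed in pixels per inch.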
block
(BlockType)

Recognized block. Each such tag includes the region element, which specifies the region of the block on an image.

The tag has the blockType attribute, which denotes the type of the block: Text, Table, Picture, Barcode, Separator, SeparatorsBox. The value of this attribute defines which elements the tag includes:

  • text — available only if blockType attribute is Text
  • row — available only if blockType attribute is Table
  • separatorsBox — available only if blockType attribute is SeparatorsBox
  • separator — available only if blockType attribute is Separator
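
Because the blockType attribute determines which child elements a block contains, a consumer usually dispatches on it. The sketch below assumes a namespace-free, hand-written sample page; the CONTENT_TAG mapping and the content_elements helper are hypothetical names introduced here for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample page with four blocks.
SAMPLE = """
<page width="100" height="100" resolution="300">
  <block blockType="Text"><region><rect l="0" t="0" r="10" b="10"/></region><text/></block>
  <block blockType="Table"><region><rect l="0" t="20" r="10" b="30"/></region><row/></block>
  <block blockType="Picture"><region><rect l="0" t="40" r="10" b="50"/></region></block>
  <block blockType="Text"><region><rect l="0" t="60" r="10" b="70"/></region><text/></block>
</page>
"""

# Child element listed for each block type in the description above.
CONTENT_TAG = {
    "Text": "text",
    "Table": "row",
    "SeparatorsBox": "separatorsBox",
    "Separator": "separator",
}

def content_elements(block):
    """Return the type-specific children of a block, or [] if none are listed."""
    tag = CONTENT_TAG.get(block.get("blockType"))
    return block.findall(tag) if tag else []

page = ET.fromstring(SAMPLE)
blocks = list(page.iter("block"))
counts = Counter(b.get("blockType") for b in blocks)
print(counts)
```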
region Block region, a set of rectangles. Includes one or several rect elements.
rect

Rectangle of a block region.

The tag has the following attributes:

  • l — the coordinate of the left border of the rectangle
  • t — the coordinate of the top border of the rectangle
  • r — the coordinate of the right border of the rectangle
  • b — the coordinate of the bottom border of the rectangle
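
Since a region may consist of several rectangles, it is often convenient to reduce it to a single bounding box. The following sketch assumes integer coordinates and a hand-written sample region; region_bounds is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical region made of two stacked rectangles.
SAMPLE = """
<region>
  <rect l="100" t="200" r="500" b="260"/>
  <rect l="100" t="260" r="350" b="320"/>
</region>
"""

def region_bounds(region):
    """Return (left, top, right, bottom) enclosing every rect of the region."""
    rects = [
        tuple(int(rect.get(side)) for side in ("l", "t", "r", "b"))
        for rect in region.findall("rect")
    ]
    if not rects:
        raise ValueError("region contains no rect elements")
    left = min(r[0] for r in rects)
    top = min(r[1] for r in rects)
    right = max(r[2] for r in rects)
    bottom = max(r[3] for r in rects)
    return left, top, right, bottom

bounds = region_bounds(ET.fromstring(SAMPLE))
print(bounds)
```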
text
(TextType)
Text of a recognized text block or text of a table cell. Contains par elements.

The tag can have the following attributes:

  • orientation — (optional) the text orientation. Can have one of the following values: Normal, RotatedClockwise, RotatedUpsidedown, RotatedCounterclockwise (the default value is Normal)
  • mirrored — (optional) specifies if the text is mirrored (the default value is false)
  • inverted — (optional) specifies if the text is inverted (the default value is false)
par
(ParagraphType)
Paragraph of a recognized text. Contains line elements.

The tag can have the following attributes:

  • dropCapCharsCount — (optional) the number of drop caps in the paragraph (the default value is 0)
  • dropCap-l — (optional) the left coordinate of the drop cap rectangle
  • dropCap-t — (optional) the top coordinate of the drop cap rectangle
  • dropCap-r — (optional) the right coordinate of the drop cap rectangle
  • dropCap-b — (optional) the bottom coordinate of the drop cap rectangle
  • align — (optional) the paragraph alignment. Can have one of the following values: Left, Center, Right, Justified (the default value is Left)
  • leftIndent — (optional) the left paragraph indent (the default value is 0)
  • rightIndent — (optional) the right paragraph indent (the default value is 0)
  • startIndent — (optional) the indent of the first line of the paragraph (the default value is 0)
  • lineSpacing — (optional) the spacing between lines (the default value is 0)
line
(LineType)
Line of a paragraph. Contains formatting elements.

The tag has the following attributes:

  • baseline — the distance from the base line to the top edge of the page
  • l — the coordinate of the left border of the surrounding rectangle
  • t — the coordinate of the top border of the surrounding rectangle
  • r — the coordinate of the right border of the surrounding rectangle
  • b — the coordinate of the bottom border of the surrounding rectangle
formatting
(FormattingType)
Group of characters with uniform formatting. It is a group of charParams elements.

It has the lang attribute, which specifies the name of the language used for recognition.

charParams
(CharParamsType)
Attributes of a single character. The tag can include a charRecVariants element (if the xml:writeRecognitionVariants parameter of a processing method has been set to true).

The tag can have the following attributes:

  • l — the coordinate of the left border of the character rectangle
  • t — the coordinate of the top border of the character rectangle
  • r — the coordinate of the right border of the character rectangle
  • b — the coordinate of the bottom border of the character rectangle
  • suspicious — (optional) this property set to true means that the character was recognized uncertainly
  • isTab — (optional) this property set to true means that the character is a tab
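
The text of a line can be rebuilt by walking its charParams elements. The sketch below assumes that the recognized character is stored as the text content of each charParams element (this is not stated in the table above), that boolean attributes are written as true or false, and that the sample has no XML namespace. The lang value and the line_text helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical line: "A", a tab, and an uncertainly recognized "b".
SAMPLE = """
<line baseline="120" l="10" t="100" r="90" b="125">
  <formatting lang="EnglishUnitedStates">
    <charParams l="10" t="100" r="20" b="120">A</charParams>
    <charParams l="22" t="100" r="60" b="120" isTab="true"></charParams>
    <charParams l="62" t="100" r="72" b="120" suspicious="true">b</charParams>
  </formatting>
</line>
"""

def line_text(line):
    """Return the line text and the indexes of suspicious characters."""
    chars = []
    suspicious = []
    for index, cp in enumerate(line.iter("charParams")):
        if cp.get("isTab") == "true":
            chars.append("\t")
        else:
            # Assumption: the character is the element's text content.
            chars.append(cp.text or "")
        if cp.get("suspicious") == "true":
            suspicious.append(index)
    return "".join(chars), suspicious

text, suspicious = line_text(ET.fromstring(SAMPLE))
print(repr(text), suspicious)
```

The suspicious indexes can be used to highlight characters for manual verification.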
charRecVariants

Variants of a character recognition (available only if the xml:writeRecognitionVariants parameter of a processing method has been set to true). Contains charRecVariant elements. Has no attributes.

charRecVariant
(CharRecognitionVariant)

A variant of a character recognition (available only if the xml:writeRecognitionVariants parameter of a processing method has been set to true).

The tag can have the following attributes:

  • charConfidence — the estimate of probability that this recognition variant is correct
  • serifProbability — the estimate of probability that this character is written in a Serif font
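
When recognition variants are written, they can be ranked by charConfidence. This is a sketch under two assumptions that the table above does not confirm: each charRecVariant stores its candidate character as text content, and charConfidence is a number where a larger value means a more likely variant. The sample values and the ranked_variants helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical character with three recognition variants.
SAMPLE = """
<charParams l="10" t="100" r="20" b="120" suspicious="true">
  <charRecVariants>
    <charRecVariant charConfidence="41" serifProbability="10">l</charRecVariant>
    <charRecVariant charConfidence="87" serifProbability="12">1</charRecVariant>
    <charRecVariant charConfidence="23" serifProbability="9">I</charRecVariant>
  </charRecVariants>
</charParams>
"""

def ranked_variants(char_params):
    """Return (confidence, character) pairs, most confident first."""
    variants = [
        (float(v.get("charConfidence")), v.text)
        for v in char_params.iter("charRecVariant")
    ]
    return sorted(variants, key=lambda item: item[0], reverse=True)

ranked = ranked_variants(ET.fromstring(SAMPLE))
print(ranked)
```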
row
(TableRowType)
Table row (available if blockType attribute is Table). Includes cell elements. Has no attributes.
cell Table cell (available if blockType attribute is Table). It is a sequence of text tags.

The tag can have the following attributes:

  • colSpan — (optional) column span
  • rowSpan — (optional) row span
  • align — (optional) the vertical alignment of the cell contents. Can have one of the following values: Top, Center, Bottom (the default value is Top)
  • picture — (optional) specifies if the cell contains only a picture (the default value is false)
  • leftBorder — (optional) the table cell left border type. Can have one of the following values: Absent, Unknown, White, Black (the default value is Black)
  • topBorder — (optional) the table cell top border type. Can have one of the following values: Absent, Unknown, White, Black (the default value is Black)
  • rightBorder — (optional) the table cell right border type. Can have one of the following values: Absent, Unknown, White, Black (the default value is Black)
  • bottomBorder — (optional) the table cell bottom border type. Can have one of the following values: Absent, Unknown, White, Black (the default value is Black)
  • width — the width of the cell
  • height — the height of the cell
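
Cells with colSpan and rowSpan can be expanded into a rectangular grid. The sketch below assumes HTML-like semantics that the table above does not spell out: a missing span attribute means 1, and a spanning cell appears only once, at its top-left position. The sample and the table_grid helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical 2x3 table: the first cell spans two rows,
# the second cell spans two columns.
SAMPLE = """
<block blockType="Table">
  <row>
    <cell width="100" height="40" rowSpan="2"><text/></cell>
    <cell width="200" height="20" colSpan="2"><text/></cell>
  </row>
  <row>
    <cell width="100" height="20"><text/></cell>
    <cell width="100" height="20"><text/></cell>
  </row>
</block>
"""

def table_grid(block):
    """Map every (row, column) position to the cell element covering it."""
    grid = {}
    for r, row in enumerate(block.findall("row")):
        c = 0
        for cell in row.findall("cell"):
            # Skip positions already taken by a rowSpan from an earlier row.
            while (r, c) in grid:
                c += 1
            col_span = int(cell.get("colSpan", "1"))  # assumed default
            row_span = int(cell.get("rowSpan", "1"))  # assumed default
            for dr in range(row_span):
                for dc in range(col_span):
                    grid[(r + dr, c + dc)] = cell
            c += col_span
    return grid

grid = table_grid(ET.fromstring(SAMPLE))
rows = 1 + max(r for r, _ in grid)
cols = 1 + max(c for _, c in grid)
print(rows, cols)
```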
separatorsBox Group of separators (available if blockType attribute is SeparatorsBox). It is a sequence of separator tags. Has no attributes.
separator
(SeparatorBlockType)
Single separator (available if blockType attribute is Separator) or separator in a group of separators. Includes start and end elements. Has the following attributes:
  • thickness — specifies the precise width of the separator in pixels
  • type — specifies the type of the separator. Can have one of the following values: Unknown, Black, Dotted
start
(Point type)
Start point of a separator. Has the following attributes:
  • x — specifies the horizontal coordinate of the start point of separator
  • y — the vertical coordinate of the start point of separator
end
(Point type)
End point of a separator. Has the following attributes:
  • x — specifies the horizontal coordinate of the end point of separator
  • y — the vertical coordinate of the end point of separator
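
The start and end points are enough to derive the length and the rough orientation of a separator. This is a minimal sketch assuming integer coordinates and a namespace-free sample; separator_geometry is a hypothetical helper.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical horizontal separator.
SAMPLE = """
<separator thickness="3" type="Black">
  <start x="100" y="500"/>
  <end x="900" y="500"/>
</separator>
"""

def separator_geometry(separator):
    """Return (length, orientation) computed from the start and end points."""
    start = separator.find("start")
    end = separator.find("end")
    dx = int(end.get("x")) - int(start.get("x"))
    dy = int(end.get("y")) - int(start.get("y"))
    length = math.hypot(dx, dy)
    orientation = "horizontal" if abs(dx) >= abs(dy) else "vertical"
    return length, orientation

length, orientation = separator_geometry(ET.fromstring(SAMPLE))
print(length, orientation)
```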
documentData General formatting properties and document structure. Contains paragraphStyles and sections elements.
paragraphStyles Paragraph formatting styles. Contains a sequence of paragraphStyle elements.

paragraphStyle
(ParagraphStyleType)

Formatting style for one paragraph. Includes a fontStyle element. Has the following attributes:

  • id — the identifier of the paragraph
  • name — the name of the paragraph style
  • mainFontStyleId — the main font style of the paragraph
  • role — the paragraph role. Can have one of the following values: text, tableText, heading, tableHeading, pictureCaption, tableCaption, contents (table of contents), footnote, endnote, rt (running title), garb (garbage), other, barcode, headingNumber
  • roleLevel — (optional) the level of the paragraph role (the default value is -1, which means that the level is not available for this role)
  • align — paragraph alignment. It can be one of the following values: Left, Center, Right, Justified, CjkJustified, ThaiJustified
  • before — (optional) space before the paragraph of this style (the default value is 0)
  • after — (optional) space after the paragraph of this style (the default value is 0)
  • startIndent — (optional) indent of the first line of the paragraph
  • leftIndent — (optional) left indent of the whole paragraph
  • rightIndent — (optional) right indent of the whole paragraph
  • lineSpacing — (optional) line spacing
  • lineSpacingRatio — (optional) line spacing (proportional to the letter height)
  • fixedLineSpacing — (optional) if true, the line spacing in the paragraph does not vary
fontStyle
(FontStyleType)

Font style. Has the following attributes:

  • id — the identifier of the font style
  • baseFont — (optional)
  • italic — (optional) if true, the font is italic
  • bold — (optional) if true, the font is bold
  • underline — (optional) if true, the font is underlined
  • strikeout — (optional) if true, the font is struck through
  • smallcaps — (optional) if true, the font is small caps
  • scaling — (optional) the scaling of the font (the default value is 1000)
  • spacing — (optional) the character spacing (the default value is 0)
  • color — (optional) the color of the font (the default value is 0)
  • backgroundColor — (optional) the background color (the default value is 0)
  • ff — the name of the font
  • fs — the size of the font
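
Paragraph styles and their font styles can be indexed by id so that the main font of each style is easy to look up. The sketch below assumes a namespace-free, hand-written documentData fragment with integer spacing values; the style names, ids, and the summarize_style helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with one heading style and one body style.
SAMPLE = """
<documentData>
  <paragraphStyles>
    <paragraphStyle id="ps1" name="Heading 1" mainFontStyleId="fs1"
                    role="heading" roleLevel="1" align="Left" before="12">
      <fontStyle id="fs1" bold="true" ff="Arial" fs="18"/>
    </paragraphStyle>
    <paragraphStyle id="ps2" name="Body" mainFontStyleId="fs2"
                    role="text" align="Justified">
      <fontStyle id="fs2" ff="Times New Roman" fs="11"/>
    </paragraphStyle>
  </paragraphStyles>
</documentData>
"""

def summarize_style(style):
    """Collect style attributes, applying the documented defaults."""
    fonts = {f.get("id"): f for f in style.findall("fontStyle")}
    main_font = fonts.get(style.get("mainFontStyleId"))
    return {
        "name": style.get("name"),
        "role": style.get("role"),
        "roleLevel": int(style.get("roleLevel", "-1")),  # -1: not available
        "before": int(style.get("before", "0")),
        "after": int(style.get("after", "0")),
        "font": main_font.get("ff") if main_font is not None else None,
        "bold": main_font is not None and main_font.get("bold") == "true",
    }

data = ET.fromstring(SAMPLE)
styles = {s.get("id"): summarize_style(s) for s in data.iter("paragraphStyle")}
headings = [s["name"] for s in styles.values() if s["role"] == "heading"]
print(headings)
```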
sections Contains a sequence of section elements.
section
(SectionType)
A document section. Includes a stream element.
stream
(TextStreamType)

A sequence of paragraphs and blocks. Includes mainText and elemId elements. Has the following attributes:

  • role — (optional) the stream role. It can be one of the following values: garb, text, footnote, incut (the default value is text)
  • vertCjk — (optional) if true, the stream contains vertical CJK text
  • beginPage — the number of page on which the stream begins
  • endPage — (optional) the number of page on which the stream ends
mainText

The text of the stream. Has the following attributes:

  • rtl — (optional) if true, the text is a running title
  • columnCount — the number of columns
elemId

The element's identifier. Has the following attribute:

  • id — string ID of the element

Comments

5 comments

  • György Görög: Hi there, what are the units? A line height = 920, what does it mean? Thanks.
  • György Görög: Nothing about wordRecVariants.
  • György Görög: A couple of months ago there was a WordInDictionary boolean that seems to be missing now.
  • Vijay Sankaran: Can the XML output include word objects along with their coordinates (just like you have the line object in the above article) that have been found by the analysis? If not, could you please add this to your feature Roadmap for Fine Reader Engine for Linux?
  • Ivan Phoon: Can u share a sample XML output with a few converted text?
