ABBYY FineReader Engine & XML Export

ABBYY FineReader Engine also offers a native XML format for exporting recognized document pages.
The XML export supports several options. The available settings for character attributes are listed below:
  • XCA_None
    No character attributes are to be written in files in XML format.
  • XCA_Ascii
    Character coordinates and character confidence are to be written in files in XML format.
  • XCA_Basic
    Character coordinates are to be written in files in XML format.
  • XCA_Extended
    Character coordinates, character confidence and extended character attributes are to be written in files in XML format. The following extended attributes are written:
      • whether the word was found in the dictionary,
      • whether the word was recognized with a standard or user-defined language,
      • whether the word is a number,
      • whether the word is an identifier,
      • probability that a character is written with a Serif font,
      • penalty for discordance of characters in a word,
      • the mean width of stroke in the RLE representation of a word image.
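
As a minimal sketch of how these attributes might be consumed, the Python snippet below reads character elements from an export and collects each character with its coordinates and, when present, its confidence. The element and attribute names (`charParams`, `l`, `t`, `r`, `b`, `charConfidence`, `wordFromDictionary`) and the namespace URI are assumptions based on the FineReader 10 XML schema and may differ in your Engine version. The embedded sample document is hand-written for illustration and is not real Engine output.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the FineReader 10 XML schema; check the root
# element of your own export and adjust if it differs.
NS = {"fr": "http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml"}

# Hand-written sample for illustration only, not real Engine output.
SAMPLE_XML = """<document xmlns="http://www.abbyy.com/FineReader_xml/FineReader10-schema-v1.xml">
  <page width="200" height="50">
    <block blockType="Text">
      <text><par><line><formatting lang="EnglishUnitedStates">
        <charParams l="10" t="5" r="20" b="30" charConfidence="98" wordFromDictionary="1">H</charParams>
        <charParams l="22" t="12" r="30" b="30" charConfidence="71" wordFromDictionary="1">a</charParams>
      </formatting></line></par></text>
    </block>
  </page>
</document>"""


def read_characters(xml_text):
    """Return one dict per charParams element found in the XML text."""
    root = ET.fromstring(xml_text)
    chars = []
    for cp in root.iterfind(".//fr:charParams", NS):
        coords = [cp.get(key) for key in ("l", "t", "r", "b")]
        # Coordinates or confidence may be absent, depending on the
        # character attribute setting used for the export.
        box = tuple(int(v) for v in coords) if all(v is not None for v in coords) else None
        conf = cp.get("charConfidence")
        chars.append({
            "char": cp.text or "",
            "box": box,
            "confidence": int(conf) if conf is not None else None,
            "in_dictionary": cp.get("wordFromDictionary"),
        })
    return chars


if __name__ == "__main__":
    for entry in read_characters(SAMPLE_XML):
        print(entry)
```

Returning `None` for missing values rather than raising keeps the same reader usable across exports written with different character attribute settings.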



ABBYY XML Tag Scheme

In FineReader Engine, the XML export can also store information about paragraph styles and roles in the XML file.



Simple ABBYY XML Sample


To demonstrate the differences, a sample image with the text 'Hallo World' was processed with ABBYY FineReader Server using the different XML export settings:


Processing the image with different options demonstrates the basic structure of the native ABBYY XML export. You can download a ZIP with the original TIFF file and the 5 different XML results here (in zip format).
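
One way to compare such exports is to list which attributes each file writes on its character elements. The sketch below assumes a ZIP archive containing `.xml` files and assumes that character elements are named `charParams`. The archive built in the demo and its file names are hypothetical placeholders created in memory; they are not the contents of the download mentioned above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def char_attributes_per_file(zip_bytes):
    """Map each .xml member of a ZIP to the sorted attribute names
    found on its charParams elements (assumed element name)."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip the source image and other files
            root = ET.fromstring(archive.read(name))
            attrs = set()
            for elem in root.iter():
                if local_name(elem.tag) == "charParams":
                    attrs.update(elem.attrib)  # adds the attribute names
            result[name] = sorted(attrs)
    return result


def build_demo_zip():
    """Build a tiny in-memory ZIP with two made-up exports."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(
            "coords_only.xml",
            '<document><charParams l="1" t="2" r="3" b="4">H</charParams></document>',
        )
        archive.writestr(
            "with_confidence.xml",
            '<document><charParams l="1" t="2" r="3" b="4" charConfidence="90">H</charParams></document>',
        )
        archive.writestr("source.tif", b"placeholder, not a real image")
    return buf.getvalue()


if __name__ == "__main__":
    for name, attrs in char_attributes_per_file(build_demo_zip()).items():
        print(name, attrs)
```

Matching on the local tag name avoids hard-coding a schema namespace, which is convenient when the exports come from different product versions.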


XML Sample:


XML Character Attributes:


Extended XML Character Attributes:



Extended ABBYY XML Sample

This ZIP archive (1.3 MB) contains the processing results and the source image.



Zip content:


