What export format outputs words with their bounding boxes?

Question

I want the words and their coordinates to be written in the output file. Which export format should I choose?

Answer

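Export to the ALTO format. In ALTO XML, each recognized word is written as a String element: the CONTENT attribute holds the word text, and the HPOS, VPOS, WIDTH and HEIGHT attributes give the position and size of its bounding box. For example: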

<TextBlock ID="Page1_Block1" HEIGHT="43" WIDTH="499" VPOS="150" HPOS="1879" language="en-US">
  <TextLine HEIGHT="31" WIDTH="487" VPOS="156" HPOS="1885">
    <String WC="1." CONTENT="ABBYY" HEIGHT="30" WIDTH="145" VPOS="156" HPOS="1885"/>
    <SP WIDTH="13" VPOS="156" HPOS="2031"/>
    <String WC="0.98600000143051147" CONTENT="FineReader" HEIGHT="31" WIDTH="224" VPOS="156" HPOS="2045"/>
    <SP WIDTH="11" VPOS="156" HPOS="2270"/>
    <String WC="0.72333335876464844" CONTENT="OCR" HEIGHT="31" WIDTH="90" VPOS="156" HPOS="2282"/>
  </TextLine>
</TextBlock>
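
Once the file is exported, the word boxes can be read back with any XML parser. The following is a minimal sketch in Python using only the standard library; the function name extract_words and the embedded SAMPLE string are illustrative and not part of any ABBYY product. It assumes the attributes are numeric, and it ignores any XML namespace, because complete ALTO files usually declare one on the root element.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the example above (hypothetical constant name).
SAMPLE = """<TextBlock ID="Page1_Block1" HEIGHT="43" WIDTH="499" VPOS="150" HPOS="1879" language="en-US">
<TextLine HEIGHT="31" WIDTH="487" VPOS="156" HPOS="1885">
<String WC="1." CONTENT="ABBYY" HEIGHT="30" WIDTH="145" VPOS="156" HPOS="1885"/>
<SP WIDTH="13" VPOS="156" HPOS="2031"/>
<String WC="0.98600000143051147" CONTENT="FineReader" HEIGHT="31" WIDTH="224" VPOS="156" HPOS="2045"/>
<SP WIDTH="11" VPOS="156" HPOS="2270"/>
<String WC="0.72333335876464844" CONTENT="OCR" HEIGHT="31" WIDTH="90" VPOS="156" HPOS="2282"/>
</TextLine>
</TextBlock>"""


def extract_words(alto_xml):
    """Return a list of (text, left, top, right, bottom) tuples, one per word."""
    root = ET.fromstring(alto_xml)
    words = []
    for element in root.iter():
        # Drop a "{namespace}" prefix if present, then keep only String elements.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        left = float(element.get("HPOS"))
        top = float(element.get("VPOS"))
        right = left + float(element.get("WIDTH"))
        bottom = top + float(element.get("HEIGHT"))
        words.append((element.get("CONTENT"), left, top, right, bottom))
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE):
        print(word)
```

The coordinates are returned in whatever measurement unit the ALTO file declares; the sketch does not convert them.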

You can also get the bounding boxes of words through the API, using the Region property of the Word object.
