individual word coordinates in xml

Written by Permanently deleted user

April 29, 2018 23:04

Is there a way to get individual word coordinates in the XML output of processImage()? Currently I have coordinates for lines, but I need a bounding box for each word.

Thank you!


Comments

4 comments

Permanently deleted user

May 01, 2018 12:20
Hi,

Unfortunately, Export to XML has no such feature. However, you can use one of the following workarounds:

Extract the Region object from each Word in the Paragraph::Words collection. The Region object stores the coordinates of the word's area. The Paragraph itself can be obtained from Page::Layout::LayoutBlocks::Block, handled separately for each type of Block.

Calculate the word coordinates from the coordinates of its characters. These can be obtained in the XML output after setting XMLExportParams::WriteCharAttributes = XCA_Basic.
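
For the second workaround, here is a minimal Python sketch of merging character boxes into word boxes. It assumes the exported XML has line elements containing charParams elements with l, t, r, b attributes, and that a whitespace character marks a word boundary. The sample XML and the function names are made up for illustration, so please check them against your actual export.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real export has more nesting and attributes.
SAMPLE = """<page>
  <line l="10" t="20" r="66" b="42">
    <charParams l="10" t="20" r="18" b="40">H</charParams>
    <charParams l="19" t="22" r="26" b="40">i</charParams>
    <charParams l="27" t="20" r="33" b="40"> </charParams>
    <charParams l="34" t="21" r="44" b="40">y</charParams>
    <charParams l="45" t="21" r="55" b="42">o</charParams>
    <charParams l="56" t="21" r="66" b="40">u</charParams>
  </line>
</page>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}line' -> 'line'."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(xml_text):
    """Return a list of (word, left, top, right, bottom) tuples."""
    root = ET.fromstring(xml_text)
    words = []
    for line in root.iter():
        if local_name(line.tag) != "line":
            continue
        current = None  # [text, l, t, r, b] of the word being built
        for ch in line.iter():
            if local_name(ch.tag) != "charParams":
                continue
            text = ch.text or ""
            if not text.strip():
                # Whitespace closes the current word (assumption).
                if current:
                    words.append(tuple(current))
                current = None
                continue
            l, t, r, b = (int(ch.get(k)) for k in "ltrb")
            if current is None:
                current = [text, l, t, r, b]
            else:
                # Grow the word box to cover the new character.
                current[0] += text
                current[1] = min(current[1], l)
                current[2] = min(current[2], t)
                current[3] = max(current[3], r)
                current[4] = max(current[4], b)
        if current:
            words.append(tuple(current))  # a word never spans two lines here
    return words


print(word_boxes(SAMPLE))
```

Words are closed at the end of every line, so a word hyphenated across two lines comes out as two boxes.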
Csaba Hajnal

May 24, 2019 07:46
Tigran, try the ALTO XML export; it contains word-level information.
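
In case it helps, here is a rough Python sketch of reading word boxes from an ALTO file. In ALTO each word is a String element with CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes. The sample document and the function name are only illustrative, and the namespace version in your file may differ, which is why the code matches on the local tag name.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal ALTO document for illustration.
ALTO_SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" HPOS="100" VPOS="200" WIDTH="50" HEIGHT="12"/>
    <SP HPOS="150" VPOS="200" WIDTH="5"/>
    <String CONTENT="world" HPOS="155" VPOS="200" WIDTH="52" HEIGHT="12"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""


def alto_word_boxes(xml_text):
    """Return (word, left, top, right, bottom) for every String element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for el in root.iter():
        # Ignore the namespace so any ALTO version is accepted.
        if el.tag.rsplit("}", 1)[-1] != "String":
            continue
        x = float(el.get("HPOS"))
        y = float(el.get("VPOS"))
        w = float(el.get("WIDTH"))
        h = float(el.get("HEIGHT"))
        boxes.append((el.get("CONTENT"), x, y, x + w, y + h))
    return boxes


for box in alto_word_boxes(ALTO_SAMPLE):
    print(box)
```

The values are in whatever unit the file's MeasurementUnit element declares, so they may need converting before you compare them with pixel coordinates.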

Csaba Hajnal

May 24, 2019 12:48
I don't quite understand. Paragraph::Words::Word doesn't have any Region data. (I use FineReader Engine SDK v11.)

Permanently deleted user

May 24, 2019 18:28
Hi Tigran,

If you happen to be working with Python, you can do this with PDFMiner:

Python 3:
https://github.com/pdfminer/pdfminer.six

Python 2:
https://pypi.org/project/pdfminer/
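
A rough sketch of how this could look with pdfminer.six (Python 3), assuming the input is a PDF with a text layer. The helper names group_chars and pdf_word_boxes are made up for this example. PDFMiner reports characters, not words, so the words are rebuilt here by splitting on whitespace.

```python
def group_chars(chars):
    """Merge (text, bbox) pairs into (word, bbox) pairs.

    bbox is (x0, y0, x1, y1); a bbox of None or a whitespace
    character ends the current word.
    """
    words, text, box = [], "", None
    for ch, bbox in chars:
        if bbox is None or not ch.strip():
            if text:
                words.append((text, box))
            text, box = "", None
            continue
        text += ch
        if box is None:
            box = tuple(bbox)
        else:
            box = (min(box[0], bbox[0]), min(box[1], bbox[1]),
                   max(box[2], bbox[2]), max(box[3], bbox[3]))
    if text:
        words.append((text, box))
    return words


def pdf_word_boxes(path):
    """Return (page_number, word, bbox) for every word in the PDF."""
    # Imported here so group_chars can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    result = []
    for page_no, page in enumerate(extract_pages(path), start=1):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                # LTAnno objects (inserted spaces) carry no bbox.
                chars = [(obj.get_text(),
                          obj.bbox if isinstance(obj, LTChar) else None)
                         for obj in line]
                result.extend((page_no, w, b) for w, b in group_chars(chars))
    return result
```

Note that PDF coordinates have their origin at the bottom-left of the page, so the y values need flipping if you want to compare them with image coordinates.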

It seems that Apache PDFBox is also capable of doing so, although I have not tried that part of PDFBox:
https://stackoverflow.com/questions/33427686/getting-bounding-boxes-of-text-lines-from-a-pdf-using-pdfbox

We use PDFMiner and PDFBox alongside ABBYY FineReader.

Best regards
Koen de Leijer

