Question
How can I get the number of words in a recognized document?
Answer
You can use the Words object of each Paragraph object to calculate the word count.
Please note that a word contained in this object is an internal entity. It is not guaranteed to coincide with a word as understood in natural language, with a word as defined by a regular expression, or with a sequence of characters separated from other words by spaces. The main purpose of the Word object is to provide recognition variants for the word.
// Process the document before getting information about words
document.Process( null );
// Running total of Word objects across the whole document
int totalWordsCount = 0;
// Iterate through each page of the document
for (int i = 0; i < document.getPages().getCount(); i++)
{
    IFRPage frPage = document.getPages().getElement(i);
    // Iterate through each layout block of the page
    ILayoutBlocks layoutBlocks = frPage.getLayout().getBlocks();
    for (int currLayoutBlock = 0; currLayoutBlock < layoutBlocks.getCount(); currLayoutBlock++)
    {
        // Skip any layout block that is not a text block
        IBlock block = layoutBlocks.getElement(currLayoutBlock);
        if (block.getType() != BlockTypeEnum.BT_Text)
        {
            displayMessage("LayoutBlock #: " + currLayoutBlock + " is not a text block");
            continue;
        }
        IParagraphs paragraphs = block.GetAsTextBlock().getText().getParagraphs();
        // Iterate through each paragraph of the text block
        for (int currParagraph = 0; currParagraph < paragraphs.getCount(); currParagraph++)
        {
            IParagraph paragraph = paragraphs.getElement(currParagraph);
            // Add the number of Word objects in this paragraph to the total
            IWords words = paragraph.getWords();
            totalWordsCount += words.getCount();
        }
    }
}
displayMessage("Total number of words: " + totalWordsCount);
Java code snippet
Another possibility is to get the PlainText::Text property of the FRDocument object and implement a custom word-counting algorithm based, for example, on counting spaces.
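// Process the document before requesting its plain text
document.Process( null );
// Get the recognized text of the whole document as a single string
IPlainText plainText = document.getPlainText();
String documentText = plainText.getText();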
Java code snippet
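The custom word calculation mentioned above is left to the caller. As a minimal sketch, assuming that a "word" is simply any run of non-whitespace characters, the string returned by PlainText::Text could be counted as shown below. The class PlainTextWordCount and the method countWords are hypothetical helper names used for illustration only; they are not part of the FineReader Engine API, and the sample string stands in for real recognized text.

```java
// Hypothetical helper, not part of the FineReader Engine API.
final class PlainTextWordCount {

    // Counts whitespace-separated tokens in the given text.
    // Assumption: a "word" is any run of non-whitespace characters,
    // so punctuation attached to a token is counted as part of it.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        // "\\s+" treats consecutive spaces, tabs and line breaks as one separator
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // Stand-in for the string obtained from the document's plain text
        String documentText = "Hello,  recognized\nworld!";
        System.out.println(countWords(documentText)); // prints 3
    }
}
```

A count obtained this way will generally differ from the one based on the Words object, because the two approaches define a word differently.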