Number of words in a document


How can I get the number of words in a recognized document?


You can use the Words object of each Paragraph to calculate the word count.

Please note that a word contained in this object is an internal entity. It is not guaranteed to coincide with a word as understood in natural language, with a word as defined by a regular expression, or with a sequence of characters separated from other words by spaces. The main purpose of the Word object is to provide recognition variants for the word.


//Process the document before getting the information about words
document.Process( null );

//Total number of words found in all text blocks of the document
int totalWordCount = 0;

//Iterate through each Page of the Document
for (int i = 0; i < document.getPages().getCount(); i++) {
    IFRPage frPage = document.getPages().getElement(i);

    //Iterate through each LayoutBlock of the Page
    ILayoutBlocks LayoutBlocks = frPage.getLayout().getBlocks();
    for (int currLayoutBlock = 0; currLayoutBlock < LayoutBlocks.getCount(); currLayoutBlock++) {
        //Check if the LayoutBlock is Text type
        IBlock Block = LayoutBlocks.getElement(currLayoutBlock);
        if (Block.getType() != BlockTypeEnum.BT_Text) {
            displayMessage("LayoutBlock #: " + currLayoutBlock + " is not a text block");
            continue;
        }

        //Iterate through each Paragraph of the TextBlock
        IParagraphs Paragraphs = Block.GetAsTextBlock().getText().getParagraphs();
        for (int currParagraph = 0; currParagraph < Paragraphs.getCount(); currParagraph++) {
            IParagraph Paragraph = Paragraphs.getElement(currParagraph);
            IWords Words = Paragraph.getWords();
            //Add the word count of this Paragraph to the total
            totalWordCount += Words.getCount();
        }
    }
}

Java code snippet


Another option is to get the PlainText::Text property of the FRDocument object and apply a custom word-counting algorithm to it, for example one based on counting spaces.

document.Process( null );

IPlainText PlainText = document.getPlainText();
String DocumentText = PlainText.getText();

Java code snippet
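
A custom count over the plain text could look like the sketch below. It assumes that a word is any run of characters separated by whitespace (spaces, tabs, or line breaks), which is only one possible definition and will generally give a different number than the Words object. The class name PlainTextWordCounter and the method countWords are hypothetical helpers, not part of the engine API, and the sample string stands in for the value returned by PlainText.getText().

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the engine API.
final class PlainTextWordCounter {
    // One or more whitespace characters (spaces, tabs, line breaks) separate words.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Counts whitespace-separated tokens; returns 0 for null or blank text.
    static int countWords(String text) {
        if (text == null) {
            return 0;
        }
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return WHITESPACE.split(trimmed).length;
    }

    public static void main(String[] args) {
        // Stand-in for the string obtained from PlainText.getText().
        String documentText = "Invoice No. 42\nTotal amount:  100 USD";
        System.out.println(countWords(documentText));
    }
}
```

Under this definition, punctuation attached to a token (for example "No." or "amount:") is counted as part of the word, and consecutive spaces are treated as a single separator. If a different notion of a word is needed, only the pattern has to change.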
