Hi,
We are using the API to add confidence values to the ALTO XML output. The API provides access to the 'total number of characters' and the 'total number of uncertain characters'; however, we are not sure how to calculate CC (Character Confidence), WC (Word Confidence) and PC (Page Confidence) from these values, or in any other way in RS. There are some online resources (e.g., https://github.com/altoxml/schema/issues/23), but I am unsure how reliable they are.
I would appreciate any help you can provide.
Thanks!
Comments
Hello SAU,
To see how to get information about uncertain characters, you can use the following code sample.
Place the snippet into the Document separation script on the corresponding tab of your workflow settings.
// we will use the standard file system object to save our data
var fso = new ActiveXObject("Scripting.FileSystemObject");
// Please specify your location for the file and create the folder beforehand
var statisticFilePath = "C:\\Temp\\stat.txt";
// Open the statistics file; append mode (8) with create-if-missing (true) is assumed here
var statisticFile = fso.OpenTextFile(statisticFilePath, 8, true);
statisticFile.WriteLine("!!!!!! START !!!!!!");
statisticFile.WriteLine("File:" + this.InputFileProperties.FileName + " PageIndex:" + this.PageIndex);
// you can get info about count from Page statistics
statisticFile.WriteLine("Total characters count:" + this.Statistics.TotalCharacters + " Uncertain characters count:" + this.Statistics.UncertainCharacters);
//Let's try to calculate the statistics manually.
//Retrieve TEXT blocks from the RecognizedPage object
var blocks = this.TextBlocks;
var totalCharactersCount = 0;
var uncertainCharactersCount = 0;
//Iterate text blocks
for (var iBlock = 0; iBlock < blocks.Count; iBlock++)
{
    var block = blocks.Item(iBlock);
    //Obtain paragraphs from a text block
    var paragraphs = block.Paragraphs;
    //Iterate paragraphs
    for (var iPar = 0; iPar < paragraphs.Count; iPar++)
    {
        var paragraph = paragraphs.Item(iPar);
        //Obtain Words from paragraph
        for (var iWord = 0; iWord < paragraph.Words.Count; iWord++)
        {
            var word = paragraph.Words.Item(iWord);
            //Obtain Chars from Word
            for (var iChar = 0; iChar < word.Characters.Count; iChar++)
            {
                if (word.Characters.Item(iChar).IsSuspicious)
                {
                    uncertainCharactersCount++;
                }
                totalCharactersCount++;
            }
        }
    }
    //TODO: add the same logic for Table Blocks
}
statisticFile.WriteLine("Total characters count calculated for each word in TEXT blocks only:" + totalCharactersCount + " Uncertain characters count:" + uncertainCharactersCount);
statisticFile.Close();
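Once you have these counts, one possible way to turn them into ALTO attributes is sketched below. This is only an approximation under stated assumptions, not an official ABBYY or ALTO method: the script exposes only a yes/no flag per character (IsSuspicious), so PC and WC are estimated as the fraction of characters that are not flagged, and CC is written as one digit per character, using 0 for a certain character and 9 for a flagged one. The helper names (confidenceRatio, wordConfidence, characterConfidenceDigits) are hypothetical, and the reading of CC as "0 = sure, 9 = unsure" with no separator should be checked against the ALTO schema version you target.

```javascript
// Hypothetical helpers; none of these names come from the ABBYY or ALTO APIs.

// Fraction of characters NOT flagged as uncertain, in the range 0..1.
// Returns null when there are no characters so the caller can omit the attribute.
function confidenceRatio(totalCount, uncertainCount) {
    if (totalCount <= 0) {
        return null;
    }
    return (totalCount - uncertainCount) / totalCount;
}

// Estimated WC for one word, given an array of booleans
// (for example, collected from word.Characters.Item(i).IsSuspicious).
function wordConfidence(suspiciousFlags) {
    var uncertain = 0;
    for (var i = 0; i < suspiciousFlags.length; i++) {
        if (suspiciousFlags[i]) {
            uncertain++;
        }
    }
    return confidenceRatio(suspiciousFlags.length, uncertain);
}

// Estimated CC for one word: one digit per character.
// ASSUMPTION: 0 = sure, 9 = unsure, digits written without a separator.
function characterConfidenceDigits(suspiciousFlags) {
    var digits = [];
    for (var i = 0; i < suspiciousFlags.length; i++) {
        digits.push(suspiciousFlags[i] ? "9" : "0");
    }
    return digits.join("");
}

// Example with made-up values: a 4-character word whose second character is flagged,
// and a page with 200 characters of which 10 are uncertain.
var flags = [false, true, false, false];
var wc = wordConfidence(flags);             // 0.75
var cc = characterConfidenceDigits(flags);  // "0900"
var pc = confidenceRatio(200, 10);          // 0.95
```

For PC you could pass this.Statistics.TotalCharacters and this.Statistics.UncertainCharacters (or the manually counted totals) to confidenceRatio. Because the input is a binary flag, the resulting values are coarse; they indicate how much of the text the engine doubted rather than a true per-character probability.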