Setting Regex and Custom Dictionary for Different Regions in an Image in JAVA

Written by Permanently deleted user

February 21, 2018 10:30

I need to apply a regular expression to one region of an image and a custom dictionary to another region. For the regex region, I followed the section `How to attach a dictionary to a recognition language` in the user guide, as shown below, but it has no effect on the recognition result. Is the following code snippet correct?

// Build a custom text language that contains a single base language.
IRecognizerParams recognizerParams = engine.CreateRecognizerParams();
ILanguageDatabase languageDatabase = engine.CreateLanguageDatabase();
ITextLanguage textLanguage = languageDatabase.CreateTextLanguage();
IBaseLanguages baseLanguages = textLanguage.getBaseLanguages();
IBaseLanguage baseLanguage = baseLanguages.AddNew();
// Attach a regular-expression dictionary (dates such as 21-02-2018) to the base language.
IDictionaryDescriptions dictionaryDescriptions = baseLanguage.getDictionaryDescriptions();
IDictionaryDescription dictionaryDescription = dictionaryDescriptions.AddNew(DictionaryTypeEnum.DT_RegularExpression);
IRegExpDictionaryDescription regExpDictionaryDescription = dictionaryDescription.GetAsRegExpDictionaryDescription();
regExpDictionaryDescription.SetText("(((|0)[1-9])|([12][0-9])|(30)|(31))\\-(((|0)[1-9])|(10)|(11)|(12))\\-((((19)|(20))[0-9][0-9])|([0-9][0-9]))");
// Accept only words that come from the dictionary.
baseLanguage.setAllowWordsFromDictionaryOnly(true);
// baseLanguage.setLetterSet(type, result); // no idea what the result parameter should be
recognizerParams.setTextLanguage(textLanguage);
// Add one text block covering two rectangles (region is declared outside this snippet), then recognize.
region.AddRect(0, 100, 500, 125);
region.AddRect(0, 200, 500, 225);
document.getPages().getElement(0).getLayout().getBlocks().AddNew(BlockTypeEnum.BT_Text, region, 0);
document.Recognize( null, null );
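
Before attaching the pattern to the engine, it can help to confirm that the expression itself accepts the intended dates. The following is a minimal sketch that compiles the same pattern text with java.util.regex. It assumes that the engine's regular-expression dialect treats this particular pattern the same way Java does, which may not hold in general. DatePatternCheck and isDate are hypothetical names and are not part of the FineReader Engine API.

```java
import java.util.regex.Pattern;

// Hypothetical helper for checking the pattern outside the OCR engine.
final class DatePatternCheck {
    // Same pattern text as the string passed to SetText(...) in the snippet above.
    static final Pattern DATE = Pattern.compile(
        "(((|0)[1-9])|([12][0-9])|(30)|(31))\\-(((|0)[1-9])|(10)|(11)|(12))"
        + "\\-((((19)|(20))[0-9][0-9])|([0-9][0-9]))");

    // Returns true only when the whole string matches the date pattern.
    static boolean isDate(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"21-02-2018", "1-1-99", "32-01-2018", "31-13-2018"};
        for (String s : samples) {
            System.out.println(s + " -> " + isDate(s));
        }
    }
}
```

If the pattern behaves as expected here but still has no effect during recognition, the cause is more likely in how the language is attached to the block than in the expression.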

For the custom dictionary, we have a word list in Tesseract's .user-words format (one word per line). What is the proper way to load the .user-words file?

Thanks very much.
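
Whichever engine call ends up consuming the words, the .user-words file first has to be read into memory. The following is a minimal sketch of that step only, assuming the file is UTF-8 encoded with one word per line. UserWordsReader, cleanWords and readWords are hypothetical names, and the engine-side step of adding each word to a dictionary is deliberately left out because this thread does not confirm it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: loads a one-word-per-line list and normalizes it.
final class UserWordsReader {
    // Trims each line, skips blank lines, and drops duplicates while keeping file order.
    static Set<String> cleanWords(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads the file as UTF-8 (an assumption about the word list's encoding).
    static Set<String> readWords(Path file) throws IOException {
        return cleanWords(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Expects the path to a .user-words file as the first argument.
        for (String word : readWords(Path.of(args[0]))) {
            System.out.println(word);
        }
    }
}
```

Each entry of the resulting set would then be passed to whatever dictionary-building call the user guide describes for custom dictionaries.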

2 comments

Permanently deleted user

February 22, 2018 03:01
Judging from the IFRDocument interface file, only document.Analyze seems to accept recognizerParams as input, so I added

document.Analyze(null, null, recognizerParams);

before the document.Recognize( null, null ); statement.

During execution, the following error occurred:

The page cannot be analyzed. No basic languages with an alphabet are available. Please specify an alphabet.

Having searched for 'alphabet' in the user manual and interface files, I cannot find any clue on how to resolve this.

May I know if I'm on the right track?

Thanks very much.
Permanently deleted user

March 1, 2018 17:11
Hi!

First, if you add regions manually with the AddNew() method, you also have to set the recognition parameters for each resulting block manually:

document.getPages().getElement(0).getLayout().getBlocks().getElement(0).GetAsTextBlock().getRecognizerParams().setTextLanguage(textLanguage);

Second, when you create a new BaseLanguage object, it is not enough to create and attach dictionaries; you must also set an alphabet via the setLetterSet method:

baseLanguage.setLetterSet(BaseLanguageLetterSetEnum.BLLS_Alphabet, "abcdefghi123456");

Hope it helps!
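
The alphabet string in the reply above is only a placeholder. For a field restricted to a pattern, the alphabet needs to cover every character the pattern can produce. One way to avoid typing it by hand is to collect the distinct characters from a few representative values. The following is a minimal sketch of that idea. LetterSetBuilder and alphabetOf are hypothetical names, and the result covers only characters that actually occur in the samples.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: derives an alphabet string from sample field values.
final class LetterSetBuilder {
    // Returns every distinct character found in the samples, sorted by code point order of chars.
    static String alphabetOf(List<String> samples) {
        TreeSet<Character> chars = new TreeSet<>();
        for (String sample : samples) {
            for (char c : sample.toCharArray()) {
                chars.add(c);
            }
        }
        StringBuilder alphabet = new StringBuilder();
        for (char c : chars) {
            alphabet.append(c);
        }
        return alphabet.toString();
    }

    public static void main(String[] args) {
        System.out.println(alphabetOf(List.of("21-02-2018", "1-1-99")));
    }
}
```

The returned string can be passed as the second argument of the setLetterSet call shown in the reply. For the date pattern in the question, the complete alphabet is the ten digits plus the hyphen, so writing "0123456789-" directly is just as reasonable.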