Setting a Regex and a Custom Dictionary for Different Regions of an Image in Java

I need to define one region of an image that is recognized with a regular expression and another region that is recognized with a custom dictionary. For the regex region, I tried to implement the logic from the section `How to attach a dictionary to a recognition language` in the user guide, as shown below, but it does not affect the result at all. Is the following code snippet correct?

IRecognizerParams recognizerParams = engine.CreateRecognizerParams();
ILanguageDatabase languageDatabase = engine.CreateLanguageDatabase();
ITextLanguage textLanguage = languageDatabase.CreateTextLanguage();
IBaseLanguages baseLanguages = textLanguage.getBaseLanguages();
IBaseLanguage baseLanguage = baseLanguages.AddNew();
IDictionaryDescriptions dictionaryDescriptions = baseLanguage.getDictionaryDescriptions();
IDictionaryDescription dictionaryDescription = dictionaryDescriptions.AddNew(DictionaryTypeEnum.DT_RegularExpression);
IRegExpDictionaryDescription regExpDictionaryDescription = dictionaryDescription.GetAsRegExpDictionaryDescription();
regExpDictionaryDescription.SetText("(((|0)[1-9])|([12][0-9])|(30)|(31))\\-(((|0)[1-9])|(10)|(11)|(12))\\-((((19)|(20))[0-9][0-9])|([0-9][0-9]))");
baseLanguage.setAllowWordsFromDictionaryOnly(true);
// baseLanguage.setLetterSet(type, result); // no idea what the result parameter should be
recognizerParams.setTextLanguage(textLanguage);
region.AddRect(0, 100, 500, 125);
region.AddRect(0, 200, 500, 225);
document.getPages().getElement(0).getLayout().getBlocks().AddNew(BlockTypeEnum.BT_Text, region, 0);
document.Recognize( null, null );
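
Independently of the engine, the pattern itself can be sanity-checked against sample strings with `java.util.regex`. This is only a sketch: the engine's regular-expression dialect may differ from Java's, so the assumption here is that this particular pattern uses only constructs (grouping, alternation, character classes, an escaped hyphen) that behave the same way in both. The class name `DateRegexCheck` and the helper `accepts` are made up for illustration.

```java
import java.util.regex.Pattern;

class DateRegexCheck {
    // Same pattern text as passed to SetText above (dd-mm-yyyy or dd-mm-yy).
    static final Pattern DATE = Pattern.compile(
        "(((|0)[1-9])|([12][0-9])|(30)|(31))\\-(((|0)[1-9])|(10)|(11)|(12))\\-"
        + "((((19)|(20))[0-9][0-9])|([0-9][0-9]))");

    // True only if the whole string matches the pattern.
    static boolean accepts(String s) {
        return DATE.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"05-11-2019", "31-12-99", "32-01-2020", "15-13-2020"};
        for (String s : samples) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Note that the pattern only constrains the shape of the date; a string such as 31-02-2020 still matches because no calendar validation is involved.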

For the custom dictionary, we have a word list in Tesseract's .user-words format (one word per line). What is the proper way to consume the .user-words file?
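
Whatever the engine-side answer turns out to be, a one-word-per-line list can at least be loaded into memory with standard Java I/O. The following is a minimal sketch, assuming the file is UTF-8 and that blank lines and surrounding whitespace should be ignored. The class `UserWordsReader` and its methods are hypothetical names, and handing the words to the engine's dictionary is deliberately not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class UserWordsReader {
    // Trims each line and drops empty ones, keeping the original order.
    static List<String> parse(List<String> lines) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads a one-word-per-line file (assumed UTF-8) into a word list.
    static List<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary file standing in for the real word list.
        Path tmp = Files.createTempFile("sample", ".user-words");
        Files.write(tmp, List.of("invoice", "", "  receipt  ", "total"), StandardCharsets.UTF_8);
        System.out.println(load(tmp));
        Files.delete(tmp);
    }
}
```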

Thanks very much.

Related topic: https://forum.ocrsdk.com/thread/how-to-only-recognize-specified-region-of-the-image-in-java/

Comments

  • Barry Choi

    Judging from the interface file for IFRDocument, only document.Analyze seems to accept recognizerParams as input, so I added

    document.Analyze(null, null, recognizerParams);

    before the document.Recognize( null, null ); statement.

    During execution, the following error occurred:

    The page cannot be analyzed. No basic languages with an alphabet are available. Please specify an alphabet.

    I searched for 'alphabet' in the user manual and the interface files but could not find any clue to resolve this.

    May I know if I'm on the right track?

    Thanks very much.

  • Denis Gusak

    Hi!

    First, if you add regions manually using the AddNew() method, you also have to specify the recognition parameters for each of them manually:

    document.getPages().getElement(0).getLayout().getBlocks().getElement(0).GetAsTextBlock().getRecognizerParams().setTextLanguage(textLanguage);

    Second, when you create a new BaseLanguage object, you need not only to create and set its dictionaries but also to set an alphabet via the setLetterSet method:

    baseLanguage.setLetterSet(BaseLanguageLetterSetEnum.BLLS_Alphabet, "abcdefghi123456");
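
The sample alphabet above would not cover the date pattern from the question, which uses all ten digits and the hyphen. One way to avoid typing the alphabet by hand is to collect the distinct characters from a few representative strings (or from a word list). This is a sketch under the assumption that the alphabet should contain every character an allowed word may contain; `LetterSetBuilder` and `letterSetFor` are hypothetical names, not part of the engine API.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class LetterSetBuilder {
    // Returns the distinct characters of all words, in order of first appearance.
    static String letterSetFor(List<String> words) {
        Set<Integer> seen = new LinkedHashSet<>();
        for (String word : words) {
            word.codePoints().forEach(seen::add);
        }
        StringBuilder letters = new StringBuilder();
        for (int codePoint : seen) {
            letters.appendCodePoint(codePoint);
        }
        return letters.toString();
    }

    public static void main(String[] args) {
        // Two sample dates stand in for the strings the region is expected to contain.
        System.out.println(letterSetFor(List.of("05-11-2019", "31-12-87")));
    }
}
```

The resulting string could then be passed as the second argument of setLetterSet in place of the literal shown above. Samples must be chosen so that every expected character occurs at least once; for the date pattern, listing the digits and the hyphen explicitly is just as easy.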

    Hope it helps!
