Form classification - snippet extraction

I have packages of pages, say 50 per package. All of the pages are forms.

I need to identify the form I'm interested in within each package of 50, i.e. classify the page images. They are all template forms.

Once I've got the right page, I then extract a text snippet.

I can see how the second part happens, but not the first. Any pointers on adding a form or document classification stage to a processing pipeline?


Comments

  • Nikolay_Kh

    What you describe goes a step beyond OCR and is closer to a data capture scenario. We are planning to add layout training and document classification features to ABBYY Cloud OCR SDK, but I can't say anything about the timing right now. Meanwhile, I have two suggestions for solving your task:

    1. Run full OCR on your document (or on part of it), look for document-specific text (form ID code, form title, a specific question, etc.), and select the template type accordingly. This is one of the high-level approaches used by the product in the next list item.

    2. Alternatively, have a look at ABBYY FlexiCapture Engine, a non-cloud data capture SDK designed to solve the task you describe.

    I believe both approaches would do the job for you. I suggest starting with the first, as it is easier to implement; if you find you need more data capture functionality, go for FlexiCapture Engine.
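The first suggestion can be sketched roughly as follows. This is only an illustration in Python, assuming the OCR text of each page has already been obtained (for example from Cloud OCR SDK); it does not call any OCR API itself. The template names and signature phrases in `TEMPLATES`, the helper names, and the 0.8 threshold are all hypothetical. Approximate matching from the standard `difflib` module stands in for whatever matching you prefer, so that small OCR errors (such as `F0RM` for `FORM`) do not prevent a match.

```python
import re
from difflib import SequenceMatcher

# Hypothetical signatures: phrases expected to appear on each template form.
TEMPLATES = {
    "claim_form": ["claim form", "form id cf-100"],
    "consent_form": ["patient consent", "form id pc-200"],
}


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def phrase_score(phrase, page_words):
    """Best similarity between the phrase and any word window of equal length."""
    target = normalize(phrase)
    n = len(target.split())
    best = 0.0
    for i in range(max(1, len(page_words) - n + 1)):
        window = " ".join(page_words[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def classify_page(ocr_text, templates=TEMPLATES, threshold=0.8):
    """Return (template_name, score); the name is None if nothing is close enough."""
    words = normalize(ocr_text).split()
    best_name, best_score = None, 0.0
    for name, phrases in templates.items():
        # Average the per-phrase scores so every signature phrase counts.
        score = sum(phrase_score(p, words) for p in phrases) / len(phrases)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


def find_pages(package, wanted, **kwargs):
    """Indices of the pages in a package that classify as the wanted template."""
    return [
        i for i, text in enumerate(package)
        if classify_page(text, **kwargs)[0] == wanted
    ]
```

Calling `find_pages(package, "claim_form")` on a list of per-page OCR strings returns the indices of the matching pages, which can then be passed to the snippet extraction step. The threshold would need tuning on real scans, and forms that share most of their wording would need more distinctive signature phrases.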

  • Mithun

    Is there any update on template support being integrated into Cloud OCR SDK?

  • Alexey Zimarev

    It has been more than two years now, but there is no sign of such an addition to the cloud API. This is quite disappointing.
