I was thinking of using the service to generate records from received invoices, which carry the same information in many different formats, including free-text documents written by consultants. Since they all share the fields required by law, I planned to use regular-expression queries, but that failed: I found that field extraction works well only when the field position is specified accurately.
Are you going to work toward identifying fields by their expected characteristics, such as nearby titles or expected type and size, or should I plan on mapping the entire page to a database of words and their positions and managing the search myself?
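To make the second option concrete, here is a minimal sketch of searching a word-and-position list for a value next to a known title. It is only an illustration under stated assumptions: the `Word` structure, the `find_value_near_label` helper, the pixel tolerances, and the sample coordinates are all hypothetical and are not part of the Cloud OCR SDK API. It assumes the recognized words and their coordinates have already been obtained from full-page recognition output.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """One recognized word with its position (hypothetical structure)."""
    text: str
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels


def find_value_near_label(words, label, pattern, max_dx=300.0, max_dy=10.0):
    """Return the nearest word to the right of `label`, on roughly the same
    line, whose text fully matches `pattern`. Returns None if nothing fits."""
    regex = re.compile(pattern)
    best = None  # (horizontal distance, text)
    for anchor in words:
        # Compare the title case-insensitively and ignore a trailing colon.
        if anchor.text.rstrip(":").lower() != label.lower():
            continue
        for w in words:
            dx = w.x - anchor.x
            dy = abs(w.y - anchor.y)
            # Keep only words to the right of the title and on the same line.
            if w is anchor or dx <= 0 or dx > max_dx or dy > max_dy:
                continue
            # Check the expected type of the field with a regular expression.
            if not regex.fullmatch(w.text):
                continue
            if best is None or dx < best[0]:
                best = (dx, w.text)
    return best[1] if best else None


# Made-up sample data standing in for a recognized invoice page.
words = [
    Word("Invoice", 50, 100), Word("No:", 120, 100), Word("INV-2041", 170, 101),
    Word("Date:", 50, 130), Word("2012-03-14", 110, 129),
    Word("Total:", 50, 400), Word("1,250.00", 130, 402), Word("EUR", 210, 402),
]

total = find_value_near_label(words, "Total", r"[\d.,]+")
date = find_value_near_label(words, "Date", r"\d{4}-\d{2}-\d{2}")
```

The position of the title is never hard-coded here; only the relationship between the title and the value is, which is what lets the same rule cover invoices with different layouts. Real documents would likely need more rules, for example values placed below the title or titles spanning several words.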
The current Cloud OCR SDK API covers text recognition only, whether for the full text or for a single field. It does nothing to locate a zone on an image.
We have a data capture SDK (ABBYY FlexiCapture Engine) that has the required capability. It is not available in the Cloud yet, but we are considering it. That will take some time.
Right now I see two possible ways of doing what you want:
Best regards, Dmitry. ABBYY, Lead Product Analyst, SDK products.
Actually, ABBYY has been working in that direction for a long time. We have a product called FlexiCapture and an SDK called FlexiCapture Engine. Both solve the task you have just described: they help extract particular data from semi-structured documents. Using FlexiLayout Studio you can define the fields you want to extract and the rules for locating them on an image. It is not just regular expressions; it can define complicated dependencies with voting among different layout hypotheses, and even cross-checking of fields and database look-ups for values.
Unfortunately, this is not yet available in the Cloud, because it requires special training in FlexiLayout programming.
So please contact your nearest ABBYY representative to talk about the FlexiCapture product or the Engine.