My task is to extract data from table-like structures in images. The images contain two scripts, Cyrillic and Latin, as well as numbers. My current value for the language in the request is 'Russian,Digits,English'.
Unfortunately, OCR performance isn't perfect, especially with numbers: quite often I get something like "^,32" instead of "4,32". What's worse, the behavior isn't particularly consistent either. It feels as if the OCR engine uses image-specific context on the backend, because some images come out totally fine while others have quite a lot of similar errors. My data is also quite diverse (it's technical documentation for construction projects), so I am not sure I can unambiguously fix such errors, or even detect that there is an issue, with regular expressions. The XML response sometimes has a "suspicious" attribute on a character, but it appears quite often and the character seems fine in most cases. I'd also like to avoid running heavy client-side OCR; that is why we turned to the rather expensive ABBYY OCR in the first place.
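To illustrate the kind of regular-expression check meant here, a minimal Python sketch follows. It assumes the table cells have already been extracted as plain strings, and that a valid number is an optionally signed run of digits with at most one decimal comma or point. The names `NUMBER`, `is_suspect_number` and `flag_cells` are hypothetical and not part of any OCR API. It only detects cells that contain digits, contain no letters, and still do not parse as a number; it cannot tell which digit was lost.

```python
import re

# Assumption: valid numbers look like "4,32", "4.32", "-12" or "150".
NUMBER = re.compile(r"[+-]?\d+(?:[.,]\d+)?")


def is_suspect_number(token: str) -> bool:
    """Return True for tokens that look numeric but are malformed.

    A token is suspect when it has at least one digit, no letters
    (Latin or Cyrillic), and does not fully match NUMBER.
    """
    has_digit = any(ch.isdigit() for ch in token)
    has_letter = any(ch.isalpha() for ch in token)
    return has_digit and not has_letter and NUMBER.fullmatch(token) is None


def flag_cells(cells: list[str]) -> list[int]:
    """Return the indices of cells that should be reviewed."""
    return [i for i, cell in enumerate(cells) if is_suspect_number(cell.strip())]


if __name__ == "__main__":
    row = ["Бетон", "B25", "^,32", "4,32", "150"]
    for i in flag_cells(row):
        print(f"cell {i} needs review: {row[i]!r}")
```

Cells that mix letters and digits, such as "B25", are deliberately left alone, because in diverse technical documentation they are often legitimate designations. That is also the limit of this approach: an error that produces another well-formed number passes unnoticed.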
So, my questions: is there a way to gain more direct access to the OCR engine, perhaps with custom dictionaries and data such as the model's confidence for each character under each specified script? And does the system charge for a page again if you send a request with the same image but a different set of scripts (for example, only "Digits")?