The image of a phonebook page column is here: http://digitalfire.com/culiacan/pictures/309.jpg (scanned at 600 dpi, then resized in Photoshop to 300 dpi without resampling). We are passing a parameter to read Spanish.
The recognizer is not getting the phone numbers correct on the last 30 or 40 lines: they are chopped off on the right, with more digits missing on numbers nearer the bottom. Also, in the 60 or so OCR runs we have tested so far, we are getting a high frequency of errors where it reads '-0' as '4)' (for example, 7494)211 instead of 749-0211), 'LI' as 'U' and '96' as '%'. The same image reads with far fewer errors using another recognition service. It is also failing to interpret the period-tab as a tab.
Any ideas? Thanks.
Comments
Hi there! We've thoroughly examined your case. Your scanned image looks a lot like it was taken with a camera, so our image preprocessing engine tries to enhance it with photo preprocessing algorithms, which occasionally corrupt some details.
The first thing you need to do is to add
&imageSource=scanner
to your processImage call. That will disable photo preprocessing and should slightly improve your results. We're also currently working on another option that should improve results further for your type of image. I'll let you know as soon as it's released to production (that should take several days, I think). Please let me know if you have any additional questions.
Have done this, it does seem better, will check more pages. However, the phone numbers on the last 50 or so lines are still being cut off. The page is at a slight angle, is it sensitive to this?
You might also want to try http://digitalfire.com/culiacan/pictures/270.jpg where we are getting strange ZZZZZ sequences where the period-tabs are.
The ZZZZ appear because there is a slight skew in the document. The engine tries to fix it, makes a wrong guess because of the uncommon shape of the image, and produces ZZZZ and other recognition errors. There will be an option to disable automatic deskewing soon, and it should make your results better.
However, we are unable to reproduce your problem with missing numbers on the last lines. If you set "imageSource=scanner", all the lines and numbers from top to bottom appear in the result text file. Our server logs show that for your application there were no tasks with the imageSource=scanner option in the last day.
There is now an option to disable automatic skew correction, so you can get the most out of your images.
Specify the following parameters: "?language=Spanish&imageSource=scanner&correctSkew=false"
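As a rough illustration of the parameter string above, here is a minimal Python sketch that assembles the request URL and checks that the scanner option really is in it before the request goes out. The base URL and both function names are hypothetical placeholders (the thread does not give the actual endpoint); only the three parameter names and values come from this discussion.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical placeholder: the real processImage endpoint is not given in this thread.
BASE_URL = "https://ocr.example.com/processImage"


def build_process_image_url(base_url, language="Spanish"):
    """Append the three options discussed in this thread to the base URL."""
    params = {
        "language": language,      # recognition language
        "imageSource": "scanner",  # skip the photo preprocessing step
        "correctSkew": "false",    # turn off automatic skew correction
    }
    return base_url + "?" + urlencode(params)


def has_scanner_option(url):
    """Return True if the URL's query string carries imageSource=scanner."""
    query = parse_qs(urlsplit(url).query)
    return query.get("imageSource") == ["scanner"]


url = build_process_image_url(BASE_URL)
print(url)
print(has_scanner_option(url))
```

Logging the final URL this way may help when, as in the reply above, the server logs show no tasks with the expected option: it makes it easy to see whether the parameter was dropped on the client side.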
I tried this and it seems a lot better (309.jpg). We got about 6 or 8 hyphens interpreted as bullets. Shouldn't the fact that there are a hundred other hyphens on the page influence the recognizer's judgement when deciding whether to interpret a mark as a hyphen or a bullet? Another thing it is still doing is failing to recognize some of the spaces between words, even when they seem obvious to the eye.