PHP scan requests failing to return text

We are scanning phonebooks in Mexico, one column at a time, at 600 dpi. We started with the PHP sample code. The processImage call specifies Spanish and txt format, but about 50% of the time the curl_exec call returns an empty string. Here is a sample scan: http://digitalfire.com/culiacan/pictures/168.jpg

Also, we are interested in any recommendations specific to scanning phonebooks.

Thanks.

UPDATE

I emailed the code but have not heard back, so I will paste it here in sections:

function OCR($fileName) {

  $applicationId = 'edited out';
  $password = 'edited out';
  $url = 'http://cloud.ocrsdk.com/processImage?language=spanish&exportFormat=txt';

  $ch = curl_init();
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_USERPWD, "$applicationId:$password");
  curl_setopt($ch, CURLOPT_POST, 1);
  $post_array = array("my_file"=>"@".$fileName,);
  curl_setopt($ch, CURLOPT_POSTFIELDS, $post_array);

  $response = curl_exec($ch); curl_close($ch);

  $xml = simplexml_load_string($response); 
  $arr = $xml->task[0]->attributes();
  $taskid = $arr["id"];
  $url = 'http://cloud.ocrsdk.com/getTaskStatus'; 
  $qry_str = "?taskid=$taskid";

  do {
     sleep(5);
     $ch = curl_init();
     curl_setopt($ch, CURLOPT_URL, $url.$qry_str);
     curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
     curl_setopt($ch, CURLOPT_USERPWD, "$applicationId:$password");
     $response = curl_exec($ch);
     curl_close($ch);
     $xml = simplexml_load_string($response);
     $arr = $xml->task[0]->attributes();
  } while($arr["status"] != "Completed");

  $url = $arr["resultUrl"];
  $ch = curl_init(); 
  curl_setopt($ch, CURLOPT_URL, $url); 
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); 
  $response = curl_exec($ch); curl_close($ch);
  return $response; 
}

Comments

16 comments

  • SDK Support Team

    Let me clarify your question.

    You submit the image, wait until it is processed, and then, when you try to get the results using a URL like https://ocrsdk.blob.core.windows.net/files/some_guid.result, you get an empty string. Is this correct?

  • SDK Support Team

    The settings that influence recognition quality most of all are recognition language and dpi.

    You already specify Spanish as the recognition language.

    However, resolution information is absent from your file, so the recognition engine has to guess it. It is better to set the dpi in the file explicitly.

    The phonebook text is very small, about 4pt in your image. The recognizer works best when the text is 8-12 points scanned at 300 dpi, so in your case it makes sense to set the resolution of this image to about 300 dpi to get better results.

  • SDK Support Team

    In any case, you need to check the return value of curl_exec. As stated in the documentation (http://www.php.net/manual/en/function.curl-exec.php), it returns FALSE on failure. This is probably what is happening in your case.

    If it returns FALSE, call the curl_error function (http://www.php.net/manual/en/function.curl-error.php), as shown in the example in that article.
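
The check described above could look roughly like the sketch below. It is an illustration only, not part of the sample code: the helper name fetch_with_error_check and the timeout values are assumptions made for this example.

```php
<?php
// Hypothetical helper: run a GET request and report failures explicitly
// instead of passing FALSE or an empty body on to the caller.
function fetch_with_error_check(string $url, ?string $userpwd = null): array
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // assumed value
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);        // assumed value
    if ($userpwd !== null) {
        curl_setopt($ch, CURLOPT_USERPWD, $userpwd);
    }

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_exec() failed: curl_error() holds the reason.
        return ['ok' => false, 'body' => null, 'error' => curl_error($ch), 'http_code' => 0];
    }

    // The transfer succeeded; the HTTP status still has to be checked by the caller.
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return ['ok' => true, 'body' => $body, 'error' => '', 'http_code' => $code];
}
```

Calling this helper for each of the three requests and logging the 'error' and 'http_code' values should show which request fails and why.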

  • Nikolay_Kh

    Did you modify the PHP example (besides the login and password credentials)? If you did, try running the unmodified version.

    You could also use the browser developer tools in Chrome or the Firebug plugin for Firefox to see the exact response. If you echo the response and do not see anything in the browser window or in the HTML source, that does not necessarily mean the string is empty. The response is in XML format, and some browsers (Firefox/WebKit) do not display XML as text. Try using IE; you may see the actual XML response there.

    Please let us know if you work it out!

  • thansen

    We used the sample code as a starting point to make a function that returns the text. I will paste it into an answer. I am not checking the error message from curl; I will add that. Because the same specific images always fail to scan, I assumed the problem was with the OCR engine, not a transmission issue.

  • thansen

    I do not understand how they can be 4pt. The image is more than 1100 pixels wide with only about 75 characters per line. The characters are 30 pixels high; they are huge. I am saving them by exporting a jpg for the web. How many pixels high are letters of 8-12 points at 300 dpi? Thanks.

  • thansen

    The PHP function we made is based on the sample code and it is working (for about half, or maybe 75%, of the images; we are still testing). The response is saved to a database and put into a textbox where people do the checking.

  • SDK Support Team

    The character size refers to the size on paper. In your phonebook sample the characters are small. So the idea is to lower the declared dpi and make the OCR engine think the characters are twice as big as they actually are.

    All you need to do is modify your jpg files and write 300 dpi into their metadata.
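
The arithmetic behind this can be worked through with the standard definition of 1 point = 1/72 inch. The sketch below uses the 30-pixel character height and the 600 dpi scan resolution mentioned earlier in the thread; the function names are made up for this example, and a measured glyph height is only an approximation of the nominal font size.

```php
<?php
// 1 typographic point = 1/72 inch, so the pixel size of text depends on
// the dpi that the image declares (or that the engine assumes).
function points_to_pixels(float $points, float $dpi): float
{
    return $points / 72.0 * $dpi;
}

function pixels_to_points(float $pixels, float $dpi): float
{
    return $pixels / $dpi * 72.0;
}

// Recommended range: 8-12 pt at 300 dpi.
printf("8 pt at 300 dpi  = %.1f px\n", points_to_pixels(8, 300));   // 33.3 px
printf("12 pt at 300 dpi = %.1f px\n", points_to_pixels(12, 300));  // 50.0 px

// A 30 px character read at the true 600 dpi versus a declared 300 dpi.
printf("30 px at 600 dpi = %.1f pt\n", pixels_to_points(30, 600));  // 3.6 pt
printf("30 px at 300 dpi = %.1f pt\n", pixels_to_points(30, 300));  // 7.2 pt
```

So 30-pixel characters at 600 dpi are about 3.6 pt, which matches the "about 4pt" estimate, while declaring 300 dpi makes the same pixels read as roughly 7 pt, close to the recommended range.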

  • Nikolay_Kh

    Please clarify what you mean by "function is working 50%-75% of time". A cURL request for the processImage method can return XML (even in case of an error; we have checked the server logs and there were no errors for your tasks), a boolean false (which would mean a problem in your code), or an HTTP 401 Unauthorized header in case of incorrect login credentials (which is also not your case, as you are receiving some results).

    I assume that there is something wrong with your code. Would you please send it to cloudocrsdkbeta@abbyy.com so we can analyze it? Please indicate in comments which cURL request is not working for you (there are 3 of them in the sample). Also, please remove any login credentials or other private information from your source code before sending.

    Thanks!

  • thansen

    Thanks. In Photoshop I am changing the image size to 300 dpi (from 600 dpi) without resampling, so the image still has the same number of pixels as before. This seems counterintuitive: OCR software must concern itself with pixels, not measurements. Is there a page that explains this, to help us scan in the best way? In any case, these images appear to be scanning with fewer errors, but I am not sure yet.

  • thansen

    I have sent it by separate email. It will be great to get the error checking code worked out, and I will be happy to share my function when it is working well.

  • Nikolay_Kh

    I've updated your question with the comments you added below. Please avoid using login credentials in your posts and comments. Thanks!

  • Nikolay_Kh

    I've checked the email conversation; your last message was answered yesterday. I'll duplicate the answer here:

    Your code seems to be right. I ran it with an echo OCR('filename.php'); statement about 20-30 times, and it worked every time. Here are a number of suggestions that may help you locate the problem:

    1. Check the Apache settings; you may have a low timeout interval or a file size restriction.
    2. Try placing echo statements between the cURL requests (for the XML responses, task ID, task statuses, etc.). There are 3 cURL calls; which one does not work, or where exactly does the error occur?
    3. Look at the Apache logs: write something like php_value error_log log.txt in the .htaccess file, which would dump server errors to log.txt.
    4. Turn on PHP error reporting with ini_set('display_errors',1); error_reporting(E_ALL); these may help to locate the error as well.
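
Suggestions 2 and 4 could be combined along the lines of the sketch below, which validates each XML response before using it. The helper name parse_task_response is hypothetical, and the response shape (a root element containing a task element with id, status and resultUrl attributes) is inferred from the code in the question, not taken from the service documentation.

```php
<?php
// Show every notice and warning while debugging (suggestion 4).
ini_set('display_errors', '1');
error_reporting(E_ALL);

// Hypothetical helper: parse a task response and fail loudly when the XML
// is missing or malformed, instead of dereferencing a false value.
function parse_task_response($response): array
{
    if ($response === false || $response === '') {
        throw new RuntimeException('Empty or failed HTTP response');
    }

    // Collect XML parse errors internally rather than emitting warnings.
    libxml_use_internal_errors(true);
    $xml = simplexml_load_string($response);
    if ($xml === false || count($xml->task) === 0) {
        throw new RuntimeException('Unexpected response: ' . substr($response, 0, 200));
    }

    $attrs = $xml->task[0]->attributes();
    return [
        'id'        => (string) $attrs['id'],
        'status'    => (string) $attrs['status'],
        'resultUrl' => (string) $attrs['resultUrl'], // empty until the task completes
    ];
}
```

Wrapping each call in try/catch and echoing the exception message shows which of the three requests produced the unexpected response.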

    Please let us know if you could work it out. Thanks!

  • thansen

    I have 180 seconds allocated to the page; it normally takes less than 60. I added some echo statements, and it is failing to retrieve the text from the URL in the last block of curl calls. If I visit the value being echoed for $arr['ResultURL'] in my browser, the text is there. So I replaced the last 5 lines with 'return file_get_contents($url);'. That also works, but the file_get_contents call still sometimes fails to return the text, even though it is there at the URL when I check. Is it possible that the text is not there in time for the call to get it? Do I need to delay before trying to get it?

  • Nikolay_Kh

    You don't need to wait. As soon as the 'status' attribute in the getTaskStatus response has the value 'Completed', you can download the result.
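
A polling loop that follows this rule but cannot spin forever might be sketched as below. The helper name wait_for_task is hypothetical, the list of failure statuses is an assumption that should be checked against the service documentation, and the default of 36 attempts at 5 seconds simply matches the 180-second page limit mentioned above.

```php
<?php
// Hypothetical helper: poll until the task reaches a terminal status.
// $getStatus is any callable returning the current status string, for
// example a wrapper around the getTaskStatus request.
function wait_for_task(callable $getStatus, int $maxAttempts = 36, int $delaySeconds = 5): string
{
    // Assumed failure statuses; verify them against the service documentation.
    $failed = ['ProcessingFailed', 'Deleted', 'NotEnoughCredits'];

    for ($i = 0; $i < $maxAttempts; $i++) {
        $status = $getStatus();
        if ($status === 'Completed') {
            return $status; // the result can be downloaded immediately
        }
        if (in_array($status, $failed, true)) {
            throw new RuntimeException("Task ended with status $status");
        }
        sleep($delaySeconds);
    }
    throw new RuntimeException("Task not completed after $maxAttempts attempts");
}
```

Unlike a loop that only exits on 'Completed', this version stops on a failed task or after the attempt limit, so a stuck task shows up as an exception rather than a page timeout.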

    file_get_contents returns a string. If you are exporting in a complex format like pdfSearchable, the file would produce a very long string that may exceed the memory_limit setting in your php.ini.

    Our service cannot return an empty response; it either provides results or sends back an error code (which is not your case, as all your tasks were recognized correctly). I suggest you dig into your server settings or try testing in a more flexible environment.

  • Felix

    Sometimes cURL does not open https URLs, but gives no error.

    $url = $arr["resultUrl"]; // This is an https URL (line 68 of the PHP example)

    Add the following before the last curl_exec($ch), which gets the result:

    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
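
Note that setting both options to 0 turns off certificate validation for the download. If the failure is caused by cURL not finding a CA bundle on the server, an alternative is to keep verification on and point cURL at a CA file. The sketch below only builds the option array; the helper name, the timeout and the CA file path are assumptions for illustration.

```php
<?php
// Hypothetical helper: options for downloading the result over https with
// certificate checks left on. $caBundle is an assumed path to a CA file
// on your server; pass null to use the system default.
function result_download_options(string $url, ?string $caBundle = null): array
{
    $options = [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => true, // verify the server certificate
        CURLOPT_SSL_VERIFYHOST => 2,    // verify the certificate matches the host
        CURLOPT_TIMEOUT        => 60,   // assumed value
    ];
    if ($caBundle !== null) {
        $options[CURLOPT_CAINFO] = $caBundle;
    }
    return $options;
}

// Usage sketch (the path is a placeholder):
// $ch = curl_init();
// curl_setopt_array($ch, result_download_options($url, '/path/to/cacert.pem'));
// $response = curl_exec($ch);
// if ($response === false) { echo curl_error($ch); }
```

If curl_error then reports a certificate problem, that confirms the https download is the failing step.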
    