Is it possible to have an ABBYY product scan my collection of PDF files automatically and report duplicate files?
Background:
I have a lot of PDF files, and some of the files in the collection are more or less the same if you look at their contents (after OCR). I would like to scan all the PDF files automatically and get a report of files that are, for example, at least 90% the same, based on their contents after an OCR scan.
Basically something like Anti-Twin does for .jpg files, but now for PDF files.
Thanks in advance.
Comments
Our engine product doesn't have that option out of the box. You would have to programmatically check the file hash.
Otherwise, look at our FineReader Server product. It has built-in functionality to check for duplicate files via hash.
Getting a report on duplicate files with FineReader Server 14 Audit Workflow – Help Center (abbyy.com)
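For anyone who wants to try the "programmatically check the file hash" route without a server product, here is a minimal sketch in plain Python. It does not use any ABBYY API; the function names `file_sha256` and `find_exact_duplicates` are made up for illustration. Keep in mind that a hash only matches files that are byte-for-byte identical, so it will not catch PDFs that merely have similar contents.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Group the *.pdf files under `folder` (hypothetical helper) by hash.

    Returns a list of groups; each group is a list of paths whose
    bytes are identical.
    """
    groups = defaultdict(list)
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        groups[file_sha256(pdf)].append(pdf)
    # Only hashes shared by more than one file are duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("C:/scans")` (the folder name is just an example) would return one list of paths per set of identical files.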
Hey Mathijs Groen, is this still relevant for you?
Steffen Salomo, yes it is.
I have done the following: I read all PDF pages with FlexiLayout and wrote their text into a text field. Then I looped over the documents and compared the content of each page with the content of the other pages. Keywords for this are Levenshtein distance and cosine similarity. In the code you can then set a similarity percentage as a threshold, for example. If the similarity exceeds this threshold, you have found duplicates. If you would like more information on this, I would be happy to send you a snippet if you have an e-mail address for me.
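To make the threshold idea above concrete, here is a rough sketch in plain Python. It is not the FlexiLayout script mentioned in the comment, and it assumes the OCR text of each PDF has already been extracted into a dictionary mapping file name to text. It uses a simple word-count cosine similarity; the names `cosine_similarity` and `find_near_duplicates` and the default threshold of 0.9 are illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(text_a, text_b):
    """Cosine similarity of two texts based on word counts (0.0 to 1.0)."""
    words_a = Counter(text_a.lower().split())
    words_b = Counter(text_b.lower().split())
    shared = words_a.keys() & words_b.keys()
    dot = sum(words_a[w] * words_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in words_a.values()))
    norm_b = math.sqrt(sum(c * c for c in words_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)


def find_near_duplicates(texts, threshold=0.9):
    """Compare every pair of documents and report pairs at or above threshold.

    `texts` is assumed to map a file name to its OCR text.
    """
    matches = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = cosine_similarity(text_a, text_b)
        if score >= threshold:
            matches.append((name_a, name_b, score))
    return matches
```

Every document is compared with every other one, so the number of comparisons grows quadratically with the size of the collection. For a few hundred PDFs that is fine; for very large collections a smarter pre-grouping would be needed.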
Steffen Salomo, I do not have any knowledge of FlexiLayout, and I do not use any scripting whatsoever. I am just an end user of ABBYY FineReader.