
check for duplicate PDF files in a collection of files

February 25, 2021 11:33

Is it possible to have an ABBYY product scan my collection of PDF files automatically and report duplicate files?

Background:

I have a lot of PDF files, and some of the files in the collection are more or less the same if you look at the contents (after OCR). I would like to scan all the PDF files automatically and get a report of files that are, for example, at least 90% the same, based on their contents after an OCR scan.

Basically something like Anti-Twin does for .jpg files, but now for PDF files.

Thanks in advance.


Comments

5 comments

Scott Chau

March 10, 2021 22:37
Our engine product doesn't have that option out of the box. You would have to programmatically check the file hash.

Otherwise, look at our FineReader Server product. It has built-in functionality to check for duplicate files via hash.

Getting a report on duplicate files with FineReader Server 14 Audit Workflow – Help Center (abbyy.com)
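
For anyone who wants to try the programmatic hash check mentioned above, here is a minimal sketch in plain Python. It does not use any ABBYY API; the function names `file_sha256` and `find_exact_duplicates` are made up for this example, and it assumes the PDFs sit in one folder tree with a lowercase `.pdf` extension.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Group PDF files under `folder` by content hash; return groups of 2+."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).rglob("*.pdf")):
        groups[file_sha256(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates("C:/scans")` (a hypothetical folder) would return a list of groups of paths with identical bytes. Note that a hash only matches files that are byte-for-byte identical, so two scans of the same document with slightly different content will not be reported this way.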

Steffen Salomo

August 09, 2024 12:24
Hey Mathijs Groen, is this still relevant for you?

Mathijs Groen

August 09, 2024 13:03
Steffen Salomo, yes, it is.

Steffen Salomo

August 09, 2024 14:09
I have done the following: I read all PDF pages with FlexiLayout and wrote their text into a text field. Then I looped over the documents and compared the content of each page with the content of the other pages. Keywords for this are Levenshtein distance or cosine similarity. In the code you can then set a similarity percentage as a threshold, for example. If this threshold is exceeded, you have found duplicates. If you would like more information on this, I would be happy to send you a snippet if you have an e-mail address for me.
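
To illustrate the comparison step described above, here is a rough sketch in plain Python. It is not the snippet offered in this comment and it does not call FlexiLayout; it assumes the OCR text of each document is already available as a string, and it compares whole documents rather than single pages. The names `levenshtein_ratio`, `cosine_similarity`, and `find_near_duplicates` are invented for this example, and the 0.9 threshold simply mirrors the 90% from the question.

```python
import math
from collections import Counter
from itertools import combinations


def levenshtein_ratio(a, b):
    """Similarity in [0, 1] derived from the Levenshtein edit distance."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def cosine_similarity(a, b):
    """Cosine similarity of lowercase word-count vectors, in [0, 1]."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_near_duplicates(texts, threshold=0.9, similarity=cosine_similarity):
    """Return (name_a, name_b, score) for each pair scoring at or above threshold."""
    pairs = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs
```

Here `texts` is a dictionary mapping a file name to its extracted text. Cosine similarity on word counts is the default because it stays fast on long documents; the Levenshtein variant is character-exact but its cost grows with the product of the two text lengths, so it is probably better suited to single pages than to whole files.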

Mathijs Groen

August 09, 2024 15:30
Steffen Salomo, I do not have any knowledge of FlexiLayout, and I do not use any scripting whatsoever. I am just an end user of ABBYY FineReader.


