Check for duplicate PDF files in a collection of files

Is it possible to have an ABBYY product scan my collection of PDF files automatically and report duplicate files?

Background:

I have a lot of PDF files, and some of them are more or less the same if you look at the contents (after OCR). I would like to scan all the PDF files automatically and get a report of files that are, for example, at least 90% the same, based on their contents after an OCR scan.

Basically something like Anti-Twin does for .jpg files, but now for PDF files.

Thanks in advance.

Comments

5 comments

  • Scott Chau

    Our engine product doesn't have that option out of the box. You would have to check the file hashes programmatically.

    Otherwise, look at our FineReader Server product. It has built-in functionality to check for duplicate files via hash.

    See "Getting a report on duplicate files with FineReader Server 14 Audit Workflow" in the ABBYY Help Center (abbyy.com).

  • Steffen Salomo

    Hey Mathijs Groen, is this still relevant for you?

  • Mathijs Groen

    Steffen Salomo, yes it is.

  • Steffen Salomo

    I have done the following: I read all PDF pages with FlexiLayout and wrote their text into a text field. Then I looped over the documents and compared the content of each page with the content of every other page. Keywords for this are Levenshtein distance or cosine similarity. In the code you can then set a similarity percentage as a threshold, for example. If the threshold is exceeded, you have found duplicates. If you would like more information on this, I would be happy to send you a snippet if you give me an e-mail address.

  • Mathijs Groen

    Steffen Salomo, I do not have any knowledge of FlexiLayout, and I do not use any scripting whatsoever. I am just an end user of ABBYY FineReader.
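
For readers who do script, the two ideas raised in the comments above (exact duplicates via a file hash, near duplicates via a text similarity score with a threshold) could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not ABBYY functionality and not the snippet offered in the thread: it assumes the OCR text of each PDF has already been extracted by some other tool, all function names are hypothetical, and it swaps in the standard library's difflib.SequenceMatcher ratio in place of Levenshtein distance or cosine similarity.

```python
import hashlib
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(folder):
    """Group the *.pdf files under `folder` that are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(folder).rglob("*.pdf")):
        by_hash[file_sha256(path)].append(path)
    return [group for group in by_hash.values() if len(group) > 1]


def similarity(text_a, text_b):
    """Score in [0, 1] for two texts, ignoring case and runs of whitespace."""
    a = " ".join(text_a.lower().split())
    b = " ".join(text_b.lower().split())
    # autojunk=False disables the heuristic that skips frequent characters
    # in long inputs, which would otherwise distort the ratio.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def find_near_duplicates(texts, threshold=0.9):
    """Return (name_a, name_b, score) for each pair scoring >= threshold.

    `texts` maps a document name to its already-extracted OCR text
    (assumption: the extraction itself happens elsewhere).
    """
    matches = []
    for (name_a, text_a), (name_b, text_b) in combinations(sorted(texts.items()), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            matches.append((name_a, name_b, score))
    return matches
```

A hash only catches files that are identical byte for byte, so two scans of the same paper document will normally not match; that case needs the text comparison. The pairwise comparison grows quadratically with the number of documents and SequenceMatcher is slow on long texts, so a large collection would likely need a faster measure or a pre-filter.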
