General
PDF (Portable Document Format) is a file format that has its origins in printing and Adobe PostScript. PostScript is a programming language that describes the elements of a document so that it can be printed in high quality ('elements' here means text and graphics).
The following illustration shows that a PDF file is also a “computer program” that describes how the content/elements of a page should look. To display the content that is embedded in the PDF, the “code” has to be rendered so that the text and images can be displayed on a screen or printed on paper. In this process, the PDF code is rasterized into a pixel-based representation.
Structure of a Simple PDF
In the following simple example, a plain text file was converted to a PDF using a printer driver. To see the “programming code” of the PDF, the extension “.pdf” was changed to “.txt” and the file was then opened with a text editor.
Technical Note: For complex PDFs, this approach does not work and is not recommended, but in this case it is enough to illustrate how a PDF is built up internally.
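As a rough companion to the text-editor view, the following Python sketch inspects a small, hand-written PDF body. The sample bytes and the helper names (`pdf_version`, `object_numbers`, `shown_text`) are illustrative assumptions, not the file from the example. The regular expressions only work for uncompressed, very simple PDFs, which matches the limitation in the technical note.

```python
import re

# Hand-written, uncompressed one-page PDF body (illustrative assumption).
# The cross-reference table and trailer are omitted, so this is not a
# complete file that a viewer would open.
SAMPLE_PDF = b"""%PDF-1.4
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]
   /Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>
endobj
4 0 obj
<< /Length 42 >>
stream
BT /F1 12 Tf 72 720 Td (Hello, PDF!) Tj ET
endstream
endobj
5 0 obj
<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>
endobj
"""


def pdf_version(data: bytes) -> str:
    """Return the version from the '%PDF-x.y' header line."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("not a PDF: missing %PDF- header")
    return match.group(1).decode("ascii")


def object_numbers(data: bytes) -> list[int]:
    """List the numbers of all 'N G obj' definitions that start a line."""
    pattern = re.compile(rb"^(\d+) (\d+) obj\b", re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(data)]


def shown_text(data: bytes) -> list[str]:
    """Collect literal strings passed to the 'Tj' (show text) operator."""
    pattern = re.compile(rb"\(([^()]*)\)\s*Tj")
    return [m.group(1).decode("latin-1") for m in pattern.finditer(data)]


if __name__ == "__main__":
    print(pdf_version(SAMPLE_PDF))
    print(object_numbers(SAMPLE_PDF))
    print(shown_text(SAMPLE_PDF))
```

In the content stream, `BT` and `ET` delimit a text object, `Tf` selects font and size, `Td` moves the text position, and `Tj` shows the string. Real-world PDFs usually compress these streams, so a byte-level search like this will not find the text there.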
The PDF code that describes the page of a document can include a lot of different elements, for example:
- Metadata
- Text (in various encodings)
- Vectors that look like text
- Vector-based drawings
- Images
- Programming code
- Movies
- 3D / CAD data
- etc.
This enumeration already gives a rough idea of how complicated the internal structure of a PDF can be. Technically, it is also possible to put only a scanned document image into a PDF (= image-only PDF). In this very simple case, no additional textual information is included.
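The difference between an image-only PDF and a PDF with text can be approximated with a crude byte-level check. The sketch below is only a heuristic under strong assumptions: the function name `looks_image_only` is hypothetical, and the check only sees dictionaries that are stored uncompressed. PDFs that keep their objects in compressed object streams will be misjudged, so a real PDF library is needed for a reliable answer.

```python
import re


def looks_image_only(raw: bytes) -> bool:
    """Heuristic: an image XObject is present and no font is referenced.

    Assumption: object dictionaries are stored uncompressed in the file.
    """
    has_image = re.search(rb"/Subtype\s*/Image\b", raw) is not None
    has_font = re.search(rb"/Font\b", raw) is not None
    return has_image and not has_font


if __name__ == "__main__":
    scanned = b"<< /Type /XObject /Subtype /Image /Width 10 /Height 10 >>"
    with_text = scanned + b"<< /Type /Font /Subtype /Type1 >>"
    print(looks_image_only(scanned))    # image, no font
    print(looks_image_only(with_text))  # image and font
```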
Finally, when a PDF viewer displays a PDF on screen, it interprets the internal program code and composes a visual representation. Technically, this process is called “PDF rendering”.
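To get a feeling for the size of the rendered result, the following sketch converts a page size from PDF units to pixels. It relies on the default PDF user space unit of 1/72 inch. The function names and the choice of 300 dpi with 3 bytes per pixel (24-bit RGB) are assumptions made for illustration only.

```python
def raster_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Convert a page size in points (1/72 inch) to pixels at the given dpi."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


def uncompressed_bytes(width_px: int, height_px: int, bytes_per_pixel: int = 3) -> int:
    """Size of the raw pixel buffer; 3 bytes per pixel assumes 24-bit RGB."""
    return width_px * height_px * bytes_per_pixel


if __name__ == "__main__":
    # US Letter page: 612 x 792 points (8.5 x 11 inches).
    width_px, height_px = raster_size(612, 792, dpi=300)
    print(width_px, height_px)                      # 2550 3300
    print(uncompressed_bytes(width_px, height_px))  # 25245000
```

A single Letter-size page rendered at 300 dpi in color therefore occupies roughly 25 MB before compression, which is one reason why rendering many pages can be costly.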
PDF Rendering/Opening and OCR
- Optical Character Recognition (OCR) was developed to work on scanned document images. “Image-only PDFs” are therefore very close to the original intention: only the PDF envelope has to be removed, and the image can then be processed as intended. This simple processing scenario applies when documents are scanned and exported as raster PDFs.
- Real-life PDFs are often generated not by a scanner but by a printer driver or by direct export from PDF-enabled applications. PDF documents of this kind have to be rendered to obtain the final visual representation. It does not matter whether a person wants to “see” the page or an OCR engine needs to process it.
- ABBYY SDKs use the Adobe® PDF Library™ to fulfill this task.
- In addition to the rendered visual representation, the PDF can still contain text, bookmarks, metadata, or other elements such as JavaScript code.
- Opening a PDF file and preparing it for OCR processing can take significant time and disk space.
- ABBYY's OCR SDKs offer different ways to process and convert PDFs:
  - Just use the image and apply OCR.
  - Compare the existing text layer with the rendered text:
    - If it fits: excellent.
    - If not: apply OCR.
  - Trust the text layer (no action).
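The three processing options listed above can be sketched as a small decision function. This is not the ABBYY SDK API: `PdfTextMode`, `choose_text`, and the similarity threshold of 0.9 are hypothetical names and values chosen for illustration. The sketch also assumes that the “rendered text” is available as a string recognized from the rendered page, and it uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever comparison an SDK performs internally.

```python
from difflib import SequenceMatcher
from enum import Enum


class PdfTextMode(Enum):
    """Hypothetical names for the three options described above."""
    OCR_ONLY = "ocr_only"                  # just use the image and apply OCR
    COMPARE = "compare"                    # check the text layer against the rendering
    TRUST_TEXT_LAYER = "trust_text_layer"  # no action


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case before comparing."""
    return " ".join(text.split()).lower()


def choose_text(mode: PdfTextMode, text_layer: str, rendered_text: str,
                threshold: float = 0.9) -> str:
    """Return the text to keep for a page.

    text_layer:    text embedded in the PDF
    rendered_text: text recognized from the rendered page (assumption)
    threshold:     minimum similarity for "it fits" (assumed value)
    """
    if mode is PdfTextMode.OCR_ONLY:
        return rendered_text
    if mode is PdfTextMode.TRUST_TEXT_LAYER:
        return text_layer
    similarity = SequenceMatcher(
        None, _normalize(text_layer), _normalize(rendered_text)
    ).ratio()
    # If the text layer fits the rendering, keep it; otherwise use the OCR result.
    return text_layer if similarity >= threshold else rendered_text


if __name__ == "__main__":
    print(choose_text(PdfTextMode.COMPARE, "Invoice  No. 42", "invoice no. 42"))
    print(choose_text(PdfTextMode.COMPARE, "###", "Invoice No. 42"))
```

Comparing normalized strings keeps harmless differences such as extra spaces from triggering OCR, while a text layer that does not match what is visible on the page is replaced by the recognized text.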