PDF recognition, extraction of text and individual images Answered


I am evaluating ABBYY FineReader Engine 12 for Linux, and I would like to know how to extract the text and the individual image areas from a given PDF document, and then how to export the text to an HTML file and each image area to a JPG or PNG image file. This operation is available in the desktop application for macOS. Is it also possible with this SDK?

Moreover, is there any sample source code available?

Many thanks for an answer.

Kind regards




  • Nadezhda A. Solovyeva

    Hi Olivier,

    In FineReader Engine, you can use the following source code (based on the "Hello" sample):

    void processImage()
    {
        // Create document from image file
        displayMessage( L"Loading image..." );
        CBstr imagePath = Concatenate( GetSamplesFolder(), L"/SampleImages/Demo.tif" );
        CSafePtr<IFRDocument> frDocument = 0;
        CheckResult( FREngine->CreateFRDocumentFromImage( imagePath, 0, frDocument.GetBuffer() ) );

        // Recognize document
        displayMessage( L"Recognizing..." );
        CheckResult( frDocument->Process() );

        // Save results as HTML; the third argument (export parameters) is 0 here, i.e. defaults
        displayMessage( L"Saving results..." );
        CBstr exportPath = Concatenate( GetSamplesFolder(), L"/SampleImages/Demo.html" );
        CheckResult( frDocument->Export( exportPath, FEF_HTMLUnicodeDefaults, 0 ) );
    }



  • Olivier von Dach

    Hi Nadezhda,

    Thanks for your answer.

    I suppose I should continue my investigation using your sample source code.

    I am still wondering whether the image areas detected during recognition are also exported to individual files and saved in the sample folder next to Demo.html, with Demo.html referencing these individual image files. I cannot find any specific instruction for that, nor any setting for the image file format.

    Kind regards.


  • Nadezhda A. Solovyeva

    Hi Olivier,

    Please use the HTMLExportParams.PictureExportParams object to adjust the format of the exported pictures. The HTMLExportParams object is passed as the third parameter of the frDocument.Export method.
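
One way to check whether an exported HTML file really references separate image files is to scan it for `src` attributes in `<img>` tags. The following is only a minimal sketch in standard C++, independent of the FineReader Engine API: the helper name `listImageSources` is hypothetical, and the pattern assumes quoted attribute values (it would also pick up attributes such as `data-src`).

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper, not part of the FineReader Engine API:
// returns the value of every quoted src attribute found inside <img> tags.
std::vector<std::string> listImageSources( const std::string& html )
{
    // Case-insensitive match of <img ... src="value"> or <img ... src='value'>.
    static const std::regex imgSrc(
        R"rx(<img\b[^>]*?\bsrc\s*=\s*["']([^"']*)["'])rx",
        std::regex::icase );

    std::vector<std::string> sources;
    for( std::sregex_iterator it( html.begin(), html.end(), imgSrc ), end; it != end; ++it ) {
        sources.push_back( ( *it )[1].str() ); // capture group 1 is the attribute value
    }
    return sources;
}
```

Reading the exported Demo.html into a string and passing it to this function would give the list of image paths the page refers to, which can then be compared with the files actually written next to it.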
