Cracking the Code: PDF Text Extraction

Published January 1, 2023

Extracting text and other objects from PDFs can be a complicated yet necessary function for many types of businesses. Whether you need to extract data from PDF forms, pull out phone numbers or other personal identifiers to add to a database, or consolidate annotations, we’ve got the code to help you automate your PDF text extraction. In this post, we’ll take a look “under the hood” at a few of our text extraction code samples for the Adobe PDF Library.

Extracting Page Text

The ExtractText sample shows how to use the WordFinder to pull the text from a page’s content stream. The WordFinder configuration (PDWordFinderConfigRec) lets an application customize the WordFinder results, for example by ignoring invisible text.

The application can then iterate through the Word results, retrieving the position information as well as style information. Here’s how the code looks:
 

// Zero the configuration record and store its size.
PDWordFinderConfigRec wfConfig;
memset(&wfConfig, 0, sizeof(PDWordFinderConfigRec));
wfConfig.recSize = sizeof(PDWordFinderConfigRec);

// Skip text drawn with text render mode 3 (invisible text).
wfConfig.noTextRenderMode3 = true;

PDWordFinder wordFinder =
    PDDocCreateWordFinderEx(inAPDoc.getPDDoc(), WF_LATEST_VERSION, false, &wfConfig);

// Acquire the words found on the first page (page index 0).
PDWord wordList;
ASInt32 numWordsFound;
PDWordFinderAcquireWordList(wordFinder, 0, &wordList, NULL, NULL, &numWordsFound);

Extracting PDF Forms Data

The ExtractAcroFormFieldData sample shows how to extract text from the AcroForm fields in a PDF document. This is useful if you work with fillable PDF forms and need the text entered in those forms as a .JSON file that you can open in a text editor or web browser. Here’s what that portion of the code looks like:

const char *DEF_INPUT = "../../../../Resources/Sample_Input/ExtractAcroFormFieldData.pdf"; // Input document (PDF)
const char *DEF_OUTPUT = "ExtractAcroFormFieldData-out.json"; // Output document (JSON)

APDFLDoc inAPDoc(DEF_INPUT, true);

// This array will hold the JSON stream that we will print to the output JSON file.
json result = json::array();

// Create the TextExtract object
TextExtract textExtract(inAPDoc.getPDDoc());
std::vector<PDAcroFormExtractRec> extractedText = textExtract.GetAcroFormFieldData();

Extracting Text Patterns

The ExtractTextByPatternMatch sample searches the text of a document for patterns, such as phone numbers, and writes the matches to a .TXT file. For example, phone numbers in the U.S. are typically written ###-###-####, but that format varies worldwide. Rather than typing out a regular expression such as ((1-)?(\()?\d{3}(\))?(\s)?(-)?\d{3}-\d{4}) yourself, you can select it by name with ‘PHONE_PATTERN’. Here’s how that looks in the context of the code:

const char *DEF_INPUT = "../../../../Resources/Sample_Input/ExtractTextByPatternMatch.pdf"; // Input document (PDF)
const char *DEF_OUTPUT = "ExtractTextByPatternMatch-out.txt"; // Output document (TXT)

// This sample will look for text that matches a phone number pattern
const char *DEF_PATTERN = regexPattern[PHONE_PATTERN];
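To see what the phone number pattern actually matches, here is a minimal standalone sketch that applies the same regular expression with the C++ standard library instead of the Adobe PDF Library. The helper name findMatches and the input string are assumptions made for illustration; they are not part of the sample.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the Adobe PDF Library): returns every
// non-overlapping substring of `text` that matches `pattern`, in order.
std::vector<std::string> findMatches(const std::string &text, const std::string &pattern) {
    const std::regex re(pattern);
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}

// The phone number expression quoted in this post, written as a raw string
// so the backslashes do not need to be doubled.
const std::string kPhonePattern = R"(((1-)?(\()?\d{3}(\))?(\s)?(-)?\d{3}-\d{4}))";
```

Calling findMatches("Call 555-123-4567 or (800) 555-1234 for support.", kPhonePattern) returns the two phone numbers, 555-123-4567 and (800) 555-1234. Text that follows another convention, such as an international number, would need a different pattern.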

Annotation Consolidation

PDFs can contain thousands of annotations, and the ExtractTextFromAnnotations sample shows how to pull that information out and save it to a separate .JSON file. For example, contract negotiations may include comments and questions that have been accepted or rejected, and this sample can extract that data. Here’s what that portion of the code looks like:

const char *DEF_INPUT = "../../../../Resources/Sample_Input/sample_annotations.pdf"; // Input document (PDF)
const char *DEF_OUTPUT = "ExtractTextFromAnnotations-out.json"; // Output document (JSON)

// For each extracted annotation (indexed by textIndex):
json textObject = json::object();
textObject["annotation-type"] = extractedText[textIndex].type;
textObject["annotation-text"] = extractedText[textIndex].text;
result.push_back(textObject);
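The excerpt above builds one JSON object per annotation and appends it to an array. As a rough illustration of the output shape, here is a minimal sketch that produces the same two keys using only the C++ standard library. The AnnotationRecord struct and both function names are hypothetical stand-ins, and the hand-rolled serializer replaces the json type the sample uses, so treat it as a sketch rather than the sample's implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type for illustration; the sample's own type may differ.
struct AnnotationRecord {
    std::string type;
    std::string text;
};

// Escape the characters that must not appear raw inside a JSON string.
std::string escapeJson(const std::string &s) {
    static const char hex[] = "0123456789abcdef";
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
        case '"':  out += "\\\""; break;
        case '\\': out += "\\\\"; break;
        case '\n': out += "\\n";  break;
        case '\r': out += "\\r";  break;
        case '\t': out += "\\t";  break;
        default:
            if (c < 0x20) {
                // Remaining control characters become \u00XX escapes.
                out += "\\u00";
                out += hex[c >> 4];
                out += hex[c & 0x0F];
            } else {
                out += static_cast<char>(c);
            }
        }
    }
    return out;
}

// Serialize the records as a JSON array of objects with the same keys
// the excerpt uses: "annotation-type" and "annotation-text".
std::string toJsonArray(const std::vector<AnnotationRecord> &records) {
    std::string out = "[";
    for (std::size_t i = 0; i < records.size(); ++i) {
        if (i > 0) out += ",";
        out += "{\"annotation-type\":\"" + escapeJson(records[i].type) +
               "\",\"annotation-text\":\"" + escapeJson(records[i].text) + "\"}";
    }
    out += "]";
    return out;
}
```

In practice a JSON library handles escaping and formatting for you, which is why the sample delegates to one; the sketch only makes the structure of the resulting file explicit.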

We invite you to check out the Datalogics GitHub Repository for more information on the Adobe PDF Library and samples for the creation, modification, and management of PDF documents.

Start a free trial and discover how our PDF SDK can help you minimize your development time.