Datalogics New Text Extraction Code Samples
Get Information You Need with Precise & Simple Commands
Extracting text has become an essential part of the PDF workflow for many organizations. The Datalogics team has created new extraction samples for C++, .NET, and .NET Core to help you build precise workflows for your requirements.
Text Extraction Samples & Use Cases
Fillable Forms
The ExtractAcroFormFieldData sample shows how to extract text from the AcroForm fields in a PDF document. This is useful for those who work with fillable PDF forms and need to extract the text within the AcroForm fields as a .JSON file to use in a text editor or web browser.
Patterns
ExtractTextByPatternMatch searches for patterns within the text of a document, such as phone numbers, using simple named patterns, and extracts the matches into a .TXT file. For example, phone numbers in the U.S. are typically written as ###-###-####, but that format varies worldwide. This sample makes it easy to extract phone numbers by using 'PHONE_PATTERN' in the code instead of writing out the full regular expression ((1-)?(\()?\d{3}(\))?(\s)?(-)?\d{3}-\d{4}).
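The named-pattern idea can be sketched in plain C++ with std::regex. This is a minimal illustration that runs on text already extracted from a PDF, not the Datalogics sample itself. The constant PHONE_PATTERN below reuses the expression quoted above, and findMatches is a hypothetical helper introduced only for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Named constant holding the phone-number expression quoted in the text.
// Defining it here is an assumption made for this sketch.
const std::string PHONE_PATTERN =
    R"re(((1-)?(\()?\d{3}(\))?(\s)?(-)?\d{3}-\d{4}))re";

// Hypothetical helper: returns every non-overlapping match of `pattern`
// in `text`, in the order the matches appear.
std::vector<std::string> findMatches(const std::string& text,
                                     const std::string& pattern) {
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}
```

Calling findMatches(pageText, PHONE_PATTERN) keeps the calling code readable, and adding another pattern only means adding another named constant.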
The ExtractCJKTextByPatternMatch sample shows how to search for Unicode text such as Chinese, Japanese, and Korean (CJK) characters. With more than 1.5 billion people speaking those languages (and growing), organizations must be able to correctly extract text drawn from tens of thousands of distinct characters. The sample on GitHub uses a Korean character in its code.
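To show why CJK text needs Unicode-aware handling, here is a minimal sketch that decodes UTF-8 text into code points and counts those in a few common CJK blocks. It is not how the Datalogics sample works. The function names are hypothetical, the decoder assumes well-formed UTF-8, and the ranges cover only the CJK Unified Ideographs, Hiragana, Katakana, and Hangul Syllables blocks.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 bytes into Unicode code points.
// Assumption: the input is well-formed UTF-8; no validation is done.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> codePoints;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t len = 1;
        if (lead < 0x80)             { cp = lead;        len = 1; }
        else if ((lead >> 5) == 0x6) { cp = lead & 0x1F; len = 2; }
        else if ((lead >> 4) == 0xE) { cp = lead & 0x0F; len = 3; }
        else                         { cp = lead & 0x07; len = 4; }
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        codePoints.push_back(cp);
        i += len;
    }
    return codePoints;
}

// True for code points in a few common CJK blocks (not exhaustive).
bool isCJK(char32_t cp) {
    return (cp >= 0x4E00 && cp <= 0x9FFF)    // CJK Unified Ideographs
        || (cp >= 0x3040 && cp <= 0x30FF)    // Hiragana and Katakana
        || (cp >= 0xAC00 && cp <= 0xD7A3);   // Hangul Syllables
}

// Count the CJK characters in a UTF-8 encoded string.
std::size_t countCJK(const std::string& utf8) {
    std::size_t count = 0;
    for (char32_t cp : decodeUtf8(utf8)) {
        if (isCJK(cp)) ++count;
    }
    return count;
}
```

A single Korean syllable occupies three bytes in UTF-8, so matching byte by byte without decoding can split a character in the middle.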
Read PDF Text Extraction 101 for tips on text extraction.
Regions
ExtractTextByRegion extracts text from a specific region of a page in a PDF document and saves it to a .TXT file. For example, a company with thousands of invoices that share the same layout can pull the invoice number from the same region of each PDF, and the IRS can pull Social Security numbers from the relevant section of its 1040 forms.
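Region-based extraction comes down to keeping only the words whose bounding boxes fall inside a rectangle. The sketch below assumes the words and their boxes have already been obtained from a PDF library. Rect, Word, and textInRegion are hypothetical names for this illustration and are not part of the Datalogics API.

```cpp
#include <string>
#include <vector>

// Axis-aligned rectangle in page coordinates (assumed to be PDF points,
// with the origin at the bottom-left corner of the page).
struct Rect {
    double left, bottom, right, top;
};

// A word plus its bounding box, as a text extractor might report it.
struct Word {
    std::string text;
    Rect box;
};

// True when `box` lies entirely within `region`.
bool contains(const Rect& region, const Rect& box) {
    return box.left >= region.left && box.right <= region.right &&
           box.bottom >= region.bottom && box.top <= region.top;
}

// Join the words that fall inside `region`, preserving input order.
std::string textInRegion(const std::vector<Word>& words, const Rect& region) {
    std::string out;
    for (const Word& w : words) {
        if (!contains(region, w.box)) continue;
        if (!out.empty()) out += ' ';
        out += w.text;
    }
    return out;
}
```

Requiring full containment is a design choice. A looser test that accepts any overlap would also pick up words that straddle the region boundary.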
ExtractTextFromMultiRegions processes the PDF files in a folder, extracts text from multiple specific regions of their pages, and saves the text to a .CSV file. For example, this sample can create a single file with all the invoice numbers, dates, order numbers, customer IDs, and totals from the invoices in the folder, so you have all the data you need in one view.
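When values from several regions are written to one .CSV file, fields that contain commas or quotes need to be escaped. Below is a minimal sketch of that step, assuming the region text has already been extracted. csvEscape and toCsvRow are hypothetical helpers, and the quoting follows the common RFC 4180 convention rather than anything the Datalogics sample is known to do.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Quote a field only when it contains a comma, quote, or line break,
// doubling any embedded quotes (RFC 4180 style).
std::string csvEscape(const std::string& field) {
    if (field.find_first_of(",\"\r\n") == std::string::npos) {
        return field;
    }
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') out += '"';  // an embedded quote becomes ""
        out += c;
    }
    out += '"';
    return out;
}

// Build one CSV row (without a trailing newline) from the values
// extracted from each region of a single document.
std::string toCsvRow(const std::vector<std::string>& fields) {
    std::string row;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) row += ',';
        row += csvEscape(fields[i]);
    }
    return row;
}
```

One row per invoice, with one column per region, gives the single-view file described above.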
Consolidating Annotations
PDFs can contain thousands of annotations, and the ExtractTextFromAnnotations sample shows how to pull that information out and save it to a separate .JSON file. For example, contract negotiations may include comments and questions that have been accepted or rejected, and this sample can extract that data.
Style Preservation
The ExtractTextPreservingStyleAndPositionInfo sample extracts all text from the PDF, along with information about the text such as its style, color, font size, and position, and saves it to a .JSON file so the styling can be preserved.
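To give a sense of what style-preserving output could look like, here is a minimal sketch that serializes one run of text and its attributes to JSON. The StyledRun struct and the field names "text", "font", "size", "color", "x", and "y" are assumptions for this illustration. The actual schema produced by the Datalogics sample may differ.

```cpp
#include <sstream>
#include <string>

// Hypothetical record for one run of text and its style and position.
struct StyledRun {
    std::string text;
    std::string fontName;
    double fontSize;
    std::string colorHex;  // e.g. "#FF0000"
    double x;
    double y;
};

// Escape the characters this sketch handles: quote, backslash, newline.
// Other control characters are passed through unchanged.
std::string jsonEscape(const std::string& s) {
    std::string out;
    for (char c : s) {
        if (c == '"' || c == '\\') { out += '\\'; out += c; }
        else if (c == '\n')        { out += "\\n"; }
        else                       { out += c; }
    }
    return out;
}

// Serialize one run as a compact JSON object.
std::string toJson(const StyledRun& r) {
    std::ostringstream os;
    os << "{\"text\":\"" << jsonEscape(r.text) << "\","
       << "\"font\":\"" << jsonEscape(r.fontName) << "\","
       << "\"size\":" << r.fontSize << ","
       << "\"color\":\"" << jsonEscape(r.colorHex) << "\","
       << "\"x\":" << r.x << ",\"y\":" << r.y << "}";
    return os.str();
}
```

Keeping the style and position next to each run of text is what lets a downstream tool rebuild the original look of the page.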
Read Cracking the Code: PDF Text Extraction to see the code in action.
Check out the Datalogics GitHub Repository for more information on Adobe PDF Library and samples for the creation, modification and management of PDF documents.