Cracking the Code: Adding OCR to a PDF
Optical Character Recognition, or OCR, is the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file, meaning you can’t use a text editor to edit, search, or count the words in the image. OCR converts the image into a text document with its contents stored as text data, so it can be edited and searched.
One of the most common use cases for OCR is preparing documents for search or extracting their data for use in another process. With OCR PDF APIs, the text within these images becomes accessible without changing the look of the input document. Let’s walk through some of the key components of our OCR API in the Adobe PDF Library using .NET.
OCRParams ocrParams = new OCRParams();
ocrParams.PageSegmentationMode = PageSegmentationMode.Automatic;
ocrParams.Performance = Performance.BestAccuracy;
OCREngine ocrEngine = new OCREngine(ocrParams);
Setting the PageSegmentationMode to Automatic lets the OCR engine choose how to segment the page for text detection. The Performance parameter offers several levels for trading speed against accuracy. In this case, we are selecting the mode that produces the best accuracy, which is a common choice when you are unsure of the quality of your input document. OCRParams defaults to English; use the Languages parameter to select other languages. Multiple languages can be selected at the same time.
Once the OCREngine is configured, we can loop through the content of the document, identify the images, and apply the OCR processing:
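// Runs once per element index while looping over the page content.
Element e = content.GetElement(index);
if (e is Datalogics.PDFL.Image) {
    // Build a form holding the original image with the recognized text placed under it.
    Form form = ocrEngine.PlaceTextUnder((Image)e, doc);
    // Swap the image for the form at the same position in the content.
    content.RemoveElement(index);
    content.AddElement(form, index - 1);
}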
The image object is replaced by a form, which contains the original image with the identified text laid out behind it. Once this step is complete, you can save the resulting document, which contains both the original content and the identified text.
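The remove-then-add step in the snippet above is easy to get wrong by one position, so here is a minimal Java sketch of just the index logic. It does not use the Adobe PDF Library: ContentSketch, its methods, and the string "elements" are hypothetical stand-ins, and the sketch assumes that adding an element at a given index inserts it after that position, which is what the `index - 1` argument in the snippet implies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a page's content list; NOT the Adobe PDF Library API.
class ContentSketch {
    private final List<String> elements = new ArrayList<>();

    ContentSketch(List<String> initial) {
        elements.addAll(initial);
    }

    int numElements() {
        return elements.size();
    }

    String getElement(int index) {
        return elements.get(index);
    }

    void removeElement(int index) {
        elements.remove(index);
    }

    // Assumption: inserts AFTER the element at afterIndex; -1 means "before the first element".
    void addElement(String element, int afterIndex) {
        elements.add(afterIndex + 1, element);
    }

    List<String> snapshot() {
        return new ArrayList<>(elements);
    }

    // Replaces every element whose name starts with "image" by a "form(...)" wrapper,
    // keeping each replacement at the position the image occupied.
    static List<String> replaceImages(List<String> input) {
        ContentSketch content = new ContentSketch(input);
        for (int index = 0; index < content.numElements(); index++) {
            String e = content.getElement(index);
            if (e.startsWith("image")) {
                String form = "form(" + e + ")";
                content.removeElement(index);
                // After the removal, inserting after index - 1 lands back at index.
                content.addElement(form, index - 1);
            }
        }
        return content.snapshot();
    }
}
```

Because the replacement lands at the same index that the image occupied, the loop counter does not need adjusting, and the element count stays the same throughout the loop.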
As an added benefit, the .NET and Java interfaces support Dutch, English, French, German, Italian, Portuguese, and Spanish, with Chinese, Japanese, and Korean to be added shortly. Try it out yourself by requesting a free trial, and feel free to take a look at our full sample code for Java and .NET (which includes how to start this process from an image rather than a PDF) under the OpticalCharacterRecognition section inside Sample_Source.