What is PDF OCR?
Turning Scanned Documents & Images into Documents with Searchable Text
Optical Character Recognition, or OCR, is the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file, meaning you can’t use a text editor to edit, search, or count the words in the image. OCR converts the image into a text document with its contents stored as text data, so it can be edited and searched.
Let’s put it another, more interesting way.
Imagine teaching your computer to read pictures – that's OCR magic! OCR is like giving your machine superhero vision, enabling it to transform scanned documents into editable text. It's the language bridge between pixels and letters, unlocking a world where your computer can decipher the written word from images, opening up a realm of possibilities for digital exploration and accessibility.
History of OCR
The first omni-font OCR system was developed in 1974 by Ray Kurzweil, who started Kurzweil Computer Products, Inc. This innovative technology could recognize text that was printed in just about any font - like magic! Kurzweil realized that the best use for his technology would be a reading machine for people who are blind. He built a machine that combined OCR with text-to-speech so it could read printed text out loud. He sold his company to Xerox in 1980, because Xerox was interested in continuing to commercialize paper-to-computer text transformation.
OCR technology was not popularized until the early 1990s, when it was being used to digitize historical newspapers. OCR has seen several developments since then, and today it can give users conversions with near-perfect accuracy. Document processing workflows have never been the same, as they can be automated through advanced methods of OCR. In the Dark Ages, before this technology was available, documents had to be retyped manually, a painfully time-consuming process, not to mention a clunky one with a much higher chance of errors in the content. OCR for PDFs is widely accessible today and continues to increase efficiency for both personal and professional purposes. (source material: Adobe)
Why is PDF OCR important?
Most people live their lives in the digital realm - online, on social media, on their phones, etc. Everything these days is digital. Many businesses, however, are still using print media, including documents like contracts, statements, invoices, tax forms, and scanned legal files. Simply scanning those documents into images doesn't solve the problem: the text is still locked inside a picture, so someone has to read, retype, and file it by hand, which can be ridiculously time-consuming, not to mention boring as hell and a total waste of resources. Paper files also take up a lot of physical space and can be difficult to sort through and organize. OCR can save you from all that manual labor by streamlining and automating operations, enabling analytics on the extracted text, and improving productivity overall. TL;DR, it saves your business time and money so you can focus on what’s really important - sharing cat videos with your coworkers.
PDF OCR for Accessibility
In addition to the convenience of being able to scan and search text, OCR provides better access for users who are blind or visually impaired. The recognition process accounts for language and structure and can correct words that it sees as being spelled incorrectly, which helps convey the most accurate information to users. OCR systems built for accessibility pair recognition with a speech synthesizer that speaks the recognized text. Someone who is blind or visually impaired can then access scanned text using adaptive technology devices that magnify the computer screen, read the text aloud, or output Braille. Through the software, text from scanned documents can be read aloud according to each individual’s specifications.
Types of OCR
There are several different types of OCR software depending on their application and use. Here are some examples:
- Simple optical character recognition software stores different text and font image patterns as templates. This type of software uses pattern-matching algorithms to compare text images, character by character, against its internal database.
- Optical word recognition is when the system matches the text word by word instead of character by character. It is not possible for every font and handwriting style to be captured and stored as a template since there are unlimited amounts of both, so this solution has limits to it.
- Intelligent character recognition (ICR) software reads text the same way humans read it because when using machine learning, the machines can be trained to act like humans (scary, huh?). A machine learning system called a neural network studies text and processes images over and over, searching for image aspects such as lines, curves, loops, and intersections. It then combines the results of these different levels of analysis to reach a final conclusion.
- Intelligent word recognition technologies work on the same rules as ICR, but they study whole word images instead of preprocessing the images into characters.
- Optical mark recognition locates watermarks, logos, and other text symbols in a document.
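To make the simplest of these approaches concrete, here is a minimal sketch of template matching in C#. It is an illustration only, not how any particular OCR engine (including the Adobe PDF Library SDK) works internally: the class name TemplateOcr, the three tiny 3x5 templates, and the pixel-difference score are all assumptions made up for this example.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only: every name here is hypothetical and unrelated to any SDK.
public static class TemplateOcr
{
    // Hypothetical 3x5 templates: '#' is an ink pixel, '.' is background.
    private static readonly Dictionary<char, string[]> Templates =
        new Dictionary<char, string[]>
        {
            ['I'] = new[] { "###", ".#.", ".#.", ".#.", "###" },
            ['L'] = new[] { "#..", "#..", "#..", "#..", "###" },
            ['T'] = new[] { "###", ".#.", ".#.", ".#.", ".#." },
        };

    // Counts the pixels at which two glyphs differ.
    // Assumes both glyphs are the same size (3 columns by 5 rows here).
    private static int Distance(string[] a, string[] b)
    {
        int diff = 0;
        for (int row = 0; row < a.Length; row++)
            for (int col = 0; col < a[row].Length; col++)
                if (a[row][col] != b[row][col]) diff++;
        return diff;
    }

    // Returns the template character with the fewest differing pixels.
    public static char Recognize(string[] glyph)
    {
        char best = '?';
        int bestDistance = int.MaxValue;
        foreach (KeyValuePair<char, string[]> entry in Templates)
        {
            int d = Distance(glyph, entry.Value);
            if (d < bestDistance)
            {
                bestDistance = d;
                best = entry.Key;
            }
        }
        return best;
    }
}
```

Calling TemplateOcr.Recognize(new[] { "###", ".#.", ".#.", "...", ".#." }) returns 'T' even though one pixel of the stem is missing, because 'T' is still the closest stored template. It also shows the limit mentioned above: a font or handwriting style whose shapes are far from every stored template will be matched to the wrong character.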
How to Add OCR to a Scanned PDF
One of the most common use cases for OCR is in preparing scanned documents for searching or extracting the data within those documents to use in other applications. PDF OCR APIs make the text data within these images accessible without modifying the look of the input document.
Here’s a look at how the Adobe PDF Library SDK handles OCR, via a code sample for .NET:
OCRParams ocrParams = new OCRParams();
ocrParams.PageSegmentationMode = PageSegmentationMode.Automatic;
ocrParams.Performance = Performance.BestAccuracy;
OCREngine ocrEngine = new OCREngine(ocrParams);
Setting the PageSegmentationMode to Automatic lets the OCR engine choose how to segment the page for text detection. The Performance parameter offers multiple levels of granularity when trading speed against accuracy. In this case, we are selecting the mode that will output the best accuracy. This is a common setting when you are unsure of the quality of your input document. The OCRParams will default to English; you’ll need to use the Languages parameter to select other languages. Multiple languages can be selected at the same time.
Once the OCREngine is configured, we can loop through the content of the document, identify the images, and apply the OCR processing:
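// Assumes content is the page's Content and doc is the open Document.
for (int index = 0; index < content.NumElements; index++) {
    Element e = content.GetElement(index);
    if (e is Datalogics.PDFL.Image) {
        // Build a form holding the original image with the recognized text under it.
        Form form = ocrEngine.PlaceTextUnder((Image)e, doc);
        // Swap the image for the form at the same position in the content.
        content.RemoveElement(index);
        content.AddElement(form, index - 1);
    }
}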
The image object is replaced by a form containing the original image and the identified text behind it. Once this step is complete, the resulting document can be saved and it will contain the original content and the identified text.
Voila! Behold the magic of PDF OCR.
Test it out for yourself with a free trial of Adobe PDF Library SDK and start on your proof of concept today!