OCR stands for Optical Character Recognition, a technology used to convert scanned images, PDFs, and other documents into editable and searchable text. To get there, an OCR system typically works through the following steps:
- Pre-processing: The first step in OCR is to prepare the image or document for analysis. This may include cropping the image to remove any unnecessary background, adjusting the brightness and contrast to make the text more legible, and rotating the image to the correct orientation.
- Segmentation: The next step is to divide the image into small segments, usually called “blobs” or “regions,” that contain individual characters or words. This is done by analyzing the image and identifying areas that are likely to contain text based on factors such as color, texture, and size.
- Feature extraction: Once the image has been segmented, the next step is to extract features from each segment. These features are text characteristics that can be used to identify the characters or words. Standard features include the shape of the text, the spacing between characters, and the relative position of the text within the segment.
- Recognition: This step is where the OCR software compares the features of each segment to a database of known characters or words. The software assigns a probability to each candidate match and uses this information to pick the most likely one.
- Post-processing: After the text has been recognized, the final step is to clean up the output and correct any errors. This may include fixing any spelling mistakes, removing any unwanted characters, and formatting the text to make it more readable.
- Output: The OCR software outputs the recognized text as an editable document, which can be saved in various formats such as TXT, DOC, or PDF. The recognized text can then be used in applications such as search engines, machine learning, and data analytics.
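
To make the steps above a bit more concrete, here is a minimal toy sketch in plain Python. It is not how Tesseract or any real OCR engine works internally: the tiny 5x3 glyph bitmaps, the function names, and the thresholds are all assumptions made up for illustration, and the "database of known characters" is just four hard-coded templates. Still, it walks through the same stages in order: binarize, segment on blank columns, flatten each segment into a feature vector, score it against every template, and clean up low-confidence matches.

```python
# Toy OCR pipeline on a tiny in-memory "image" (pure Python, no libraries).
# All names, glyph bitmaps, and thresholds are illustrative assumptions.

# Known characters as 5x3 bitmaps ("#" = ink); a stand-in for a trained model.
TEMPLATES = {
    "H": ["#.#", "#.#", "###", "#.#", "#.#"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "O": ["###", "#.#", "#.#", "#.#", "###"],
}

def render(text, ink=30, paper=220):
    """Build a grayscale test image (rows of 0-255 ints) from the templates."""
    rows = [[] for _ in range(5)]
    for ch in text:
        for r in range(5):
            rows[r] += [ink if p == "#" else paper for p in TEMPLATES[ch][r]]
            rows[r].append(paper)  # one blank column between characters
    return rows

def preprocess(gray, threshold=128):
    """Pre-processing: binarize so dark pixels become 1 (ink), light ones 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def segment(binary):
    """Segmentation: split on blank columns (a vertical projection profile)."""
    width = len(binary[0])
    has_ink = [any(row[c] for row in binary) for c in range(width)]
    segments, start = [], None
    for c, inked in enumerate(has_ink + [False]):  # sentinel closes last run
        if inked and start is None:
            start = c
        elif not inked and start is not None:
            segments.append([row[start:c] for row in binary])
            start = None
    return segments

def extract_features(seg):
    """Feature extraction: flatten the pixel grid into one feature vector."""
    return [px for row in seg for px in row]

def recognize(features):
    """Recognition: score every template, return the best (char, score)."""
    best = ("?", 0.0)
    for ch, glyph in TEMPLATES.items():
        ref = [1 if p == "#" else 0 for row in glyph for p in row]
        if len(ref) != len(features):
            continue  # segment has a different size than this template
        score = sum(a == b for a, b in zip(ref, features)) / len(ref)
        if score > best[1]:
            best = (ch, score)
    return best

def postprocess(matches, min_score=0.8):
    """Post-processing: replace low-confidence characters with '?'."""
    return "".join(ch if score >= min_score else "?" for ch, score in matches)

def ocr(gray):
    """Run the whole pipeline and return the recognized text."""
    segments = segment(preprocess(gray))
    matches = [recognize(extract_features(s)) for s in segments]
    return postprocess(matches)

image = render("HILO")
image[0][1] = 30  # smudge one pixel to simulate scan noise in the "H"
print(ocr(image))  # prints HILO
```

The smudged "H" still matches its template on 14 of 15 pixels, which beats the next-best candidate ("O", 13 of 15), so the noisy character is still read correctly. Real systems replace the pixel-by-pixel comparison with far more robust features and trained models, but the flow of data between the stages is the same idea.
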
These steps might sound pretty simple, but there are many ways to implement each one, with varying performance and results. There are plenty of open-source and enterprise solutions out there. The most popular open-source OCR project is Tesseract. PaddleOCR is gaining popularity too, and it does better than Tesseract in some respects, such as reading text that is not in the correct orientation or extracting a table from an image.
In the coming weeks, I’ll try to cover the individual steps in more detail.
Thanks for reading!