OCR Prompts to extract the best possible text from even (nearly) illegible images.
You can prompt an LLM like ChatGPT with just one prompt to extract text from an image more thoroughly and accurately.
This approach involves iterative improvements and multiple techniques to extract the best possible text from an image. It is less a Chain of Thought (CoT), the reasoning-improvement method we typically use in problem-solving and decision-making, than a systematic troubleshooting and enhancement process that you prompt for.
OCR Prompt to better extract text from an image
Can you please carefully analyze the asset and transcribe it: it is very hard to read, and you must run multiple OCR passes carefully to get the perfect result we are looking for.
[#todo: Attach/upload your image with this prompt]
The model then performed several optimizations on its own to make the text extraction as accurate as possible.
What if ChatGPT does not apply OCR image preprocessing techniques itself?
Explicitly force your LLM to apply image preprocessing techniques itself with this OCR Follow-up Prompt to better extract text from your images.
OCR (Optical Character Recognition) is a technology used to convert different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data. It works by analyzing the shapes of characters in an image and converting them into machine-readable text.
Why LLMs like ChatGPT are well-suited for OCR-related tasks
ChatGPT OCR Process
The process the LLM follows is a systematic approach to text extraction and enhancement.
Here’s a breakdown of what GPT-4o did:
- Initial OCR Attempt: Ran a basic OCR on the original image to assess the text extraction quality.
- Image Preprocessing: Applied various image enhancement techniques, including contrast adjustment, sharpening, and converting the image to grayscale and black and white, to improve OCR accuracy.
- Different OCR Configurations: Experimented with custom OCR configurations to see if they would yield better results.
- Segmented OCR: Divided the image into smaller sections and performed OCR on each part individually to handle complex layouts better.
- Manual Transcription: Recognized that automated OCR wasn't sufficient and manually transcribed the text for accuracy.
Perform OCR on the image
import pytesseract
from PIL import Image
# Load the image
image_path = "/mnt/data/stencil.1024.png"
image = Image.open(image_path)
# Perform OCR on the image
text = pytesseract.image_to_string(image)
text
Optimization 1: Improving OCR results by applying image preprocessing techniques
Image Enhancement: Improved the image contrast and applied sharpening to make the text clearer.
- Enhance image for better OCR results
- Perform OCR on the processed image
# Improving OCR results by applying image preprocessing techniques
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
# Load image
image_path = "/mnt/data/stencil.1024.png"
image = Image.open(image_path)
# Enhance image for better OCR results
enhancer = ImageEnhance.Contrast(image)
image_enhanced = enhancer.enhance(2)
image_filtered = image_enhanced.filter(ImageFilter.SHARPEN)
# Perform OCR on the processed image
text = pytesseract.image_to_string(image_filtered)
text
Optimization 2: Convert image to grayscale
Grayscale Conversion: Converted the image to grayscale to simplify the text extraction process.
- Convert image to grayscale
- Perform OCR on the grayscale image
import pytesseract
# Convert image to grayscale
image_gray = image.convert('L')
# Perform OCR on the grayscale image
text = pytesseract.image_to_string(image_gray)
text
Optimization 3: Convert image to black and white
Black and White Conversion: Transformed the image to black and white to enhance text visibility.
- Convert image to black and white for better OCR results
- Perform OCR on the black and white image
# Convert image to black and white for better OCR results
image_bw = image_gray.point(lambda x: 0 if x < 128 else 255, '1')
# Perform OCR on the black and white image
text_bw = pytesseract.image_to_string(image_bw)
text_bw
Custom OCR Configurations:
Try a different OCR configuration to improve results
Applied custom configurations in OCR to improve text recognition.
# Try a different OCR configuration to improve results
custom_config = r'--oem 3 --psm 6'
text_custom = pytesseract.image_to_string(image_bw, config=custom_config)
text_custom
Further enhancement with adaptive thresholding
Adaptive Thresholding: Use adaptive thresholding to handle varying lighting conditions within the image.
- Convert to numpy array for OpenCV processing
- Apply adaptive thresholding
- Convert back to PIL image
- Perform OCR on the thresholded image
# Further enhancement with adaptive thresholding
import numpy as np
import cv2
# Convert to numpy array for OpenCV processing
image_cv = np.array(image_gray)
# Apply adaptive thresholding
image_adaptive_threshold = cv2.adaptiveThreshold(image_cv, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)
# Convert back to a PIL image
image_pil = Image.fromarray(image_adaptive_threshold)
# Perform OCR on the thresholded image
text_threshold = pytesseract.image_to_string(image_pil)
text_threshold
Segmentation: Extracting specific sections of the image
Split the image into smaller, more manageable sections and applied OCR individually to each section.
- Perform OCR on the left half
- Perform OCR on the right half
# Extracting specific sections of the image to improve OCR on smaller parts
left_half = image.crop((0, 0, image.width // 2, image.height))
right_half = image.crop((image.width // 2, 0, image.width, image.height))
# Perform OCR on the left half
text_left = pytesseract.image_to_string(left_half, config=custom_config)
# Perform OCR on the right half
text_right = pytesseract.image_to_string(right_half, config=custom_config)
text_combined = text_left + "\n" + text_right
text_combined
What if ChatGPT does not apply OCR image preprocessing itself?
You can explicitly task your LLM to apply these image preprocessing techniques if it does not do so on its own.
OCR Follow-up Prompt to force ChatGPT to extract text from an image better
- Image Enhancement: Improve the image contrast and apply sharpening to make the text clearer.
- Grayscale Conversion: Convert the image to grayscale to simplify the text extraction process.
- Black and White Conversion: Transform the image to black and white to enhance text visibility.
- Adaptive Thresholding: Use adaptive thresholding to handle varying lighting conditions within the image.
- Segmentation: Split the image into smaller, more manageable sections and apply OCR individually to each section.
- Custom OCR Configurations: Try applying different custom OCR configurations to improve text recognition.
To increase the accuracy of your results, you must try these techniques one after another, step by step.
Tasks =
Apply these techniques, step by step, to improve text recognition:
Image Enhancement: Improve the image contrast and apply sharpening to make the text clearer.
Grayscale Conversion: Convert the image to grayscale to simplify the text extraction process.
Black and White Conversion: Transform the image to black and white to enhance text visibility.
Adaptive Thresholding: Use adaptive thresholding to handle varying lighting conditions within the image.
Segmentation: Split the image into smaller, more manageable sections and apply OCR individually to each section.
Custom OCR Configurations: Try applying different custom OCR configurations to improve text recognition.
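Taken together, the tasks above amount to a small preprocessing pipeline. Here is a minimal sketch using Pillow, stopping just before the actual `pytesseract.image_to_string` call (which requires a local Tesseract install); the synthetic demo image stands in for your real scan:

```python
from PIL import Image, ImageDraw, ImageEnhance, ImageFilter

def preprocess_for_ocr(image: Image.Image) -> Image.Image:
    """Apply the enhancement steps above, in order."""
    # 1. Image enhancement: boost contrast and sharpen.
    enhanced = ImageEnhance.Contrast(image).enhance(2)
    enhanced = enhanced.filter(ImageFilter.SHARPEN)
    # 2. Grayscale conversion.
    gray = enhanced.convert("L")
    # 3. Black-and-white conversion with a fixed threshold.
    return gray.point(lambda x: 0 if x < 128 else 255, "1")

def split_for_segmented_ocr(image: Image.Image):
    """4. Segmentation: return left and right halves."""
    w, h = image.size
    return image.crop((0, 0, w // 2, h)), image.crop((w // 2, 0, w, h))

# Demo on a synthetic image (in practice: Image.open(your_scan)).
demo = Image.new("RGB", (200, 100), "white")
ImageDraw.Draw(demo).text((10, 40), "HELLO", fill="black")
bw = preprocess_for_ocr(demo)
left, right = split_for_segmented_ocr(bw)
# Each half would then go to pytesseract.image_to_string(half, config='--oem 3 --psm 6').
```

The order matters: thresholding a low-contrast image too early can erase faint strokes, which is why the enhancement steps come first.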
Despite these optimizations, the OCR results were often not perfect due to the complexity of the layout and text quality of the images I tested. Therefore, always manually check the transcribed text to ensure accuracy.
Why are LLMs like ChatGPT well-suited for OCR-related tasks?
LLMs are not used to perform the actual OCR process; instead, they enhance and process the output from traditional OCR tools.
Let me explain why.
Language Understanding
Deep Contextual Understanding. LLMs have a deep understanding of context, grammar, and semantics. This helps in interpreting ambiguous or partially recognized text from OCR, allowing for more accurate corrections and enhancements.
If OCR output contains an ambiguous word like "lead" (whose meaning depends on context), an LLM can use the surrounding text to determine whether it refers to the metal or to leading a group.
Post-OCR Processing
After the initial OCR process, LLMs can refine the text by correcting errors, formatting, and enhancing readability.
Error Correction
After OCR, LLMs can refine the text by correcting errors that are common with OCR, such as misrecognized characters or words.
OCR might read "I1" instead of "Il" in a word like "Illinois". An LLM can correct this based on its understanding of common word structures and context.
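A lightweight, rule-based version of this kind of correction can be sketched without an LLM at all, by testing common OCR confusion pairs against a known vocabulary. The confusion table and word list here are illustrative, not exhaustive:

```python
# Common OCR look-alike confusions (illustrative, not exhaustive).
CONFUSIONS = {"1": "l", "0": "o", "5": "s", "rn": "m"}
VOCAB = {"illinois", "modern", "sort"}  # stand-in for a real dictionary

def correct_word(word: str) -> str:
    """Return a vocabulary word if one confusion-pair swap produces it."""
    if word.lower() in VOCAB:
        return word  # already a known word; leave it alone
    for wrong, right in CONFUSIONS.items():
        candidate = word.replace(wrong, right)
        if candidate.lower() in VOCAB:
            return candidate
    return word  # no confident correction; keep the OCR output

print(correct_word("I1linois"))  # prints "Illinois"
```

An LLM does the same disambiguation implicitly, but over the whole sentence rather than one word at a time, which is what lets it handle cases a fixed lookup table cannot.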
Formatting Enhancement
LLMs can reapply or improve the formatting of the OCR output, making it more readable and closer to the original layout. For example, restoring bullet points, numbering in lists, or proper indentation that OCR might miss.
Handling Complex Layouts
LLMs can analyze complex document structures and help in reordering text, identifying headings, and maintaining the logical flow of information.
Structure Analysis
LLMs can analyze complex document layouts, such as multi-column formats, tables, and mixed content types. They help reorder text and maintain logical flow.
For example, if a document has side-by-side columns, an LLM can identify and restructure the text so it reads correctly from top to bottom.
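Assuming the two columns were OCR'd separately (as in the segmentation step earlier), stitching them back into reading order can be sketched as:

```python
def merge_columns(text_left: str, text_right: str) -> str:
    """Rejoin two independently OCR'd columns in reading order.

    For side-by-side columns, reading order is usually the entire left
    column top to bottom, then the entire right column.
    """
    left_lines = [ln for ln in text_left.splitlines() if ln.strip()]
    right_lines = [ln for ln in text_right.splitlines() if ln.strip()]
    return "\n".join(left_lines + right_lines)

merged = merge_columns("First paragraph,\ncontinued here.", "Second column\ntext.")
```

A naive line-by-line interleave of the two halves would scramble the text; an LLM can additionally detect when a sentence continues across the column break, which this fixed rule cannot.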
Headings and Sections
LLMs can identify headings, subheadings, and sections, ensuring the document's logical and visual structure is preserved. For example, recognizing a bold, large font as a heading and organizing the subsequent text as a section.
Adaptability
They can be trained or fine-tuned on specific types of documents or specialized vocabularies, improving accuracy for niche applications.
Specialized Training
LLMs can be fine-tuned on specific types of documents or specialized vocabularies, improving accuracy in niche applications.
Example: Legal documents, medical records, or technical manuals with specific terminologies can be better processed by an LLM trained on those specific vocabularies.
Language Adaptation
LLMs can handle multiple languages and dialects, adapting to the nuances of different linguistic structures. For example, they can process multilingual documents while preserving proper context and meaning in each language.
Overall Workflow Integration
- Initial OCR: A traditional OCR tool scans the document and provides a rough text output.
- LLM Enhancement: The LLM processes this output, correcting errors, enhancing readability, and restructuring text as needed.
- Contextual Refinement: The LLM uses its language understanding to correct ambiguous or context-dependent errors.
- Formatting and Layout: The LLM applies formatting and structure based on the document’s logical flow and visual layout.
- Adaptation for Specific Needs: If necessary, the LLM applies specialized training to better handle specific types of documents or languages.
By combining the strengths of OCR technology with the advanced capabilities of LLMs, the overall quality and usability of digitized text from images or scanned documents are significantly improved.
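The workflow above can be sketched as a two-stage pipeline. Both `run_ocr` and `llm_enhance` below are hypothetical stand-ins: a real implementation would call an OCR engine such as pytesseract for the first and send the raw text to a chat model for the second.

```python
def run_ocr(image_path: str) -> str:
    """Step 1: stand-in for a traditional OCR tool (e.g. pytesseract)."""
    return "lnvoice  t0tal : $42"  # deliberately noisy OCR-style output

def llm_enhance(raw_text: str) -> str:
    """Steps 2-4: stand-in for an LLM pass that corrects and refines.

    A real implementation would send raw_text to a chat model with a
    prompt like "Correct OCR errors and restore formatting".
    """
    fixes = {"lnvoice": "Invoice", "t0tal": "total"}
    words = [fixes.get(w, w) for w in raw_text.split()]
    return " ".join(words)

clean = llm_enhance(run_ocr("scan.png"))
```

The point of the split is that each stage does what it is good at: the OCR engine handles pixels-to-characters, and the language model handles characters-to-meaning.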
How GPTs enhance OCR-related tasks
LLMs like ChatGPT significantly enhance and improve the output from traditional OCR tools in several key ways.
Firstly, traditional OCR outputs often contain errors, such as misrecognized characters or words, particularly with poor image quality or unusual fonts. LLMs can correct these errors by leveraging context. For example, they can replace "rn" with "m" in words or correct "I" to "1" in numeric contexts.
Secondly, traditional OCR struggles with understanding context, leading to incorrect word choices or formatting issues. LLMs can make sense of ambiguous or partially recognized text, ensuring the output is meaningful and contextually accurate.
Thirdly, traditional OCR may not preserve the original document's layout, such as headings, bullet points, or tables. LLMs can analyze the content to identify and reapply these structural elements, improving readability and usability. For instance, they can recognize and reformat lists, tables, and sections.
Additionally, traditional OCR produces plain text without understanding the semantic relationships between different parts of the text. LLMs can enrich the text by adding semantic layers, such as identifying key points, summarizing content, or categorizing information.
Traditional OCR also struggles with multi-column layouts, embedded images, or mixed content types. LLMs can reorganize text extracted from complex layouts, ensuring the logical flow and integrity of the information. They can also identify and annotate embedded elements, providing a more comprehensive output.
Furthermore, traditional OCR may have limitations in handling multiple languages or dialects accurately. LLMs can be fine-tuned on specific languages or dialects, improving accuracy and providing better support for multilingual documents.
Finally, traditional OCR provides raw text without additional insights. LLMs can perform various natural language processing (NLP) tasks such as named entity recognition, sentiment analysis, and topic modeling on the OCR output, providing deeper insights and actionable information.
By integrating LLMs with traditional OCR tools, the overall accuracy, readability, and usability of the OCR output are significantly enhanced, making the extracted data more valuable and practical for various applications.