How to Extract the Text from Images using ChatGPT

OCR Prompts to Extract the Best Possible Text Using ChatGPT

OCR Prompts to extract the best possible text from even (nearly) illegible images.

Bastian Moritz

Jul 2024

With a single prompt, you can get an LLM like ChatGPT to extract text from an image more accurately and more thoroughly.

This approach involves iterative improvements and multiple techniques to extract the best possible text from an image. What you prompt for is less a Chain of Thought (CoT), the reasoning technique typically used in problem-solving and decision-making, and more a systematic troubleshooting and enhancement process.

OCR Prompt to better extract text from an image

Can you please carefully analyze the asset and transcribe it: it is very hard to read and you must run multiple OCR carefully to get the perfect result we are looking for. [#todo: Attach/upload your image with this prompt]

In my test, ChatGPT then performed several optimizations on its own to make the text extraction as accurate as possible.

ChatGPT does not apply OCR image preprocessing techniques itself?

If ChatGPT does not apply image preprocessing techniques on its own, explicitly instruct it to do so with the OCR follow-up prompt further down in this article.

OCR (Optical Character Recognition) is a technology used to convert different types of documents, such as scanned paper documents, PDFs, or images captured by a digital camera, into editable and searchable data. It works by analyzing the shapes of characters in an image and converting them into machine-readable text.

ChatGPT OCR Process

The process the LLM follows is a systematic approach to text extraction and enhancement.

Here’s a breakdown of what GPT-4o did:

Initial OCR Attempt: Ran a basic OCR on the original image to assess the text extraction quality.
Image Preprocessing: Applied various image enhancement techniques, including contrast adjustment, sharpening, and converting the image to grayscale and black and white, to improve OCR accuracy.
Different OCR Configurations: Experimented with custom OCR configurations to see if they would yield better results.
Segmented OCR: Divided the image into smaller sections and performed OCR on each part individually to handle complex layouts better.
Manual Transcription: Recognized that automated OCR wasn't sufficient and manually transcribed the text for accuracy.

Perform OCR on the image

import pytesseract
from PIL import Image
# Load the image
image_path = "/mnt/data/stencil.1024.png"
image = Image.open(image_path)
# Perform OCR on the image
text = pytesseract.image_to_string(image)
text

Optimization 1: Improving OCR results by applying image preprocessing techniques

Image Enhancement: Improved the image contrast and applied sharpening to make the text clearer.

Enhance image for better OCR results
Perform OCR on the processed image

# Improving OCR results by applying image preprocessing techniques
import pytesseract
from PIL import Image, ImageEnhance, ImageFilter
# Load image
image_path = "/mnt/data/stencil.1024.png"
image = Image.open(image_path)
# Enhance image for better OCR results
enhancer = ImageEnhance.Contrast(image)
image_enhanced = enhancer.enhance(2)
image_filtered = image_enhanced.filter(ImageFilter.SHARPEN)
# Perform OCR on the processed image
text = pytesseract.image_to_string(image_filtered)
text
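
To make the contrast step less of a black box, here is a dependency-free sketch of roughly what a contrast factor does to grayscale values: each pixel is pushed away from the image's mean brightness and then clipped to the 0-255 range. The helper name enhance_contrast and the tiny pixel list are illustrative assumptions; Pillow's ImageEnhance.Contrast handles rounding and colour modes in its own way.

```python
def enhance_contrast(pixels, factor):
    """Scale each grayscale value's distance from the mean by `factor`.

    `pixels` is a flat list of 0-255 grayscale values (hypothetical input).
    A factor of 1.0 leaves the values unchanged; 2.0 doubles the contrast.
    """
    mean = sum(pixels) / len(pixels)
    result = []
    for value in pixels:
        stretched = mean + factor * (value - mean)
        # Clip to the valid 8-bit range before rounding.
        result.append(int(round(min(255.0, max(0.0, stretched)))))
    return result


# Faint gray text (110) on a slightly lighter background (150).
faint = [110, 150, 150, 150]
print(enhance_contrast(faint, 2))  # [80, 160, 160, 160]
```

The gap between text and background grows from 40 to 80 gray levels, which is the kind of separation that tends to help an OCR engine.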

Optimization 2: Convert image to grayscale

Grayscale Conversion: Converted the image to grayscale to simplify the text extraction process.

Convert image to grayscale
Perform OCR on the grayscale image

import pytesseract
# Convert image to grayscale
image_gray = image.convert('L')

# Perform OCR on the grayscale image
text = pytesseract.image_to_string(image_gray)
text
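
For intuition, the sketch below shows how a single colour pixel could be reduced to one gray value. Pillow documents the ITU-R 601-2 luma weights for convert('L'); the helper name to_gray and the plain rounding are my own simplifications, so results may differ by a gray level from Pillow's.

```python
def to_gray(rgb):
    """Approximate the luma of one (R, G, B) pixel with ITU-R 601-2 weights."""
    r, g, b = rgb
    return round((r * 299 + g * 587 + b * 114) / 1000)


print(to_gray((255, 0, 0)))      # pure red   -> 76
print(to_gray((0, 255, 0)))      # pure green -> 150
print(to_gray((0, 0, 255)))      # pure blue  -> 29
print(to_gray((255, 255, 255)))  # white      -> 255
```

Because green carries the most weight and blue the least, blue text on a black background ends up with very little contrast after conversion, which is one reason grayscale alone does not always help.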

Optimization 3: Convert image to black and white

Black and White Conversion: Transformed the image to black and white to enhance text visibility.

Convert image to black and white for better OCR results
Perform OCR on the black and white image

# Convert image to black and white for better OCR results 
image_bw = image_gray.point(lambda x: 0 if x < 128 else 255, '1')

# Perform OCR on the black and white image
text_bw = pytesseract.image_to_string(image_bw)
text_bw
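
The point() call above applies one fixed cut-off of 128 to every pixel. Here is the same rule as a small, dependency-free sketch on a made-up 2x4 image; the name binarize and the sample values are assumptions for illustration.

```python
def binarize(gray_rows, threshold=128):
    """Map every grayscale value below `threshold` to 0 (black), else 255."""
    return [[0 if value < threshold else 255 for value in row] for row in gray_rows]


# Dark strokes (40, 90, 127) on a light background (200, 230).
sample = [
    [200, 40, 90, 230],
    [200, 127, 128, 230],
]
for row in binarize(sample):
    print(row)
# [255, 0, 0, 255]
# [255, 0, 255, 255]
```

Note how 127 and 128 land on opposite sides of the cut-off. A single global threshold is simple, but it can erase text in images with uneven lighting, which is what the adaptive thresholding step later in this article is meant to address.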

Custom OCR Configurations:

Try a different OCR configuration to improve results

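Applied custom configurations in OCR to improve text recognition. In the configuration below, --oem 3 selects Tesseract's default OCR engine mode, and --psm 6 tells Tesseract to assume a single uniform block of text.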

# Try a different OCR configuration to improve results
custom_config = r'--oem 3 --psm 6'
text_custom = pytesseract.image_to_string(image_bw, config=custom_config)
text_custom
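
If you want to compare several configurations rather than just one, a small helper can build the config strings for you. This is a sketch: build_tesseract_config is a hypothetical name, and the value ranges (OCR engine modes 0-3, page segmentation modes 0-13) reflect Tesseract's documented options as I understand them.

```python
def build_tesseract_config(oem=3, psm=6):
    """Return a config string such as '--oem 3 --psm 6'."""
    if oem not in range(0, 4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(0, 14):
        raise ValueError("psm must be between 0 and 13")
    return f"--oem {oem} --psm {psm}"


# Page segmentation modes that may be worth comparing:
# 4 = single column of variable-size text, 6 = single uniform block, 11 = sparse text.
candidate_configs = [build_tesseract_config(psm=psm) for psm in (4, 6, 11)]
print(candidate_configs)
# ['--oem 3 --psm 4', '--oem 3 --psm 6', '--oem 3 --psm 11']
```

Each string can be passed as the config argument of pytesseract.image_to_string, so you can run the same image through all of them and keep the most readable result.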

Further enhancement with adaptive thresholding

Adaptive Thresholding: Use adaptive thresholding to handle varying lighting conditions within the image.

Convert to numpy array for OpenCV processing
Apply adaptive thresholding
Convert back to PIL image
Perform OCR on the thresholded image

# Further enhancement with adaptive thresholding
import numpy as np
import cv2

# Convert to numpy array for OpenCV processing
image_cv = np.array(image_gray)

# Apply adaptive thresholding
image_adaptive_threshold = cv2.adaptiveThreshold(image_cv, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 11, 2)

# Convert back to PIL image
image_pil = Image.fromarray(image_adaptive_threshold)

# Perform OCR on the thresholded image
text_threshold = pytesseract.image_to_string(image_pil)

text_threshold
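
To show why a local threshold copes with uneven lighting, here is a simplified, dependency-free sketch. It compares each pixel with the plain mean of its neighbourhood minus a constant, whereas the OpenCV call above uses a Gaussian-weighted mean; the function name, the 3-pixel block size, and the border clipping are assumptions made to keep the example short.

```python
def adaptive_threshold(gray_rows, block_size=3, c=2):
    """Binarize each pixel against the mean of its local neighbourhood.

    A pixel becomes 255 if it is brighter than (local mean - c), else 0.
    Neighbourhoods are clipped at the image border (a simplification).
    """
    height, width = len(gray_rows), len(gray_rows[0])
    radius = block_size // 2
    result = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                gray_rows[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            out_row.append(255 if gray_rows[y][x] > local_mean - c else 0)
        result.append(out_row)
    return result


# One row with fading light: strokes at index 1 (170) and index 6 (110).
row = [[220, 170, 220, 200, 180, 160, 110, 160]]
print(adaptive_threshold(row))
# [[255, 0, 255, 255, 255, 255, 0, 255]]
```

Both strokes come out black even though the first one (170) is brighter than the background around the second (160), so no single global cut-off could separate them.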

Segmentation: Extracting specific sections of the image

Split the image into smaller, more manageable sections and applied OCR individually to each section.

Perform OCR on the left half
Perform OCR on the right half

# Extracting specific sections of the image to improve OCR on smaller parts
left_half = image.crop((0, 0, image.width // 2, image.height))
right_half = image.crop((image.width // 2, 0, image.width, image.height))

# Perform OCR on the left half
text_left = pytesseract.image_to_string(left_half, config=custom_config)

# Perform OCR on the right half
text_right = pytesseract.image_to_string(right_half, config=custom_config)

text_combined = text_left + "\n" + text_right
text_combined
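
Splitting into two halves generalizes to any grid. The sketch below only computes the crop boxes in the (left, upper, right, lower) order that PIL's crop() expects; tile_boxes is a hypothetical helper, and the right grid size depends on your image.

```python
def tile_boxes(width, height, cols=2, rows=1):
    """Return (left, upper, right, lower) boxes covering the image in a grid."""
    boxes = []
    for row in range(rows):
        for col in range(cols):
            left = col * width // cols
            right = (col + 1) * width // cols
            upper = row * height // rows
            lower = (row + 1) * height // rows
            boxes.append((left, upper, right, lower))
    return boxes


print(tile_boxes(1024, 512))
# [(0, 0, 512, 512), (512, 0, 1024, 512)]
print(tile_boxes(10, 10, cols=3))
# [(0, 0, 3, 10), (3, 0, 6, 10), (6, 0, 10, 10)]
```

Each box can be passed to image.crop(box) before running OCR on the tile. Keep in mind that a straight cut can slice through characters, so it is worth checking where the boundaries fall on your image.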

ChatGPT does not apply OCR image preprocessing itself?

If your LLM does not apply these image preprocessing techniques on its own, you can explicitly instruct it to do so.

OCR Follow-up Prompt to force ChatGPT to extract text from an image better

To increase accuracy of your results you must try these techniques to improve text recognition one-after-another and step-by-step.
Tasks = 
Apply these custom configurations in OCR to improve text recognition:
Image Enhancement: Improve the image contrast and apply sharpening to make the text clearer.
Grayscale Conversion: Convert the image to grayscale to simplify the text extraction process.
Black and White Conversion: Transform the image to black and white to enhance text visibility.
Adaptive Thresholding: Use adaptive thresholding to handle varying lighting conditions within the image.
Segmentation: Split the image into smaller, more manageable sections and apply OCR individually to each section.
Custom OCR Configurations: Try and apply different custom configurations in OCR to improve text recognition.

Despite these optimizations, the OCR results were often not perfect due to the complexity of the layout and text quality of the images I tested. Therefore, always manually check the transcribed text to ensure accuracy.

Why are LLMs like ChatGPT well-suited for OCR-related tasks?

LLMs are not used to perform the actual OCR process; instead, they enhance and process the output from traditional OCR tools.

Let me explain why.

Language Understanding

Deep Contextual Understanding. LLMs understand context, grammar, and semantics. This helps them interpret ambiguous or partially recognized text from OCR and correct it more accurately.

For example, if the OCR output contains an ambiguous word such as "lead", an LLM can use the surrounding text to determine whether it refers to the metal or to leading a group.

Post-OCR Processing

After the initial OCR process, LLMs can refine the text by correcting errors, formatting, and enhancing readability.

Error Correction

After OCR, LLMs can refine the text by correcting errors that are common with OCR, such as misrecognized characters or words.

OCR might read "Il" as "I1" (a digit one in place of the lowercase L) in a word like "Illinois", producing "I1linois". An LLM can correct this based on its understanding of common word structures and context.
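
An LLM makes this kind of correction from context. As a rough, rule-based stand-in for the idea, the sketch below swaps look-alike digits for letters only when the result is a known word. The mapping, the function name fix_word, and the one-word vocabulary are all illustrative assumptions.

```python
import re

# Digits that OCR commonly confuses with letters (illustrative, not exhaustive).
DIGIT_TO_LETTER = {"1": "l", "0": "o", "5": "s"}


def fix_word(token, vocabulary):
    """Replace look-alike digits in `token` if that yields a known word."""
    if token.isdigit() or not re.search(r"\d", token):
        return token  # pure numbers and clean words are left alone
    candidate = "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)
    if candidate.lower() not in vocabulary:
        return token  # no confident fix, keep the OCR output
    # Capitalise when the OCR token started with an uppercase letter.
    return candidate.capitalize() if token[0].isupper() else candidate


vocabulary = {"illinois", "context"}
print(fix_word("I1linois", vocabulary))  # Illinois
print(fix_word("c0ntext", vocabulary))   # context
print(fix_word("2024", vocabulary))      # 2024
```

The vocabulary check is what keeps the rule from damaging real numbers or product codes; an LLM effectively replaces that lookup with its knowledge of language.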

Formatting Enhancement

LLMs can reapply or improve the formatting of the OCR output, making it more readable and closer to the original layout. For example, restoring bullet points, numbering in lists, or proper indentation that OCR might miss.
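
As a small illustration of restoring list formatting, the sketch below normalizes a few bullet symbols that may survive OCR in inconsistent forms. The set of symbols and the name normalize_bullets are assumptions; an LLM can go further and detect list items that lost their marker entirely.

```python
import re

# Leading bullet-like symbols followed by whitespace (illustrative set).
BULLET_PATTERN = re.compile(r"^\s*[•·*»\-]\s+")


def normalize_bullets(text):
    """Rewrite lines that start with a bullet-like symbol as '- item'."""
    lines = []
    for line in text.splitlines():
        lines.append(BULLET_PATTERN.sub("- ", line))
    return "\n".join(lines)


raw = "Shopping\n• milk\n ·  eggs\n* bread"
print(normalize_bullets(raw))
```

Lines without a leading bullet symbol, such as the "Shopping" heading here, pass through unchanged.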

Handling Complex Layouts

LLMs can analyze complex document structures and help in reordering text, identifying headings, and maintaining the logical flow of information.

Structure Analysis

LLMs can analyze complex document layouts, such as multi-column formats, tables, and mixed content types. They help reorder text and maintain logical flow.

For example, if a document has side-by-side columns, an LLM can identify and restructure the text so it reads correctly from top to bottom.
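
When the OCR tool also reports where each text block sits on the page, the same reordering can be approximated geometrically. The sketch below assumes exactly two columns split at the page centre and blocks given as (x, y, text) tuples; both the data layout and the name reading_order are hypothetical.

```python
def reading_order(blocks, page_width):
    """Sort OCR blocks so the left column is read fully before the right one.

    Each block is an (x, y, text) tuple, where x and y are the top-left
    corner in pixels. Assumes exactly two columns split at the page centre.
    """
    middle = page_width / 2
    left = sorted((b for b in blocks if b[0] < middle), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= middle), key=lambda b: b[1])
    return [text for _, _, text in left + right]


# Blocks in the order a row-by-row scan might produce them.
blocks = [
    (40, 100, "Left column, first paragraph."),
    (540, 100, "Right column, first paragraph."),
    (40, 300, "Left column, second paragraph."),
    (540, 300, "Right column, second paragraph."),
]
for text in reading_order(blocks, page_width=1000):
    print(text)
```

An LLM has no coordinates to work with when it only sees plain text, so it has to infer the same order from whether the sentences follow on from each other.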

Headings and Sections

LLMs can identify headings, subheadings, and sections, ensuring the document's logical and visual structure is preserved. For example, they can treat a line that was set in a bold, large font as a heading and organize the text that follows as its section.

Adaptability

LLMs can be adapted to specific types of documents, specialized vocabularies, and different languages.

Specialized Training

LLMs can be fine-tuned on specific types of documents or specialized vocabularies, improving accuracy in niche applications.

Example: Legal documents, medical records, or technical manuals with specific terminologies can be better processed by an LLM trained on those specific vocabularies.

Language Adaptation

LLMs can handle multiple languages and dialects, adapting to the nuances of different linguistic structures. For example, they can process multilingual documents while preserving the proper context and meaning in each language.

Overall Workflow Integration

Initial OCR: A traditional OCR tool scans the document and provides a rough text output.
LLM Enhancement: The LLM processes this output, correcting errors, enhancing readability, and restructuring text as needed.
Contextual Refinement: The LLM uses its language understanding to correct ambiguous or context-dependent errors.
Formatting and Layout: The LLM applies formatting and structure based on the document’s logical flow and visual layout.
Adaptation for Specific Needs: If necessary, the LLM applies specialized training to better handle specific types of documents or languages.
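
The workflow above can be sketched as a tiny pipeline: run OCR once, then pass the text through a list of post-processing stages. Everything here is a stand-in so the example runs without Tesseract or an LLM API; in practice the OCR function would call your OCR tool and the stages would call your LLM.

```python
from typing import Callable, Iterable


def run_pipeline(image, ocr: Callable, stages: Iterable[Callable[[str], str]]) -> str:
    """Run OCR once, then pass the text through each post-processing stage."""
    text = ocr(image)
    for stage in stages:
        text = stage(text)
    return text


# Hypothetical stand-ins for the real OCR tool and LLM calls.
def fake_ocr(image):
    return "  the  qu1ck brown fox \n"


def fix_characters(text):   # placeholder for "LLM Enhancement"
    return text.replace("qu1ck", "quick")


def tidy_whitespace(text):  # placeholder for "Formatting and Layout"
    return " ".join(text.split())


result = run_pipeline("scan.png", fake_ocr, [fix_characters, tidy_whitespace])
print(result)  # the quick brown fox
```

Keeping the stages as separate functions makes it easy to add, remove, or reorder steps and to inspect the text after each one.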

By combining the strengths of OCR technology with the advanced capabilities of LLMs, the overall quality and usability of digitized text from images or scanned documents are significantly improved.

How GPTs enhance OCR-related tasks

LLMs like ChatGPT significantly enhance and improve the output from traditional OCR tools in several key ways.

Firstly, traditional OCR outputs often contain errors, such as misrecognized characters or words, particularly with poor image quality or unusual fonts. LLMs can correct these errors by leveraging context. For example, they can replace "rn" with "m" in words or correct "I" to "1" in numeric contexts.
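
The "numeric context" idea can be approximated with a simple rule: swap look-alike letters for digits only inside tokens that already contain a digit. This is a sketch with an illustrative character mapping and a hypothetical function name; an LLM decides this from the meaning of the surrounding text rather than from a pattern.

```python
import re

# Letters that OCR commonly confuses with digits (illustrative, not exhaustive).
LETTER_TO_DIGIT = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})


def fix_numeric_tokens(text):
    """Swap look-alike letters for digits, but only inside number-like tokens."""
    def repair(match):
        token = match.group(0)
        # Only touch tokens that already contain at least one real digit.
        return token.translate(LETTER_TO_DIGIT) if re.search(r"\d", token) else token

    return re.sub(r"\b[0-9OoIl][0-9OoIl.,/-]*\b", repair, text)


print(fix_numeric_tokens("Invoice 2O24-O7-I4 total 1O5.00"))
# Invoice 2024-07-14 total 105.00
print(fix_numeric_tokens("I paid 1O dollars"))
# I paid 10 dollars
```

Ordinary words such as "Invoice" or the pronoun "I" are left alone because they contain no digit.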

Secondly, traditional OCR struggles with understanding context, leading to incorrect word choices or formatting issues. LLMs can make sense of ambiguous or partially recognized text, ensuring the output is meaningful and contextually accurate.

Thirdly, traditional OCR may not preserve the original document's layout, such as headings, bullet points, or tables. LLMs can analyze the content to identify and reapply these structural elements, improving readability and usability. For instance, they can recognize and reformat lists, tables, and sections.

Additionally, traditional OCR produces plain text without understanding the semantic relationships between different parts of the text. LLMs can enrich the text by adding semantic layers, such as identifying key points, summarizing content, or categorizing information.

Traditional OCR also struggles with multi-column layouts, embedded images, or mixed content types. LLMs can reorganize text extracted from complex layouts, ensuring the logical flow and integrity of the information. They can also identify and annotate embedded elements, providing a more comprehensive output.

Furthermore, traditional OCR may have limitations in handling multiple languages or dialects accurately. LLMs can be fine-tuned on specific languages or dialects, improving accuracy and providing better support for multilingual documents.

Finally, traditional OCR provides raw text without additional insights. LLMs can perform various natural language processing (NLP) tasks such as named entity recognition, sentiment analysis, and topic modeling on the OCR output, providing deeper insights and actionable information.

By integrating LLMs with traditional OCR tools, the overall accuracy, readability, and usability of the OCR output are significantly enhanced, making the extracted data more valuable and practical for various applications.

Contributors

Bastian Moritz

Growth Advisor

Published

Jul 2024

Latest Update

2024-07-04
