Multi-Column Table OCR

by Adrian Rosebrock on February 28, 2022

 
 
In this tutorial, you will:
  1. Discover a technique for associating rows and columns together
  2. Learn how to detect tables of text/data in an image
  3. Extract the detected table from an image
  4. OCR the text in the table
  5. Apply hierarchical agglomerative clustering (HAC) to associate rows and columns
  6. Build a Pandas DataFrame from the OCR’d data

This tutorial is the first in a 4-part series on OCR with Python:

  1. Multi-Column Table OCR (this tutorial)
  2. OpenCV Fast Fourier Transform (FFT) for Blur Detection in Images and Video Streams
  3. OCR’ing Video Streams
  4. Improving Text Detection Speed with OpenCV and GPUs

To learn how to OCR multi-column tables, just keep reading.


Multi-Column Table OCR

Perhaps one of the more challenging applications of optical character recognition (OCR) is how to successfully OCR multi-column data (e.g., spreadsheets, tables, etc.).

On the surface, OCR’ing tables seems like it should be an easier problem, right? After all, given that the document has a guaranteed structure, shouldn’t we be able to leverage that a priori knowledge and OCR each column in the table?

In most cases, yes, that would be the case. But unfortunately, we have a few problems to address:

  1. Tesseract isn’t very good at multi-column OCR, especially if your image is noisy.
  2. You may first need to detect the table in the image before you can OCR it.
  3. Your OCR engine (Tesseract, cloud-based, etc.) may correctly OCR the text but be unable to associate the text into columns/rows.

So, while OCR’ing multi-column data may appear to be an easy task, it’s far harder since we may need to be responsible for associating text into columns and rows — and as you’ll see, that is indeed the most complex part of our implementation.

The good news is that while OCR’ing multi-column data is certainly more demanding than other OCR tasks, it’s not a “hard problem,” provided you bring the right algorithms and techniques to the project.

In this tutorial, you’ll learn some tips and tricks to OCR multi-column data, and most importantly, associate rows/columns of text together.

OCR’ing Multi-Column Data

In the first part of this tutorial, we’ll discuss the basic process behind our multi-column OCR algorithm, the exact algorithm we’ll use to OCR multi-column data. It serves as a great starting point, and we recommend it whenever you need to OCR a table.

From there, we’ll review the directory structure for our project. We’ll also install any additional required Python packages for the tutorial.

With our development environment fully configured, we can move on to our implementation. We’ll spend most of the lesson here, covering our multi-column OCR algorithm’s details and inner workings.

We’ll wrap up the lesson by applying our Python implementation to:

  1. Detect a table of text in an image
  2. Extract the table
  3. OCR the table
  4. Build a Pandas DataFrame from the table to process it, query it, etc.

Our Multi-Column OCR Algorithm

Our multi-column OCR algorithm is a multi-step process. To start, we need to accept an input image containing a table, spreadsheet, etc. (Figure 1, left). Given this image, we then need to extract the table itself (right).

Figure 1: Left: Our input image containing statistics from the back of a Michael Jordan baseball card (yes, baseball — that’s not a typo). Right: Our goal is to detect and extract the table of data from the input image.

Once we have the table, we can apply OCR and text localization to generate the (x, y)-coordinates for the text bounding boxes. It’s paramount that we obtain these bounding box coordinates.

To associate columns of text together, we need to group pieces of text based on their starting x-coordinate.

Why starting x-coordinate? Well, keep in mind the structure of a table, spreadsheet, etc. Each column’s text will have near-identical starting x-coordinates because they belong to the same column (take a second now to convince yourself that the statement is true).

We can thus exploit that knowledge and then cluster groups of text together with near-identical x-coordinates.

But the question remains, how do we do the actual grouping?

The answer is to use a special clustering algorithm called hierarchical agglomerative clustering (HAC). If you’ve ever taken a data science, machine learning, or text mining class, then you’ve likely encountered HAC before.

A full review of the HAC algorithm is outside the scope of this tutorial. Still, the general idea is that we take a “bottom-up” approach, starting with our initial data points (i.e., the x-coordinates of the text bounding boxes) as individual clusters, each containing only one observation.

We then compute the distance between the coordinates and start grouping observations with a distance < T, where T is a pre-defined threshold.

At each iteration of HAC, we choose the two clusters with the smallest distance and merge them, again, provided that the distance threshold requirement is met.

We continue clustering until no other clusters can be formed, typically because no two clusters fall within our threshold requirement.

In the context of multi-column OCR, applying HAC to the x-coordinate value results in clusters that have identical or near-identical x-coordinates. And since we assume that text belonging to the same column will have similar/near-identical x-coordinates, we can associate columns together.

While this approach to multi-column OCR may seem complicated, it’s straightforward, especially since we’ll be able to utilize the AgglomerativeClustering implementation in the scikit-learn library.
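
If HAC is new to you, a toy example may help. The following sketch is my own illustration (not part of this post’s downloadable code); it clusters a handful of synthetic x-coordinates with scikit-learn’s AgglomerativeClustering, the same class we’ll use later:

# toy illustration of HAC: cluster synthetic x-coordinates
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# pretend these are the starting x-coordinates of OCR'd words; they
# form three "columns" near x=10, x=100, and x=205
xs = np.array([9, 11, 12, 98, 101, 103, 204, 206]).reshape(-1, 1)

# keep merging clusters until no two clusters are within 25 pixels
hac = AgglomerativeClustering(n_clusters=None, linkage="complete",
	distance_threshold=25.0)
labels = hac.fit_predict(xs)
print(labels)  # e.g., [1 1 1 2 2 2 0 0] -- one label per "column"

Notice that we never told HAC there were three columns; the distance threshold alone determined the number of clusters.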

 

That said, implementing agglomerative clustering by hand is a good exercise if you are interested in learning more about machine learning and data science techniques. I’ve had to implement HAC 3-4 times in my career, predominantly when I was in college.

If you want to learn more about HAC, I recommend the scikit-learn documentation and the excellent article by Cory Maklin.

For a more theoretical and mathematically motivated treatment of agglomerative clustering, including clustering algorithms in general, I highly recommend Data Mining: Practical Machine Learning Tools and Techniques by Witten et al. (2011).

Configuring Your Development Environment

To follow this guide, you need to have the OpenCV library installed on your system.

Luckily, OpenCV is pip-installable:

$ pip install opencv-contrib-python

If you need help configuring your development environment for OpenCV, we highly recommend that you read our pip install OpenCV guide — it will have you up and running in a matter of minutes.

Having Problems Configuring Your Development Environment?

Figure 2: Having trouble configuring your dev environment? Want access to pre-configured Jupyter Notebooks running on Google Colab? Be sure to join PyImageSearch University — you’ll be up and running with this tutorial in a matter of minutes.

All that said, are you:

  • Short on time?
  • Learning on your employer’s administratively locked system?
  • Wanting to skip the hassle of fighting with the command line, package managers, and virtual environments?
  • Ready to run the code right now on your Windows, macOS, or Linux system?

Then join PyImageSearch University today!

Gain access to Jupyter Notebooks for this tutorial and other PyImageSearch guides that are pre-configured to run on Google Colab’s ecosystem right in your web browser! No installation required.

And best of all, these Jupyter Notebooks will run on Windows, macOS, and Linux!

Project Structure

Start by accessing the “Downloads” section of this tutorial to retrieve the source code and example images.

Next, let’s review our project directory structure:

|-- michael_jordan_stats.png
|-- multi_column_ocr.py
|-- results.csv

As you can see, our project directory structure is quite simple — but don’t let the simplicity fool you!

Our multi_column_ocr.py script will accept an input image, michael_jordan_stats.png, detect the data table, extract it, and then OCR it, associating rows and columns along the way.

For reference, our example image is a scan of a Michael Jordan baseball card (Figure 3), from the year he took off from basketball to play baseball after his father died.

Figure 3: Sample output of applying multi-column OCR. Each column is correctly detected, and each cell in a row is assigned to a particular column.

Our Python script can OCR the table, parse out his stats, and then output the OCR’d text as a CSV file (results.csv).

Installing Required Packages

Our Python script will display a nicely formatted table of OCR’d text to our terminal. Still, we need to utilize the tabulate Python package to generate this formatted table.

You can install tabulate using the following command:
$ workon your_env_name # optional
$ pip install tabulate

If you are using Python virtual environments, don’t forget to use the workon command (or equivalent) to access your virtual environment before installing tabulate (otherwise, your import command will fail).

Again, the tabulate package is used for display purposes only and does not impact our actual multi-column OCR algorithm. If you do not wish to install tabulate, that’s perfectly fine. You will just need to comment out the 2-3 lines of code that utilize it in our Python script.

Implementing Multi-Column OCR

We are now ready to implement multi-column OCR! Open the multi_column_ocr.py file in your project directory structure, and let’s get to work:
# import the necessary packages
from sklearn.cluster import AgglomerativeClustering
from pytesseract import Output
from tabulate import tabulate
import pandas as pd
import numpy as np
import pytesseract
import argparse
import imutils
import cv2

We start by importing our required Python packages. We have several packages we haven’t worked with before (or at the very least, not often), so let’s review the important ones.

First, we have our AgglomerativeClustering implementation from scikit-learn. We’ll use HAC to cluster text into columns, as discussed earlier. This implementation will allow us to do just that.

The tabulate package will allow us to print a nicely formatted table of data to our terminal after OCR has been performed. This package is optional, so if you don’t want to install it, simply comment out the import line and Line 178 later in our implementation.

We then have the pandas library, which is common amongst data scientists. We’ll use pandas in this lesson for its ability to construct and manage tabular data easily.

The rest of the imports should look familiar to you, as we have used each of them many times throughout our lessons.

Let’s move on to our command line arguments:

# construct the argument parser and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image to be OCR'd")
ap.add_argument("-o", "--output", required=True,
	help="path to output CSV file")
ap.add_argument("-c", "--min-conf", type=int, default=0,
	help="minimum confidence value to filter weak text detection")
ap.add_argument("-d", "--dist-thresh", type=float, default=25.0,
	help="distance threshold cutoff for clustering")
ap.add_argument("-s", "--min-size", type=int, default=2,
	help="minimum cluster size (i.e., # of entries in column)")
args = vars(ap.parse_args())

As you can see, we have several command line arguments. Let’s discuss each of them:

  • --image: The path to the input image containing the table, spreadsheet, etc., that we want to detect and OCR
  • --output: Path to the output CSV file that will contain the column data we extracted
  • --min-conf: Used to filter out weak text detections
  • --dist-thresh: Distance threshold cutoff (in pixels) to use when applying HAC; you may need to tune this value for your images and datasets
  • --min-size: Minimum number of data points in a cluster for it to be considered a column

The most important command line arguments here are --dist-thresh and --min-size.

When applying HAC, we need to use a distance threshold cutoff. If you allow clustering to continue indefinitely, HAC will continue to cluster at each iteration until you end up, trivially, with one cluster that contains all data points.

Instead, you apply a distance threshold to stop the clustering process when no two clusters have a distance less than the threshold.
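
To make the effect of the threshold concrete, here is a small illustrative sketch (synthetic 1-D points, not part of the downloadable code) showing how the same points collapse into fewer clusters as the threshold grows:

# illustrate how the distance threshold controls the cluster count
import numpy as np
from sklearn.cluster import AgglomerativeClustering

pts = np.array([0, 5, 50, 55, 120]).reshape(-1, 1)

for t in (10.0, 60.0, 200.0):
	hac = AgglomerativeClustering(n_clusters=None, linkage="complete",
		distance_threshold=t)
	hac.fit(pts)
	print(f"threshold={t}: {hac.n_clusters_} clusters")

With this data, the loop reports three, two, and then one cluster as the threshold grows.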

For your purposes, you’ll need to examine the image data you are working with. If your tabular data have large amounts of whitespace between each row, then you’ll likely need to increase the --dist-thresh (Figure 4, left).

Otherwise, if there is less whitespace between each row, the --dist-thresh could decrease accordingly (Figure 4, right).

Figure 4: Left: The larger the distance between text in cells, the larger your --dist-thresh should be. Right: The smaller the distance between text in cells, the smaller the --dist-thresh can be.

Setting your --dist-thresh properly is paramount to OCR’ing multi-column data; be sure to experiment with different values.

The --min-size command line argument is also important. At each iteration of our clustering algorithm, HAC will examine two clusters, each of which could contain multiple data points or just a single data point. If the distance between the two clusters is less than the --dist-thresh, HAC will merge them.

However, there will always be outliers, pieces of text that are far outside the table, or just noise in the image. If Tesseract detects this text, then HAC will try to cluster it. But is there a way to prevent these clusters from being reported in our results?

A simple way is to check the number of text data points inside a given cluster after HAC is complete.

In this case, we set --min-size=2, meaning that if a cluster has two or fewer data points inside of it, we consider it an outlier and ignore it. You may need to tweak this variable for your applications, but I suggest tuning --dist-thresh first.
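
As a rough sketch of that post-filtering idea (the variable names here are illustrative, not from the final script), you would count the members of each HAC label and keep only the sufficiently large clusters:

# keep only clusters with more than min_size members
import numpy as np

min_size = 2  # matches the --min-size default
labels = np.array([0, 0, 0, 1, 1, 1, 2])  # pretend HAC output; label 2 is a lone outlier

keep = [int(l) for l in np.unique(labels)
	if np.count_nonzero(labels == l) > min_size]
print(keep)  # [0, 1] -- the singleton cluster 2 is discarded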

With our command line arguments taken care of, let’s start our image processing pipeline:

# set a seed for our random number generator
np.random.seed(42)

# load the input image and convert it to grayscale
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

Line 27 sets the seed of our pseudorandom number generator. We do this to generate colors for each column of text we detect (useful for visualization purposes).

We then load our input --image from disk and convert it to grayscale.

Our next code block detects large blocks of text in our image, taking a similar approach to our tutorial on OCR’ing passports:
# initialize a rectangular kernel that is ~5x wider than it is tall,
# then smooth the image using a 3x3 Gaussian blur and then apply a
# blackhat morphological operator to find dark regions on a light
# background
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (51, 11))
gray = cv2.GaussianBlur(gray, (3, 3), 0)
blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, kernel)

# compute the Scharr gradient of the blackhat image and scale the
# result into the range [0, 255]
grad = cv2.Sobel(blackhat, ddepth=cv2.CV_32F, dx=1, dy=0, ksize=-1)
grad = np.absolute(grad)
(minVal, maxVal) = (np.min(grad), np.max(grad))
grad = (grad - minVal) / (maxVal - minVal)
grad = (grad * 255).astype("uint8")

# apply a closing operation using the rectangular kernel to close
# gaps in between characters, apply Otsu's thresholding method, and
# finally a dilation operation to enlarge foreground regions
grad = cv2.morphologyEx(grad, cv2.MORPH_CLOSE, kernel)
thresh = cv2.threshold(grad, 0, 255,
cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
thresh = cv2.dilate(thresh, None, iterations=3)
cv2.imshow("Thresh", thresh)

First, we construct a large rectangular kernel that is ~5x wider than it is tall (Line 37). We then apply a small Gaussian blur to the image and compute the blackhat output, revealing dark regions on a light background (i.e., dark text on a light background).

The next step is to compute the gradient magnitude representation of the blackhat image and scale the output image back to the range [0, 255] (Lines 43-47).

We can now apply a closing operation (Line 52). We use our large rectangular kernel here to close gaps between rows of text in the table.

We finish the pipeline by thresholding the image and applying a series of dilations to enlarge foreground regions.

Figure 5 displays the output of this pre-processing pipeline. On the left, we have our original input image. This image is the back of a Michael Jordan baseball card (from the year he left the NBA to play baseball). Since they didn’t have any baseball stats for Jordan, they included his basketball stats on the back of the card, despite it being a baseball card (weird, I know, which is also why his baseball cards are collector’s items).

Figure 5: Left: Our input image contains a table of data that we want to apply multi-column OCR to. Right: Output of applying a closing morphological operation to detect large blocks of text. The largest foreground region is the “Complete NBA Record” table we’re interested in and want to OCR.

Our goal is to OCR his “Complete NBA Record” table. And if you look at Figure 5 (right), you can see that the table is visible as one big rectangular-like blob.

Our next code block handles detecting and extracting this table:

# find contours in the thresholded image and grab the largest one,
# which we will assume is the stats table
cnts = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL,
	cv2.CHAIN_APPROX_SIMPLE)
cnts = imutils.grab_contours(cnts)
tableCnt = max(cnts, key=cv2.contourArea)

# compute the bounding box coordinates of the stats table and extract
# the table from the input image
(x, y, w, h) = cv2.boundingRect(tableCnt)
table = image[y:y + h, x:x + w]

# show the original input image and extracted table to our screen
cv2.imshow("Input", image)
cv2.imshow("Table", table)

First, we detect all contours in our thresholded image (Lines 60-62). We then find the single largest contour based on the area of the contour (Line 63).

We’re assuming that the single largest contour is our table, and as we can verify from Figure 5 (right), that is indeed the case. However, you may need to update the contour processing to find the table using other methods for your projects. The method covered here is certainly not a one-size-fits-all solution.

Given the contour corresponding to the table, tableCnt, we compute its bounding box and extract the table itself (Lines 67 and 68).

The extracted table is then displayed on our screen on Line 72 (Figure 6).

Figure 6: Using image processing techniques, we can automatically detect the largest table present inside our image.

Now that we have our statistics table, let’s OCR it:
# set the PSM mode to treat the table as a single uniform block of
# text, and then localize text in the table
options = "--psm 6"
results = pytesseract.image_to_data(
	cv2.cvtColor(table, cv2.COLOR_BGR2RGB),
	config=options,
	output_type=Output.DICT)

# initialize a list to store the (x, y)-coordinates of the detected
# text along with the OCR'd text itself
coords = []
ocrText = []

Lines 76-80 use Tesseract to compute bounding box locations for each piece of text in the table.

Additionally, notice how we are using --psm 6 here; this particular page segmentation mode works well when the text is a single uniform block.

Most tables of data are going to be uniform: they will use near-identical fonts and font sizes.

If --psm 6 is not working well for you, try the other PSM modes covered in the tutorial Tesseract Page Segmentation Modes (PSMs) Explained: How to Improve Your OCR Accuracy. Specifically, I suggest taking a look at --psm 11 for detecting sparse text — that mode could work well for tabular data as well.
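
If you want to compare modes quickly, a throwaway loop like the following makes it easy to eyeball each candidate (a sketch only: table.png is a hypothetical crop of your extracted table):

# quick-and-dirty PSM comparison on a cropped table image
import cv2
import pytesseract

table = cv2.imread("table.png")  # hypothetical path to your table crop
rgb = cv2.cvtColor(table, cv2.COLOR_BGR2RGB)

for psm in (4, 6, 11, 12):
	text = pytesseract.image_to_string(rgb, config=f"--psm {psm}")
	print(f"--psm {psm}: OCR'd {len(text.split())} words")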

After applying OCR text detection, we initialize two lists: coords and ocrText. The coords list will store the (x, y)-coordinates of the text bounding boxes, while ocrText will store the actual OCR’d text itself.

Let’s move on to looping over each of our text detections:

# loop over each of the individual text localizations
for i in range(0, len(results["text"])):
	# extract the bounding box coordinates of the text region from
	# the current result
	x = results["left"][i]
	y = results["top"][i]
	w = results["width"][i]
	h = results["height"][i]

	# extract the OCR text itself along with the confidence of the
	# text localization
	text = results["text"][i]
	conf = int(results["conf"][i])

	# filter out weak confidence text localizations
	if conf > args["min_conf"]:
		# update our text bounding box coordinates and OCR'd text,
		# respectively
		coords.append((x, y, w, h))
		ocrText.append(text)

Lines 91-94 extract the bounding box coordinates from the detected text region, while Lines 98 and 99 extract the OCR’d text itself, along with the confidence of the text detection.

Line 102 filters out weak text detections. If the conf is less than our --min-conf, we ignore the text detection. Otherwise, we update our coords and ocrText lists, respectively.

We can now move on to the clustering phase of the project:

# extract all x-coordinates from the text bounding boxes, setting the
# y-coordinate value to zero
xCoords = [(c[0], 0) for c in coords]

# apply hierarchical agglomerative clustering to the coordinates
clustering = AgglomerativeClustering(
	n_clusters=None,
	affinity="manhattan",
	linkage="complete",
	distance_threshold=args["dist_thresh"])
clustering.fit(xCoords)

# initialize our list of sorted clusters
sortedClusters = []
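
One version note: newer releases of scikit-learn renamed AgglomerativeClustering’s affinity parameter to metric. If the call above raises a TypeError on your installation, pass metric="manhattan" instead of affinity="manhattan"; the behavior is the same.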

Line 110 extracts all x-coordinates from the text bounding boxes. Notice how we are forming a proper (x, y)-coordinate tuple here but setting y = 0 for each entry. Why is y set to 0?

The answer is two-fold:

  1. First, to apply HAC, we need a set of input vectors (also called “feature vectors”). Our input vectors must be at least 2-D, so we add a trivial dimension containing a value of 0.
  2. Secondly, we aren’t interested in the y-coordinate value. We only want to cluster on the x-coordinate positions. Pieces of text with similar x-coordinates are likely to be part of a column in a table.
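
As an aside, an equivalent way to satisfy the 2-D input requirement, shown here purely for illustration, is to reshape the raw x-coordinates with NumPy instead of padding a dummy y-value:

# equivalent alternative: build an (n_samples, 1) array of x-coordinates
# rather than (x, 0) tuples (assumes the coords list from the script above)
import numpy as np

xCoords = np.array([c[0] for c in coords]).reshape(-1, 1)

Either representation feeds HAC the same information; only the x-coordinate varies.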

From there, Lines 113-118 apply hierarchical agglomerative clustering. We set n_clusters to None since we do not know how many clusters we want to generate — we instead wish HAC to form clusters naturally using our distance_threshold. We stop the clustering process once no two clusters have a distance less than --dist-thresh.

Also, note that we are using the Manhattan distance function here. Why not some other distance function (e.g., Euclidean)?

While you can (and should) experiment with other distance metrics, Manhattan tends to be an appropriate choice here. We want to be very stringent in our requirement that x-coordinates lie close together. But again, I suggest you experiment with other distance methods.

For more details on scikit-learn’s AgglomerativeClustering implementation, including a full review of the parameters and arguments, be sure to refer to scikit-learn’s documentation on the method.

Now that our clustering is complete, let’s loop over each of the unique clusters:

# loop over all clusters
for l in np.unique(clustering.labels_):
	# extract the indexes for the coordinates belonging to the
	# current cluster
	idxs = np.where(clustering.labels_ == l)[0]

	# verify that the cluster is sufficiently large
	if len(idxs) > args["min_size"]:
		# compute the average x-coordinate value of the cluster and
		# update our clusters list with the current label and the
		# average x-coordinate
		avg = np.average([coords[i][0] for i in idxs])
		sortedClusters.append((l, avg))

# sort the clusters by their average x-coordinate and initialize our
# data frame
sortedClusters.sort(key=lambda x: x[1])
df = pd.DataFrame()
Line 124 loops over all unique cluster labels. We then use NumPy to find the indexes of all data points with the current label, l (Line 127), thus implying they all belong to the same cluster.

Line 130 then verifies that the current cluster has more than --min-size items inside of it.

We then compute the average of the x-coordinates inside the cluster and update our sortedClusters list with a 2-tuple containing the current label, l, along with the average x-coordinate value.

Line 139 sorts our sortedClusters list based on the average x-coordinate. We perform this sorting operation such that our clusters are sorted left-to-right across the page.

Finally, we initialize an empty DataFrame to store our multi-column OCR results.

Let’s now loop over our sorted clusters:

# loop over the clusters again, this time in sorted order
for (l, _) in sortedClusters:
	# extract the indexes for the coordinates belonging to the
	# current cluster
	idxs = np.where(clustering.labels_ == l)[0]

	# extract the y-coordinates from the elements in the current
	# cluster, then sort them from top-to-bottom
	yCoords = [coords[i][1] for i in idxs]
	sortedIdxs = idxs[np.argsort(yCoords)]

	# generate a random color for the cluster
	color = np.random.randint(0, 255, size=(3,), dtype="int")
	color = [int(c) for c in color]

Line 143 loops over the labels in our sortedClusters list. We then find the indexes (idxs) of all data points belonging to the current cluster label, l.

Using the indexes, we extract the y-coordinates from all pieces of text in the cluster and sort them from top-to-bottom (Lines 150 and 151).

We also initialize a random color for the current column (so we can visualize which pieces of text belong to which column).

Let’s loop over each of the pieces of text in the column now:

	# loop over the sorted indexes
	for i in sortedIdxs:
		# extract the text bounding box coordinates and draw the
		# bounding box surrounding the current element
		(x, y, w, h) = coords[i]
		cv2.rectangle(table, (x, y), (x + w, y + h), color, 2)

	# extract the OCR'd text for the current column, then construct
	# a data frame for the data where the first entry in our column
	# serves as the header
	cols = [ocrText[i].strip() for i in sortedIdxs]
	currentDF = pd.DataFrame({cols[0]: cols[1:]})

	# concatenate *original* data frame with the *current* data
	# frame (we do this to handle columns that may have a varying
	# number of rows)
	df = pd.concat([df, currentDF], axis=1)

For each sortedIdx, we grab the bounding box (x, y)-coordinates for the piece of text and draw it on our table (Lines 158-162).

We then grab all pieces of ocrText in the current cluster, sorted from top to bottom (Line 167). These pieces of text represent one unique column (cols) in the table.

Now that we’ve extracted the current column, we create a separate DataFrame for it, assuming that the first entry in the column is the header and the rest is the data (Line 168).

Finally, we concatenate our original data frame, df, with our new data frame, currentDF (Line 173). We perform this concatenation operation to handle cases where some columns will have more rows than others (either naturally, due to table design, or because the Tesseract OCR engine missed a piece of text in a row).

At this point, our table OCR process is complete, and we just need to save the table to disk:

# replace NaN values with an empty string and then show a nicely
# formatted version of our multi-column OCR'd text
df.fillna("", inplace=True)
print(tabulate(df, headers="keys", tablefmt="psql"))
 
# write our table to disk as a CSV file
print("[INFO] saving CSV file to disk...")
df.to_csv(args["output"], index=False)
 
# show the output image after performing multi-column OCR
cv2.imshow("Output", image)
cv2.waitKey(0)

If some columns have more entries than others, then the empty entries will be filled with a “Not a Number” (NaN) value. We replace all NaN values with an empty string on Line 177.

Line 178 displays a nicely formatted table to our terminal, demonstrating that we’ve OCR’d the multi-column data successfully.

We then save our data frame to disk as a CSV file on Line 182.

Finally, we display our output image to our screen on Line 185.

Multi-Column OCR Results

We are now ready to see our multi-column OCR script in action!

Open a terminal and execute the following command:

$ python multi_column_ocr.py --image michael_jordan_stats.png --output results.csv
+----+---------+---------+-----+---------+-------+-------+-------+-------+-------+--------+
| | Year | CLUB | G | FG% | REB | AST | STL | BLK | PTS | AVG. |
|----+---------+---------+-----+---------+-------+-------+-------+-------+-------+--------|
| 0 | 1984-85 | CHICAGO | 82 | 515 | 534 | 481 | 196 | 69 | 2313 | 282 |
| 1 | 1985-86 | CHICAGO | 18 | 0.457 | 64 | 53 | ar | A | 408 | 227 |
| 2 | 1986-87 | CHICAGO | 82 | 482 | 430 | 377 | 236 | 125 | 3041 | 37.1 |
| 3 | 1987-88 | CHICAGO | 82 | 535 | 449 | 485 | 259 | 131 | 2868 | 35 |
| 4 | 1988-89 | CHICAGO | 81 | 538 | 652 | 650 | 234 | 65 | 2633 | 325 |
| 5 | 1989-90 | CHICAGO | 82 | 526 | 565 | 519 | 227 | 54 | 2763 | 33.6 |
| 6 | TOTALS | | 427 | 516 | 2694 | 2565 | 1189 | 465 | 14016 | 328 |
+----+---------+---------+-----+---------+-------+-------+-------+-------+-------+--------+
[INFO] saving CSV file to disk...

Figure 7 (top) displays the extracted table. We then apply hierarchical agglomerative clustering (HAC) to OCR the table, resulting in the bottom figure.

Figure 7: Top: The input table of data we want to apply multi-column OCR. Bottom: The output of applying multi-column OCR. Text bounding boxes with the same color belong to the same column, demonstrating that our algorithm can associate text with specific columns.

Notice how we have color-coded the columns of text. Text with the same bounding box color belongs to the same column, demonstrating that our HAC method successfully associated columns of text together.

A text version of our OCR’d table can be seen in the terminal output above. Our OCR results are very accurate for the most part, but there are a few notable issues to call out.

To start, the field goal percentage (FG%) column is missing the decimal point for all but one row, likely because the image was too low resolution and Tesseract could not successfully detect the decimal point. Luckily, that is an easy fix: use basic string parsing or a regular expression to insert the missing decimal point, and then cast the string to a float data type.
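
As a hedged sketch of that fix (assuming the FG% values come back as strings such as 515 or .457, as in the output above):

# hypothetical post-processing for the FG% column: ensure a leading
# decimal point, then cast to float
def fix_fg_pct(s):
	s = s.strip()
	if not s.startswith("."):
		s = "." + s
	return float(s)

print(fix_fg_pct("515"))   # 0.515
print(fix_fg_pct(".457"))  # 0.457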

The same issue can be found in the average points per game (AVG.) column — again, the decimal point is missing.

This one is slightly harder to resolve, though. If we were to apply the same fix and simply insert a leading decimal point, the text 282 would be incorrectly parsed as 0.282 rather than the intended 28.2. Instead, what we can do is (see the sketch after this list):
  1. Check the length of the string
  2. If the string has four characters, then the decimal point has already been added, so no further work is required
  3. Otherwise, the string has three characters, so insert a decimal point between the second and third characters
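
A minimal sketch of that heuristic (assuming the three- or four-character values seen in this column, such as 282 or 37.1):

# hypothetical fix for the AVG. column: insert a decimal point between
# the second and third characters when it is missing
def fix_avg(s):
	s = s.strip()
	if len(s) == 4:  # e.g., "37.1" -- decimal point already present
		return float(s)
	return float(s[:2] + "." + s[2:])  # e.g., "282" -> 28.2

print(fix_avg("282"))   # 28.2
print(fix_avg("37.1"))  # 37.1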

Any time you can leverage a priori knowledge regarding the OCR task at hand, the easier time you’ll have. In this case, our domain knowledge tells us which columns should have decimal points, so we can leverage text post-processing heuristics to further improve our OCR results, even when Tesseract performs less than optimally.

The most significant issues with our OCR results can be found in the STL and BLK columns, where the OCR results are simply incorrect. We cannot do much about that, since it is a problem with Tesseract’s results rather than our column-grouping algorithm.

Since this tutorial focuses specifically on OCR’ing multi-column data, and most importantly, the algorithm powering it, we’re not going to spend a ton of time focusing on improvements to Tesseract options and configurations here.

After our Python script has been executed, we have an output results.csv file containing our table serialized to disk as a CSV file. Let’s take a look at its contents:
$ cat results.csv
Year,CLUB,G,FG%,REB,AST,STL,BLK,PTS,AVG.
1984-85,CHICAGO,82,515,534,481,196,69,2313,282
1985-86,CHICAGO,18,.457,64,53,ar,A,408,227
1986-87,CHICAGO,82,482,430,377,236,125,3041,37.1
1987-88,CHICAGO,82,535,449,485,259,131,2868,35.0
1988-89,CHICAGO,81,538,652,650,234,65,2633,325
1989-90,CHICAGO,82,526,565,519,227,54,2763,33.6
TOTALS,,427,516,2694,2565,1189,465,14016,328

As you can see, our OCR’d table has been written to disk as a CSV file. You can take this CSV file and process it using data mining techniques.
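
For example, a few lines of pandas (using the column names from the CSV above) are enough to start querying the results:

# load the OCR'd table and run a quick query against it
import pandas as pd

df = pd.read_csv("results.csv")

# seasons where Jordan played all 82 games, per the OCR'd "G" column
print(df[df["G"] == 82][["Year", "G", "PTS"]])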

What's next? I recommend PyImageSearch University.

Course information:
35+ total classes • 39h 44m video • Last updated: April 2022
★★★★★ 4.84 (128 Ratings) • 13,800+ Students Enrolled

I strongly believe that if you had the right teacher you could master computer vision and deep learning.

Do you think learning computer vision and deep learning has to be time-consuming, overwhelming, and complicated? Or has to involve complex mathematics and equations? Or requires a degree in computer science?

That’s not the case.

All you need to master computer vision and deep learning is for someone to explain things to you in simple, intuitive terms. And that’s exactly what I do. My mission is to change education and how complex Artificial Intelligence topics are taught.

If you're serious about learning computer vision, your next stop should be PyImageSearch University, the most comprehensive computer vision, deep learning, and OpenCV course online today. Here you’ll learn how to successfully and confidently apply computer vision to your work, research, and projects. Join me in computer vision mastery.

Inside PyImageSearch University you'll find:

  •  35+ courses on essential computer vision, deep learning, and OpenCV topics
  •  35+ Certificates of Completion
  •  39+ hours of on-demand video
  •  Brand new courses released regularly, ensuring you can keep up with state-of-the-art techniques
  •  Pre-configured Jupyter Notebooks in Google Colab
  •  Run all code examples in your web browser — works on Windows, macOS, and Linux (no dev environment configuration required!)
  •  Access to centralized code repos for all 450+ tutorials on PyImageSearch
  •  Easy one-click downloads for code, datasets, pre-trained models, etc.
  •  Access on mobile, laptop, desktop, etc.

CLICK HERE TO JOIN PYIMAGESEARCH UNIVERSITY

Summary

In this tutorial, you learned how to OCR multi-column data using the Tesseract OCR engine and hierarchical agglomerative clustering (HAC).

Our multi-column OCR algorithm works by:

  1. Detecting tables of text in an input image using gradients and morphological operations
  2. Extracting the detected table
  3. Using Tesseract (or equivalent) to localize text in the table and extract the bounding box (x, y)-coordinates of the text in the table
  4. Applying HAC to cluster on the x-coordinate of the table with a maximum distance threshold cutoff

Essentially what we are doing here is grouping text localizations with x-coordinates that are either identical or very close to each other.

Why does this method work?

Well, consider the structure of a spreadsheet, table, or any other document that has multiple columns. In that case, the data in each column will have near identical starting x-coordinates because they belong to the same column. We can thus exploit that knowledge and then cluster groups of text together with near-identical x-coordinates.

While this method is simplistic, it works quite well in practice.

Citation Information

Rosebrock, A. “Multi-Column Table OCR,” PyImageSearch, 2022, https://pyimg.co/h18s2

@article{Rosebrock_2022_MCT_OCR,
  author = {Adrian Rosebrock},
  title = {Multi-Column Table {OCR}},
  journal = {PyImageSearch},
  year = {2022},
}

Want free GPU credits to train models?

  • We used Jarvislabs.ai, a GPU cloud, for all the experiments.
  • We are proud to offer PyImageSearch University students $20 worth of Jarvislabs.ai GPU cloud credits. Join PyImageSearch University and claim your $20 credit here.

In deep learning, we need to train neural networks. These networks can be trained on a CPU, but doing so takes a long time, and sometimes they do not even fit in CPU memory.

To overcome this problem, we use GPUs, which train neural networks quickly. The problem is that GPUs are expensive and become outdated fast, so you don’t want to buy one and use it only occasionally. Cloud GPUs let you use a GPU and pay only for the time it is running. It’s a brilliant idea that saves you money.

JarvisLabs provides the best-in-class GPUs, and PyImageSearch University students get between 10-50 hours on a world-class GPU (time depends on the specific GPU you select).

This gives you a chance to test-drive a monstrously powerful GPU on any of our tutorials in a jiffy. So join PyImageSearch University today and try it for yourself.

CLICK HERE TO GET JARVISLABS CREDITS NOW

To download the source code to this post (and be notified when future tutorials are published here on PyImageSearch), simply enter your email address in the form below!