When getting the data is the biggest problem
IDC estimates that data will grow 50-fold by 2020, while the number of 'information containers' will multiply by 75. This growth presents two basic problems, of course: first, how to handle the data; and, second, what to do with it.
Or so, at least, say the developer and the data scientist. Both the Hadoop enthusiast and the R-scripting researcher just want to peruse the data, and listen to what the data says. But from the business angle, this order is reversed. Business goals determine software methods; and in most cases 'Big Data' is really the handmaiden of 'Business Intelligence'.
This means that, for the developer really interested in serving business interests, 'the data' isn't just a given, to be mined for useful insights. If the available data-points don't serve business interests, then all the analytics in the world are useless. Wherefore the business-embedded software developer must consider carefully -- not only how to analyze the data-chunks, but also how to make the right data-chunks available for analysis.
So let's think about one of the most annoying, intractable pain-points every business-oriented software developer still has to face:
Volumes of business-critical data-points are still locked away in physical documents, or in other 'locked' formats inaccessible to most big data analytics systems and search engines (including images from mobile device cameras, image-only PDFs, TIFFs, and JPEGs).
Those old 'paperless office' promises are beginning to wear thin: physical documents and other sources of locked information are appearing faster than ever before.
At first glance, perhaps, the pure software technician might think this is someone else's problem. 'Let the paper people worry about the paper! I deal only in bits.' But this is not a serious answer -- not for the scientific researcher, who seeks all the knowledge in the world; and definitely not for the business-embedded developer, who seeks maximal actionable information. The historian tells the postdoctoral assistant: 'Find out how many lieutenants died in the Battle of the Somme. I don't care how; just do it.' And the Dow-Theorizing investor needs every data-point the NYSE produced in all of 1932, when they didn't have computers.
So the developer and the data-scientist now want this: a black box whose input is a physical document, and whose output is digital data.
More than sheet to disk
OK, OCR. But press harder: the developer who wants to transform locked images into data doesn't want just data; the developer really wants structured data.
And, sure, you can do your NLP and your data mining and pull some structure out of some text. But here's the other annoying thing about dealing with physical documents: all physical documents are structured, if only by relative position on the page. No physical document is as unstructured as a sheer data stream. So if your straight-up OCR just turns a document into a data stream, then you're actually losing information, because you're losing contextual information about the document: its structure, embedded images, table order, paragraph order, etc.
The problem now becomes: how do you turn an essentially physical structure, which is not rigidly standardized, into data whose structure is rigidly (and usefully) defined?
In other words, you need to use Optics to Recognize more than a Character. You also need to group the C's into Words and Numbers, Paragraphs and Prices, Titles and Identifiers -- structures that are already defined by the page, on the one hand, and demanded by Business Intelligence, on the other. But the one must be mapped onto the other.
So your black-box needs more than one function. You need a stack-box, a series of processes that translate paper to data without losing any information on the way. And your stack-box needs to extend into physical space -- in industry terms, a capture system.
Six steps from structured page to structured data
Roughly speaking, your stack-box needs to handle these phases:
- Acquire digital image
- Clean up image
- Structure image
- Recognize characters
- Verify results
- Synthesize and export
Let's take each of these phases in order.
Scanning is the traditional and most-adopted method to get paper information digitally transformed and integratable into a back-end system (e.g. search, analytics, ERP systems, etc.). The need for a good scan is obvious: if you miss image data, then you'll never reach BI-level information.
Image-capture devices typically use one of two kinds of sensors: CCD (Charge-Coupled Device) and CIS (Contact Image Sensor). CCD is classic technology: in 2009 it earned its inventors the Nobel Prize in Physics. The Nobel goes only to awesome and influential discoveries; and indeed most flatbed scanners still use CCD. Part of the reason for CCD's continued popularity is, semiconductor-transistor-like, its incredible cleverness and simplicity. CCD has longevity because it is nothing like a quick hack.
The basic photoelectric physics is (mathematically) pretty simple, and was understood reasonably well before Einstein's annus mirabilis paper (bounce light off a page, photoactive surface is charged by photon bounce-back, and the rest is explained very nicely on HowStuffWorks). CIS works basically the same way, but with one (mirror) step removed (which takes less power).
Using the right initial scanning techniques can make a big difference in character recognition accuracy. Even when a scan looks good to the naked eye, a lot of information in the image can be lost with the wrong image type or the wrong scanner settings. An initial resolution of at least 300 DPI, with the output set to grayscale or color, retains the best data within the image for OCR. Bi-tonal (black and white) scanning loses a lot of background information and may degrade recognition. For example, a simple black and white bitmap scan can reduce your ability to apply image-optimization algorithms in later steps.
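To make the resolution and bit-depth trade-off concrete, here is a small arithmetic sketch. The page size (US Letter) and the bit depths per mode are illustrative assumptions, and the byte counts are for uncompressed raster data only.

```python
def scan_size(width_in, height_in, dpi, bits_per_pixel):
    """Return (width_px, height_px, uncompressed_bytes) for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, (total_bits + 7) // 8  # round up to whole bytes


# Assumed example: a US Letter page (8.5 x 11 inches) scanned at 300 DPI.
for label, bpp in (("bi-tonal", 1), ("grayscale", 8), ("color", 24)):
    w, h, size = scan_size(8.5, 11, 300, bpp)
    print(f"{label:9s} {w}x{h} px, {size / 1e6:.1f} MB uncompressed")
```

The grayscale scan carries eight times the raw data of the bi-tonal one; that extra data is what later clean-up steps have to work with.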
OCR systems will typically take the scanned image and prepare it for OCR by converting it to a binary image -- a black and white, bi-tonal raster image. To create this, OCR technology uses an intelligent background filtering with adaptive binarization. Much of the additional information in a grayscale or color image is used by the system to create this black and white image so that the text in the background is separated and the individual characters are kept intact.
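As a rough illustration of what adaptive binarization means, the sketch below compares each pixel with the mean brightness of its local neighbourhood rather than with one global cut-off. This is a generic local-mean threshold, not the filter any particular OCR engine uses; the function name, the `window` and `offset` parameters, and the synthetic test image are all assumptions made for the example.

```python
def adaptive_binarize(gray, window=15, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 values.

    A pixel becomes ink (1) when it is darker than the mean of its local
    window by more than `offset`; otherwise it is background (0).
    """
    h, w = len(gray), len(gray[0])
    # Integral image: integral[y][x] holds the sum of gray[0:y][0:x].
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum
    r = window // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            # Integer form of: pixel < local_mean - offset
            if gray[y][x] * area < total - offset * area:
                out[y][x] = 1
    return out


# Synthetic page: the background darkens from left to right (a shadow),
# with two vertical strokes that are 70 levels darker than the background.
gray = [[(200 - 3 * x) - (70 if x in (5, 35) else 0) for x in range(40)]
        for _ in range(20)]
binary = adaptive_binarize(gray)
print([x for x, v in enumerate(binary[0]) if v])  # [5, 35]
```

A single global threshold of 128 would mark the whole shadowed right-hand side of this page as ink, which is why the grayscale information is worth keeping until this step.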
Designed in 1991, TWAIN is a standard software protocol and API that regulates communication between software applications and imaging devices such as scanners and digital cameras. The word TWAIN is from Rudyard Kipling's "... and never the twain shall meet ...", illustrating the technical challenge of connecting scanners to PCs in the early '90s. Other protocols include Windows Image Acquisition (WIA), Scanner Access Now Easy (SANE), and Image and Scanner Interface Specification (ISIS).
Besides scanning, there are other ways to acquire the physical world into the digital:
With the emergence of mobile device optics now reaching 5-10 mega-pixels, capture systems can now be integrated with images originating from these devices. Once the image is captured remotely, it can be compressed, encrypted and sent to server-side processing via HTTPS protocol (typically REST). These "Mobile Data Capture" systems require different types of image clean-up that we describe in the next stage.
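The hand-off from device to server might look something like the sketch below, which only builds the HTTPS request and does not send it. The endpoint URL and the function name are hypothetical; a real capture service will define its own paths, authentication, and payload format.

```python
import urllib.request
from urllib.parse import urlsplit


def build_capture_request(image_bytes, endpoint, content_type="image/jpeg"):
    """Build a POST request carrying an already-compressed image.

    Encryption in transit is left to TLS, so plain-HTTP endpoints are refused.
    """
    if urlsplit(endpoint).scheme != "https":
        raise ValueError("capture uploads should go over HTTPS")
    return urllib.request.Request(
        endpoint,
        data=image_bytes,
        method="POST",
        headers={"Content-Type": content_type, "Accept": "application/json"},
    )


# Hypothetical endpoint, for illustration only.
request = build_capture_request(b"\xff\xd8...jpeg bytes...",
                                "https://capture.example.com/v1/documents")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual upload.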
Also, images can be acquired in bulk from memory or storage, if large numbers of pre-existing images need to be processed from an archive.
2. Clean-up (Image Pre-processing)
So far, so physics. Slick and fun, but not the professional developer's worry. Now your amps and your pixels are in sync: your document is an image.
But the one-to-one blackness-mapping from paper to scan isn't ready for meaningful processing just yet. Your eyes do a lot of image-pre-cleaning too, even before the action potentials cascade to the brain -- from pupil contraction to foveal specialization, to simply getting used to the dark; and what happens once image-data does reach the brain is anyone's massive field of methodologically- and experimentally-tangled hypereducated guess. So at this stage your software needs to keep acting eye-like and do something smarter than straight-up photoelectrics. Your stack needs to clean the scanned image quite a bit before it can even begin to think about the document's structure. Think of documents:
And your software will need to do a lot at this stage in standard scanning environments:
- de-skewing (think of scanning a book with tight binding; or photographing a page from an angle; or weird, arbitrarily warped documents, like ancient scrolls)
- clipping (if the scanned page does not fill the entire scan space: the CCD doesn't know that the scanner lid doesn't contain any useful information)
- rotating (because character recognition depends on finely-tuned angle-measurement, among other things)
- despeckling (paper is porous and rough, and often less than perfectly clean; various algorithms are used)
- straightening & splitting (think of that scanned book, with two leaves)
- adjusting brightness (still thinking of that book: the shadows near the middle don't mean anything; the simplest technique used here, binarization, is quite well understood and has been the subject of lucid comparative surveys, including a rather aged one from 2004)
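Two of the steps above, despeckling and clipping, are simple enough to sketch on a binarized image. This is a deliberately naive version (isolated-pixel removal plus a bounding-box crop); production systems use more robust filters, and the function names and the `min_neighbors` parameter are assumptions for illustration.

```python
def despeckle(binary, min_neighbors=1):
    """Clear ink pixels with fewer than `min_neighbors` ink neighbours (8-connected)."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            neighbors = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < min_neighbors:
                out[y][x] = 0
    return out


def clip_to_content(binary):
    """Crop to the bounding box of the ink; return the input unchanged if blank."""
    rows = [y for y, row in enumerate(binary) if any(row)]
    if not rows:
        return binary
    cols = [x for x in range(len(binary[0])) if any(row[x] for row in binary)]
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


# A 6x8 "scan": one speck in the corner and a 2x3 block of real content.
page = [[0] * 8 for _ in range(6)]
page[0][0] = 1
for y in (2, 3):
    for x in (3, 4, 5):
        page[y][x] = 1

for row in clip_to_content(despeckle(page)):
    print(row)
```

Order matters here: cropping before despeckling would let the stray speck stretch the bounding box back out to the corner of the scan.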
In alternate environments, like mobile, context-specific clean-up routines can be applied as well:
- ISO noise removal (speckling specific to camera optics which can affect accurate text extraction)
- key-stoning correction (perspective distortion specific to hand-held devices)
- auto edge detection (for example, the image of a driver's license may contain the wood-grain from a table-top behind the license)
- dynamic binarization & shadow removal (shadowing in a crumpled receipt image or low lighting, again common in images from mobile devices)
Of course, as a developer, you don't have time to worry about bizarre particulars like 'what kind of dust is likely to absorb extra CCD light' or 'how reflective scanner-lids' inner surfaces usually are' or 'how do I optimize for low-lighting from a mobile image of a document'. But you don't have to, because there's an excellent SDK that handles everything at this pre-processing stage.
In recent years, OCR engines have become extremely accurate, thanks chiefly to AI techniques applied to document structure. If the machine understands the document pattern as a human would understand it, this translates directly into more informed, more accurate character-level extraction. This is known in the industry as 'document analysis'.
Document analysis has additional benefits as well. It allows developers to retain the page text order and layout, effectively retaining more meta-data about the document than straight text level extraction can provide. This can inform data analytics systems with more information about the document. For example, document analysis can give information that identifies and categorizes regions of interest.
This ability to detect and identify zones, such as text body, illustrations, math symbols and tables is called semantic labeling. It essentially builds a taxonomy of the document. Consider one of the simplest meaningful spaces: a gridded table. Tables are easy to recognize visually; more importantly (for our BI purposes), tables contain very simply (two-dimensionally) structured data. Once a part of a page is recognized as a table, the rest of the sheet-to-structured-data translation can occur quite straightforwardly.
In fact, most documents are composed of just five kinds of meaningful blocks:
- text
- tables
- pictures
- barcodes
- checkboxes
(Note that not all of these actually contain words; and of those that do contain words (text and table), one (text) probably contains full sentences, while the other (table) does not. This will become important during the Character Recognition phase proper -- since intelligent OCR uses syntactic analysis and grammatically rich lexica to guess at the likely letter, just as brains do.)
Block-type recognition is crucial because each block-type encodes a particular sort of data. Text is (at this stage) unstructured, at least next to tabular data. Pictures are of course massively information-rich, and visual semantic processing is a fascinating subject whose current methods span just about every AI/machine learning method available. (You already know that applying these methods to a table would be absurdly overkill.) Barcodes are unique identifiers (as none of the other block-types are), and checkboxes provide simple binary values for variables important enough to merit their own checkbox on a page.
Each block-type offers a tell-tale visual fingerprint: tables, including their rows and columns are bounded by orthogonal (usually black) lines; pictures contain complex color gradients; text-blocks and tables both contain words, but text-blocks are usually not surrounded by printed lines.
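The table fingerprint described above (long orthogonal lines) can be approximated with a very small heuristic: look for rows and columns whose longest unbroken ink run spans most of the block. This is only a sketch under the assumption of a clean, de-skewed, binarized block; the function names and thresholds are invented for the example, and real layout analysis is considerably more involved.

```python
def longest_run(line):
    """Length of the longest unbroken run of ink pixels in a row or column."""
    best = current = 0
    for value in line:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


def find_rulings(binary, min_fraction=0.8):
    """Return (rows, cols) whose longest ink run covers min_fraction of the block."""
    h, w = len(binary), len(binary[0])
    rows = [y for y in range(h) if longest_run(binary[y]) >= min_fraction * w]
    cols = [x for x in range(w)
            if longest_run([binary[y][x] for y in range(h)]) >= min_fraction * h]
    return rows, cols


def looks_like_table(binary, min_lines=2):
    """Guess 'table' when the block has several horizontal and vertical rulings."""
    rows, cols = find_rulings(binary)
    return len(rows) >= min_lines and len(cols) >= min_lines


# A 7x9 gridded block with rulings on rows 0, 3, 6 and columns 0, 4, 8.
grid = [[1 if y in (0, 3, 6) or x in (0, 4, 8) else 0 for x in range(9)]
        for y in range(7)]
# A text-like block: short, broken runs and no rulings.
text = [[1, 0, 1, 1, 0, 1, 0, 1, 1] if y % 2 == 0 else [0] * 9 for y in range(7)]

print(find_rulings(grid), looks_like_table(grid), looks_like_table(text))
```

Once the rulings are found, their positions give the cell boundaries directly, which is what makes the table case so much easier than free text.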
Text-blocks in particular can often be subdivided further. Chapters, paragraphs, sentences and words are all meaningful units -- as are several more complex kinds of text-block-subsections, like 'table of contents' or 'picture caption'. Really intelligent software will build additional levels of hierarchy under each of these top-level block-types and verify every high-level hypothesis at every point available.
As in the previous stage, these blocking problems are pretty well-studied, and are usually solved for developers by a good SDK.
Here we finally reach OCR's R-stage. And as every page-informationalizing developer must have thought at least once: 'Why do you typesetters have to use so many different fonts? It's all just text; just make it all look the same!'
In fact, however -- and thanks to many years of research and improvement -- most modern OCR software can recognize characters at a relatively high level of abstraction. Here are the four basic kinds of classifiers used in ABBYY's FineReader engine (probably the most advanced OCR engine available today):
- raster classifier: compares the character with an image consisting of all variations on a character, superimposed in one character-space. Fast and simple (bitmapping isn't very smart) -- a good first pass.
- feature classifier: like the raster classifier, but smarter: instead of taking all possible character variations as the pattern image (as the raster classifier does), the feature classifier identifies key 'giveaways' (like a particular dot-pattern at the bottom right serif-junction of a Times New Roman capital E -- somewhat analogous to face recognition algorithms that latch on to the (naturally rare) T-shape composed of eyes, nose, and mouth, or the unique shading at the bottom of the eye-socket).
- contour classifier: like the feature classifier, but grabs onto predefined contours rather than bitmaps.
- structure classifier: the most awesome type of classifier: it actually breaks the character into constituent parts, and re-arranges the parts to match a known letter-pattern. This is the OCR technique most akin to the old-fashioned paleographical techniques scholars have long used to decipher ancient manuscripts; in fact, ABBYY developed the structure classifier to handle handwriting.
Two more classifiers are called when two very similar characters must be distinguished (D and O, for example, or sloppily written cursive l and t): the feature-differentiating and structure-differentiating classifiers.
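To give a feel for the simplest of these, the raster classifier, here is a toy version that scores a glyph against one fixed bitmap per character by counting matching pixels. It is a simplification of the idea described above (a real raster pattern superimposes many variations of each character), and it is in no way ABBYY's implementation; the templates and function names are made up for the example.

```python
def grid(*rows):
    """Turn rows of '#' and '.' into a 0/1 pixel grid."""
    return [[1 if c == "#" else 0 for c in row] for row in rows]


def raster_score(glyph, template):
    """Fraction of pixels on which two same-sized 0/1 grids agree."""
    total = agree = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            agree += (g == t)
    return agree / total


def classify(glyph, templates):
    """Return (character, score) candidates, best match first."""
    scored = [(ch, raster_score(glyph, tpl)) for ch, tpl in templates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 5x5 templates for three characters.
TEMPLATES = {
    "L": grid("#....", "#....", "#....", "#....", "#####"),
    "T": grid("#####", "..#..", "..#..", "..#..", "..#.."),
    "I": grid("..#..", "..#..", "..#..", "..#..", "..#.."),
}

# An 'L' with one speck added and one pixel missing.
noisy_l = grid("#....", "#....", "#..#.", "#....", "####.")

candidates = classify(noisy_l, TEMPLATES)
print(candidates[0][0])  # L
```

The sorted candidate list is the useful part: a fast first pass like this narrows the field, and the smarter classifiers only need to arbitrate between the top few.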
While not always used in high volume scanning systems, there are cases where human verification of results is needed for absolute accuracy.
Every good OCR system (like every good automation system) will help developers minimize human involvement, but take advantage of human input whenever possible. Character and word-level recognition results are typically stored as a sorted array of (shape, confidence) pairs, where confidence denotes how certain each recognition variant is. These confidence values can be used to highlight individual uncertain characters or words and to calculate the overall recognition confidence of a document. The developer can use this information to establish minimum confidence-thresholds, below which the document needs to be pushed into a manual verification workflow.
Quite simply: if a particular auto-recognized document has not reached the required confidence-level, then the system can simply ask a human for help with particular questionable words or characters (always a vastly smaller subset than the entire document). And then (this is key) it can remember how the human interpreted that particular character or word for the rest of the current batch of documents.
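The verification loop just described might be sketched as follows. The `Word` and `BatchVerifier` classes, the 0.85 threshold, and the `ask_human` callback are all hypothetical; in particular, this sketch remembers corrections by the machine's best guess, whereas a real engine would more likely key on the character image itself.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    variants: list  # [(text, confidence), ...], sorted best first

    @property
    def best(self):
        return self.variants[0]


@dataclass
class BatchVerifier:
    threshold: float = 0.85
    learned: dict = field(default_factory=dict)  # machine reading -> human answer

    def resolve(self, word, ask_human):
        """Accept confident words; otherwise ask a human once and remember the answer."""
        text, confidence = word.best
        if confidence >= self.threshold:
            return text
        if text not in self.learned:
            self.learned[text] = ask_human(word)
        return self.learned[text]


questions = []


def ask_human(word):
    """Stand-in for a verification UI: records the question, returns a fixed answer."""
    questions.append(word.best[0])
    return "Invoice"


batch = [
    Word([("Total", 0.99)]),
    Word([("lnvoice", 0.55), ("Invoice", 0.40)]),
    Word([("lnvoice", 0.52), ("Invoice", 0.45)]),
]
verifier = BatchVerifier()
print([verifier.resolve(w, ask_human) for w in batch], len(questions))
```

The second low-confidence word is resolved from memory, so the human is asked only once for the whole batch.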
6. Synthesize and Export
Now that the structured data is extracted from the document, it needs to be packaged up so that it can be absorbed by your analytics or archiving system.
Typically, for analytics systems or other structured, database-driven applications, Extensible Markup Language (XML) file formats are used. Most systems have APIs that are used to process XML data, and many schema systems exist to standardize and assist in the definition of XML-based languages. So you want to be sure your OCR engine supports the XML format you need for your project.
Your XML output will include an abundance of character-level data and data on the structure of the document:
- Various levels of format retention information (identification and coordinates of columns, tables, frames, fonts, font size, paragraph styles, borders, etc.).
- The order of various blocks
- Access to detailed information about each recognized character (coordinates and confidence level)
- Character parameters (what language it is, was it corrected, was a dictionary used to identify it, is it numeric vs. alpha etc.)
- Font effects (what font is it, spacing information, color, etc.)
All this is used by the end-point system to extract all the information about a digitized document for more accurate analysis.
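As an illustration of the kind of output listed above, the sketch below serializes blocks and characters (with coordinates and confidence) to XML using Python's standard library. The element and attribute names form a made-up schema for this example, not the format any OCR engine actually emits; check your engine's documentation for its real export schema.

```python
import xml.etree.ElementTree as ET


def export_page(blocks):
    """Serialize recognized blocks to an XML string, preserving block order."""
    page = ET.Element("page")
    for order, block in enumerate(blocks):
        block_el = ET.SubElement(page, "block", type=block["type"], order=str(order))
        for ch in block.get("chars", []):
            left, top, right, bottom = ch["box"]
            char_el = ET.SubElement(
                block_el, "char",
                left=str(left), top=str(top), right=str(right), bottom=str(bottom),
                confidence=f"{ch['confidence']:.2f}",
            )
            char_el.text = ch["value"]
    return ET.tostring(page, encoding="unicode")


# Hypothetical recognition results for a page with one text block and one table.
blocks = [
    {"type": "text", "chars": [
        {"value": "A", "box": (10, 20, 18, 32), "confidence": 0.97},
        {"value": "B", "box": (20, 20, 28, 32), "confidence": 0.61},
    ]},
    {"type": "table", "chars": []},
]
print(export_page(blocks))
```

Because coordinates and confidence travel with every character, the receiving system can rebuild layout or re-check doubtful characters without going back to the image.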
Further reading and practice
And now we've translated a document from useless physical page to useful digital information. Follow the numerous links in this article for deeper reading.
But since you're a developer, you probably want to do something as soon as you understand it. So check out the free-trial SDK that inspired this article: ABBYY's FineReader Engine, which includes excellent documentation and useful code samples.
And stay tuned for part 2 in this series, which dives deeper into more complex kinds of document layout semantics -- including some details on the software that turns scanned tax forms into usefully structured data.