All your senses grasp the physical world wonderfully -- but eyes dominate by far, for both biological and cultural reasons.
Human eyesight does something computers do, too: both eye-position and retina structure developed to direct the information-firehose very deliberately toward a carefully chosen, relatively narrow field. Ears aren't like that -- they can't be closed, and a good deal of the information-modeling is pre-performed in the cochlea itself. Ears pick up danger; eyes make knowledge.
And, like a machine, eyes do much better when fed domain-specifically. Optical illusions are more widely known than auditory or haptic illusions for neurobiological reasons, not just cultural history. Eyes need to know what they're looking for.
This is even more true for real-world document analysis. AI and machine-learning methods are being applied to semantic image analysis by the basketful; but most images aren't photographs, and most useful Business Intelligence is found not in photographs, but in preformatted documents.
In a previous article we touched briefly on an OCR/ICR engine's movement from sight to meaning. There 'meaning' was considered very broadly: barcodes contain different information than text-blocks, semantic subsections of text-blocks are usually set apart by headings, and so forth.
This kind of analysis does read space intelligently. It does not grasp domain-specific spatial meaning -- 'is this a 1040A form', and 'what does the variable on line 34 on Form 1040A represent', for example. But many physical documents whose layouts convey meaning visually are not composed in a general, domain-neutral code. (Line 34 always represents the same variable, on every Form 1040A.) And many of these kinds of visually-domain-specific documents contain precisely the kind of data machines can handle best -- better even than the sharpest human.
Domain-specific spatial codes: the case of forms
Consider tax forms. If you're a citizen of the United States, you'll be (too) familiar with the W-2 form already. (If not, take a look at this pdf.) All W-2s contain the same data, as mandated by law. But which data goes where (a) can't be intuited from general document-layout principles, and (b) varies significantly from form to form. (For example, line 34 may be in a different place on the California W-2 than on the New York W-2; in fact, there are about 600 variants of the W-2 format in the US.) And yet it seems intuitively obvious that the W-2 is ideally suited to digital digestion. So if 'this is a document' doesn't tell you 'and this spatial block contains your annual wages', then how can a machine tell what the annual wages are?
As you've probably guessed: this is where domain-specific spatial codes kick in.
To carry the general design principles of domain-specific languages over into document recognition, the data-extraction system must answer these three questions:
- What kind of document is it?
- What data should I extract from it?
- How do I find this data?
If you know the kind of document, you'll be able to use everything you know about similar documents to find what data to extract and how. For example, if you know that every W-2 contains salary information, and if you recognize this document as a W-2, then you know you'll be looking for salary information (along with a dozen other data-points). If you know what data to extract, you know (a) where to spend your CPU cycles, and (b) what visual reference-markers can help you. For example, if you know you want to extract the employer identification number from the W-2 form, you'll know to look for the letters 'EIN', then store the value nearest to these letters as the employer identification number. In some cases you'll probably want to validate using a regex, too -- e.g., 'every social security number needs to be in the format xxx-xx-xxxx'.
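The 'find the label, take the nearest value, validate with a regex' idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Token` type, the `nearest_match` helper, and the page coordinates are all hypothetical, and a real OCR engine would supply its own token positions. The patterns assume an EIN written as two digits, a hyphen, and seven digits, and a social security number written as three, two, and four digits.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """A recognized piece of text plus its (hypothetical) position on the page."""
    text: str
    x: float  # left edge
    y: float  # top edge


# Assumed formats: EIN as NN-NNNNNNN, SSN as NNN-NN-NNNN.
EIN_PATTERN = re.compile(r"\d{2}-\d{7}")
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def nearest_match(tokens, label, pattern):
    """Return the text of the pattern-matching token closest to the label, or None."""
    anchors = [t for t in tokens if label in t.text]
    candidates = [t for t in tokens if pattern.fullmatch(t.text)]
    if not anchors or not candidates:
        return None
    anchor = anchors[0]
    # Squared Euclidean distance is enough for ranking candidates.
    best = min(candidates, key=lambda t: (t.x - anchor.x) ** 2 + (t.y - anchor.y) ** 2)
    return best.text


tokens = [
    Token("EIN", 10, 10),
    Token("12-3456789", 10, 25),
    Token("SSN", 200, 10),
    Token("123-45-6789", 200, 25),
]
ein = nearest_match(tokens, "EIN", EIN_PATTERN)
ssn = nearest_match(tokens, "SSN", SSN_PATTERN)
```

Filtering candidates by format before measuring distance means a stray number that happens to sit closer to the label is never picked up by mistake.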
Let's examine each of these questions in more detail.
What kind of document is it?
You can approach this problem in two ways: look at the content, or look at the form. Content-based classification systems take domain-specific vocabularies as input and classify the document accordingly (commonly done in clinical contexts, for example). Those are neat, and have years of research behind them (this 2003 literature review was already quite extensive), but in many cases they are overkill: they don't take advantage of the fact that classifying by form layout first, for instance, can make NLP much more effective.
Moreover, different organizations need different levels of classification granularity. The basic data-modeling question is (and a whole industry is devoted to answering this question): What is the taxonomy of document classification a particular organization needs?
Once this is modeled, there are two distinct technologies for classification:
- image-based (form): wherein a corpus of images is stored in a database and incoming images are compared to it by form, and then classified. This tends to be fast but less accurate and not very granular.
- text-based (content): more CPU-intensive, but very granular. For example, you might use a set of anchor terms and capture the related words around each anchor. With heavier AI machinery, you might use semantic technology, i.e., mapping words to concepts. Typically each word in a given language relates to several possible concepts, so the system uses linguistic morphology and surrounding context to disambiguate the possible meanings of a given piece of text, much as a human reader does.
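As a rough illustration of the simplest text-based approach, here is a minimal Python sketch that scores a document against a set of anchor terms per class. The class names, the vocabularies in `ANCHORS`, and the `classify` function are hypothetical, and real systems weigh terms far more carefully than this raw count does.

```python
import re
from collections import Counter

# Hypothetical anchor vocabularies, one set per document class.
ANCHORS = {
    "w2": {"wages", "employer", "ein", "withheld"},
    "invoice": {"invoice", "quantity", "total", "due"},
}


def classify(text, anchors=ANCHORS):
    """Return the class whose anchor terms occur most often, or None if none occur."""
    words = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    scores = {label: sum(words[w] for w in vocab) for label, vocab in anchors.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


w2_text = "Employer EIN 12-3456789 Wages 50,000 Federal tax withheld"
invoice_text = "Invoice 991 Quantity 3 Total due 40.00"
```

Returning `None` when no anchor term appears at all keeps the classifier from forcing an unknown document into the first class on the list.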
The ideal classification engine would combine image-based, text-based, and location-based approaches (see below) to support intelligent, self-learning auto-classification. The system would be trained by loading a batch of sample documents of the same class into the system, and then educating the system on the document and text patterns.
In many cases, the form -- the physical location of items on the page -- will tell you enough to classify the document.
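To make 'classifying by physical location' concrete, here is a small Python sketch that compares where known labels were found on a page with where each template expects them. Everything named here is an assumption made for illustration: the template names, the label positions (normalized to the 0..1 range), and the distance threshold.

```python
import math

# Hypothetical templates: label -> expected (x, y) position, normalized to 0..1.
TEMPLATES = {
    "w2_variant_a": {"EIN": (0.05, 0.10), "Wages": (0.55, 0.10)},
    "w2_variant_b": {"EIN": (0.05, 0.60), "Wages": (0.05, 0.70)},
}


def layout_distance(observed, template):
    """Mean distance between observed and expected label positions.

    Returns infinity if the page is missing a label the template requires.
    """
    if any(label not in observed for label in template):
        return math.inf
    total = sum(math.dist(observed[label], template[label]) for label in template)
    return total / len(template)


def classify_by_layout(observed, templates=TEMPLATES, max_distance=0.1):
    """Return the name of the closest template, or None if nothing is close enough."""
    best = min(templates, key=lambda name: layout_distance(observed, templates[name]))
    return best if layout_distance(observed, templates[best]) <= max_distance else None


page = {"EIN": (0.06, 0.11), "Wages": (0.54, 0.10)}
```

The threshold matters as much as the ranking: without it, a page that matches no template well would still be assigned to whichever template is least bad.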
At deeper levels of granularity, especially when a single kind of document appears in numerous subtle variations, more direct control is necessary -- ideally by means of software that can create and manage rules defining how elements relate to each other within each document class you're teaching the system to recognize. But at first pass -- and simply because human brains subdivide visual space very adroitly -- working out which layout features are sufficient to classify a document by type is best done with a visual tool.
Basically, at first pass, the problem is: organizations define documents in countless, often very different ways. And each organization wants to do something different with each bit of data, even when the data-points themselves fall into the same category (e.g., one company sends invoice numbers directly to the bookkeeper, while another sends the invoice numbers to the store manager). So document-analysis systems need not just domain-specific, but even organization-specific information about document content and form. (Contrast the image on the right, which represents the output of a document-analysis system that has been trained to handle invoice document-types in particular, with the image on the left, which is simply trying to draw intelligent inferences from general document-layout principles.)
What data should I extract from it?
So now your machine knows the document type. This means that you know what kind of data the document might contain, and have a rough idea of where on the page a particular piece of data might appear. Assume you've done your homework, and trained your machine to recognize numerous variations on a document -- six different types of W-2, for example -- even if the machine handled a good deal of that work on its own (e.g., by noticing the pre-defined key-phrase 'EIN'). What's the next step?
If you've defined your document well, you can now map the input page to possible outputs. But to make your document-analysis system really useful, you need to build more filters into the system itself. You know what data you really want, and you know the form contains that data; but you need to tell your system precisely what data to extract (and then where that data should go).
Moreover, when your documents are semi-structured, your map from document-space to data-space isn't enough. Invoices, for example, usually assign some kind of visual marker to order numbers, item quantities, per-item prices, total order price, etc. Semi-structured documents, in other words, usually contain the same data, often with the same or similar markers; but this data is arranged very differently. Starting your mapping from the data-side as well as the document-side will help especially in these cases.
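Starting from the data side can be sketched as follows: you declare the fields you want, list the label variants different vendors use for each, and let the extractor search every line for any of them. This is a minimal Python illustration only; the field names, the label variants in `FIELD_LABELS`, and both helper functions are hypothetical.

```python
import re

# Hypothetical target schema: field name -> label variants seen across vendors.
FIELD_LABELS = {
    "order_number": ["order no", "order number", "po #"],
    "total": ["total due", "amount due", "total"],
}

# A value is assumed to be a run of letters, digits, and . , - characters.
VALUE = re.compile(r"[A-Za-z0-9][\w.,-]*")


def find_value(line, labels):
    """Return the value following the first matching label on this line, or None."""
    # Try longer labels first so that 'total due' wins over 'total'.
    for label in sorted(labels, key=len, reverse=True):
        # Lookarounds keep 'total' from matching inside 'Subtotal'.
        found = re.search(r"(?<!\w)" + re.escape(label) + r"(?!\w)", line, re.IGNORECASE)
        if found:
            value = VALUE.search(line[found.end():])
            if value:
                return value.group()
    return None


def extract_fields(lines, field_labels=FIELD_LABELS):
    """Map each wanted field to the first value found for any of its labels."""
    result = {}
    for field, labels in field_labels.items():
        for line in lines:
            value = find_value(line, labels)
            if value is not None:
                result[field] = value
                break
    return result


vendor_a = ["ACME Corp", "PO # A-1042", "Subtotal: 100.00", "Total due: 118.40"]
vendor_b = ["Amount Due 75.00", "Order Number 77812"]
```

Two invoices with different labels and a different field order both map onto the same two output fields, which is the point of modeling the data side first.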
How do I find this data?
This is particularly hard when the document is semi-structured. If your document is fully structured, then your pre-definition assigns variable space adequately. But if your document is only semi-structured, your system will need to infer from relations between data-point and possible variable-label -- using a page grammar (with techniques derived from generative linguistics).
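A full page grammar is well beyond a short example, but the core move of inferring a value from its spatial relation to a label can be sketched. In this simplified Python illustration the `Box` type, the two relations, their tolerance, and the order of preference are all assumptions, not a description of any particular engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A text fragment with a (hypothetical) top-left page position."""
    text: str
    x: float
    y: float


def right_of(label, box, tolerance=5):
    """True if box sits to the right of label on roughly the same line."""
    return box.x > label.x and abs(box.y - label.y) <= tolerance


def below(label, box, tolerance=5):
    """True if box sits under label in roughly the same column."""
    return box.y > label.y and abs(box.x - label.x) <= tolerance


# Relations are tried in order of preference: same line first, then same column.
RELATIONS = [right_of, below]


def value_for(label_text, boxes):
    """Return the text of the nearest box standing in a known relation to the label."""
    label = next((b for b in boxes if b.text == label_text), None)
    if label is None:
        return None
    for relation in RELATIONS:
        candidates = [b for b in boxes if b is not label and relation(label, b)]
        if candidates:
            nearest = min(candidates, key=lambda b: abs(b.x - label.x) + abs(b.y - label.y))
            return nearest.text
    return None


layout_a = [Box("Total", 10, 100), Box("118.40", 80, 101)]
layout_b = [Box("Total", 10, 100), Box("118.40", 12, 120), Box("Date", 200, 50)]
```

The same rule set handles a layout that puts the value beside its label and one that puts it underneath, without a separate template for each.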
Further, your system should be able to receive both structured and semi-structured document inputs, and know whether to interpret the structure via rigidly-defined layout or fuzzier page-grammatical inference. Vast-volumed and vastly-complex use-cases like CompuPacific's are larger than average, but even a sole proprietor will quickly appreciate how annoying it would be to distinguish different 'levels of structuredness' manually during document-input.
But what if your documents' organization demands more than one interpretive step? For example, consider an invoice with two account numbers -- one for the vendor and the other for the customer. The system then can't just find the nearest (or right-nearest, or down-nearest) integer-string to the character-string 'account*'. Rather, the conditional logic needs to branch one more time -- and this can't be done with purely visual mapping.
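One way to picture that extra branch is shown below: the rule first decides which party a section of the invoice belongs to, and only then attaches an account number to it. This is a hedged Python sketch working on plain text lines; the section cues ('vendor', 'remit to', 'customer', 'bill to'), the `ACCOUNT` pattern, and the `account_numbers` function are all hypothetical.

```python
import re

# Assumed shape of an account line: 'account', an optional 'no.'/'number'/'#',
# an optional colon, then a run of digits.
ACCOUNT = re.compile(r"account\s*(?:no\.?|number|#)?\s*:?\s*(\d+)", re.IGNORECASE)


def account_numbers(lines):
    """Assign each account number to 'vendor' or 'customer'.

    The party is decided by the most recent section cue seen at or before
    the line that holds the number.
    """
    section = None
    found = {}
    for line in lines:
        lowered = line.lower()
        if "vendor" in lowered or "remit to" in lowered:
            section = "vendor"
        elif "customer" in lowered or "bill to" in lowered:
            section = "customer"
        match = ACCOUNT.search(line)
        if match and section is not None:
            # Keep the first number found for each party.
            found.setdefault(section, match.group(1))
    return found


invoice_lines = [
    "Vendor: ACME Corp",
    "Account No. 555001",
    "Bill to: Jane Doe",
    "Customer account: 880042",
]
```

A purely visual 'nearest number to the word account' rule would have no way to tell these two values apart; the section state is what carries the additional condition.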
That's why one of the coolest features of ABBYY's FlexiCapture Engine (which I recently had a chance to examine) is its full-fledged document-analysis UI (FlexiLayout Studio). The tool's purpose: to template-ize classification schemata, at precisely the level of granularity required (and no more).
The simplest schemata can be produced in three steps:
- Load a selection of documents with different layouts.
- Define some generic elements that allow documents to be identified and that can be used for orientation within a single document (e.g. text strings, lines, spaces between elements).
- Define search elements for the data you're trying to extract (e.g. text, numbers, date, tables, string length, character set, one or multiple words, one or multiple lines).
Each element in step 3 should also be related to the generic elements identified in step 2 (e.g. 'right' or 'below').
But sometimes more sophisticated logic is required, and in these cases a scripting language becomes necessary. From the SDK documentation:
This language allows you to specify element properties, relations between elements, and set additional constraints for formulated hypotheses.
Elements, properties, and relations mean just what you might guess ('static text', 'max length', 'below', etc.). More interestingly, the language lets you:
- script search logic
- specify search area (both rigidly and fuzzily)
- find nearest element (using pre-defined 1-2 parameter functions)
- set 'penalty coefficients' to value or devalue a particular hypothesis about correspondence between elements
The basic idea is that you can exert control over engine-generated hypothesizing both before and after element-searches are performed.
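The notion of penalizing hypotheses can be illustrated in ordinary Python. To be clear, this is not the FlexiLayout language or ABBYY's API: the `Hypothesis` type, the two penalty coefficients, and the scoring formula are invented here purely to show the general idea of valuing or devaluing candidate matches before choosing one.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One candidate answer to 'which element holds the value I want?'."""
    value: str
    distance: float       # distance from the label, in page units
    in_search_area: bool  # whether the candidate lies in the expected region
    matches_format: bool  # whether the text passed format validation


# Hypothetical penalty coefficients; each one scales a hypothesis's quality down.
OUTSIDE_AREA_PENALTY = 0.5
BAD_FORMAT_PENALTY = 0.2


def quality(h, max_distance=100.0):
    """Score a hypothesis between 0 and 1: closer is better, violations are penalized."""
    score = max(0.0, 1.0 - h.distance / max_distance)
    if not h.in_search_area:
        score *= OUTSIDE_AREA_PENALTY
    if not h.matches_format:
        score *= BAD_FORMAT_PENALTY
    return score


def best_hypothesis(hypotheses, threshold=0.3):
    """Return the highest-quality hypothesis, or None if none clears the threshold."""
    best = max(hypotheses, key=quality, default=None)
    if best is None or quality(best) < threshold:
        return None
    return best


candidates = [
    Hypothesis("12-3456789", distance=40, in_search_area=True, matches_format=True),
    Hypothesis("Dept 12", distance=10, in_search_area=True, matches_format=False),
    Hypothesis("98-7654321", distance=20, in_search_area=False, matches_format=True),
]
```

Here the closest candidate loses because it fails format validation, and the second-closest loses because it lies outside the expected area, so the well-formed value inside the search area wins despite being farthest from the label.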
The value of a scripting language at the 'how do I find this data' stage should be pretty clear to any developer.
General vs. domain-specific document analysis
The basic differences between the two kinds of document analysis represented by FineReader (general) and FlexiCapture (domain-specific) are summarized here. Both engines are free for trial (FineReader, FlexiCapture) and extensively documented. The tools are easy enough to use that you don't need to know much about the basic principles of intelligent document analysis; but reading a little background material can't hurt, and can often help you take full advantage of these tools' power.
Best of luck turning your paper documents into useful, business-intelligent information!