What is my model capable of learning?

What can your document extraction model learn? A useful way to approach this question is to define the space of document inputs that extraction systems usually get.

Let's describe the space of documents along two axes:

  • Content complexity
    How complicated is the information itself? Is it an exact set of forms or labels? Is it a roughly stable set of forms and labels? Any form?
  • Expression complexity
    How complicated is the expression of that information? Is it a rigid form? A set of forms of the same general type? Anything-goes natural language inside business letters?

In broad strokes, where your use case falls on these two axes determines which types of model are likely to be a good fit for you.

Documents that are low-complexity on both axes are a great fit for both rule systems and deep learning. In this low-complexity space, rules do a good job of "letting simple things stay simple," with positive side effects such as explainability and low computing requirements. Meanwhile, deep learning systems applied in this space let a single extraction approach learn from little input and scale to more complex documents.

Documents that are high-complexity on either axis are a good fit for deep learning (or other learned approaches), which can manage the amount of nuance an automated solution has to handle.
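The rule of thumb above can be written down as a small lookup. This is only a sketch: the `Complexity` levels and the `suggested_approaches` function are hypothetical names introduced for illustration, and treating medium complexity the same as high complexity is an assumption based on the discussion later in this chapter, not a hard rule.

```python
from enum import IntEnum
from typing import List


class Complexity(IntEnum):
    """Hypothetical three-level scale for either axis."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def suggested_approaches(content: Complexity, expression: Complexity) -> List[str]:
    """Return candidate model families for a use case.

    Low complexity on both axes: rules and learned models both fit.
    Anything above low on either axis: lean on learned models
    (assumption: medium is treated like high here).
    """
    if content == Complexity.LOW and expression == Complexity.LOW:
        return ["rules", "learned"]
    return ["learned"]


# Example: a single fixed invoice template vs. free-form contracts.
print(suggested_approaches(Complexity.LOW, Complexity.LOW))
print(suggested_approaches(Complexity.MEDIUM, Complexity.HIGH))
```

In practice the boundaries are fuzzy, so treat the output as a starting point for discussion rather than a verdict.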

This chapter shows you how to estimate where your use case lies on these complexity axes, using industry experience as a guide.

Estimating content complexity

Content complexity is how complex and regular the information you're trying to extract is. Low complexity corresponds roughly to "a single paper form that never changes" and the high end corresponds roughly to "all forms in the world about that topic."

At the low-complexity end are documents that always contain the same information. Every US driver's license has the same fields. Every invoice your company issues uses the same template. Whether a contract requires US dollars is always a yes-or-no answer.

In the medium range, you're targeting documents that have roughly the same information, but the details vary from case to case. Different countries store different fields on their driver's licenses. Different invoices may report different fees and taxes. Different contracts may involve payment in one or more currencies.

At the high range, you're targeting documents that share a roughly similar goal but face very little restriction on how they approach it: documents that identify people (as opposed to driver's licenses specifically), documents that track goods and services (as opposed to invoices specifically), or documents about money exchange (as opposed to documents known to contain payment terms).

Reading these descriptions, it's easy to see how content complexity can grow beyond a human's ability to write rules to handle it. At a certain point, relying on the computer's ability to learn becomes necessary.

Estimating expression complexity

Expression complexity is how complex the expression of content is within a document. Low complexity corresponds roughly to "a rigid form that never changes" and high complexity corresponds roughly to "a typed legal document."

At the low-complexity end are forms and templates: documents with a rigid visual structure in which the same information is always at a known place relative to the fixed layout or fields of the document. US driver's licenses, license plates, and printable paper forms are all in this category.
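To make "a known place relative to fixed fields" concrete, here is a minimal rule-based sketch. It assumes the form has already been converted to text and that each value sits on the same line as a fixed label. The labels (`Invoice No:`, `Total:`), the `FIELD_PATTERNS` table, and the `extract_fields` function are all invented for illustration.

```python
import re

# One rule per field: a fixed label followed by the value on the same line.
# The labels are hypothetical; a real template would define its own.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"^Invoice No:\s*(\S+)$", re.MULTILINE),
    "total": re.compile(r"^Total:\s*\$?([\d,]+\.\d{2})$", re.MULTILINE),
}


def extract_fields(text):
    """Apply each rule to the text; fields that are not found map to None."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


sample = "ACME Corp\nInvoice No: INV-0042\nDate: 2024-01-15\nTotal: $1,250.00\n"
print(extract_fields(sample))
```

Rules like these stay readable and cheap as long as the template never changes. Each new layout variant needs its own set of patterns, which is where this approach starts to strain.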

The medium range consists of two main cases:

  • Increased form diversity. Form or template "families" that have roughly the same information but vary in layout and fixed language. Examples would be driver's licenses across all countries, or IRS 1040 forms across all tax years.
  • Structured natural language. Natural language text that conforms to a rigid vocabulary, grammar, and purpose. Examples would be medical prescriptions, military unit orders, and the structured language stock traders use to discuss trades with each other.
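Structured natural language can often still be handled with a small grammar. The sketch below parses a toy order notation such as `BUY 100 ACME @ 12.50`. The notation, the `ORDER_PATTERN` regex, and the `parse_order` function are invented for illustration and are not a real trading protocol.

```python
import re
from typing import Dict, Optional

# Toy grammar: SIDE QUANTITY SYMBOL @ PRICE (an assumed notation, not a standard).
ORDER_PATTERN = re.compile(
    r"^(?P<side>BUY|SELL)\s+"
    r"(?P<quantity>\d+)\s+"
    r"(?P<symbol>[A-Z]{1,5})\s+@\s+"
    r"(?P<price>\d+(?:\.\d+)?)$"
)


def parse_order(line: str) -> Optional[Dict[str, object]]:
    """Parse one order line; return None if it does not fit the grammar."""
    match = ORDER_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "side": match["side"],
        "quantity": int(match["quantity"]),
        "symbol": match["symbol"],
        "price": float(match["price"]),
    }


print(parse_order("BUY 100 ACME @ 12.50"))
print(parse_order("please buy some ACME if the price looks good"))
```

The second call returns `None`: as soon as the author strays from the rigid vocabulary, a grammar like this stops matching, which is the boundary between the medium and high ranges.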

At the high range, you're targeting documents that tend to have information embedded inside natural language. These might be letters, contracts, reports, or emails. Where and how the information is expressed is entirely up to the person who authored the document.

As with the other axis, it's easy to see how learned approaches quickly become necessary as the set of rules required for medium and high complexity grows beyond what a human can manage.