Supporting complex outputs and domain constraints

This book has focused on extracting information from documents "as-is," but in practice two other critical tasks make that information useful:

  1. Interpreting extracted information
    e.g. knowing that "one thousand dollars" is the same as "$1,000 USD"
  2. Grouping extracted information into complex outputs
    e.g. knowing that a street address, city, state, and zip code together form an Address

This second need, forming complex outputs, is a necessary step: it rebuilds an information model atop the raw extracted values so they can be used together. Yet much of today's NLU stack leaves this rebuilding process entirely unaddressed.
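As a minimal sketch of what forming a complex output can look like in code, the snippet below regroups flat extracted values into a single Address object. The Address type, the field names, and the build_address helper are hypothetical illustrations, not part of any particular extraction library.

```python
from dataclasses import dataclass

@dataclass
class Address:
    """A complex output assembled from individually extracted values."""
    street: str
    city: str
    state: str
    zip_code: str

def build_address(fields: dict) -> Address:
    """Group flat extracted values back into a single Address object.

    `fields` is assumed to map field labels produced by some upstream
    extraction step to their raw string values.
    """
    return Address(
        street=fields["street"],
        city=fields["city"],
        state=fields["state"],
        zip_code=fields["zip"],
    )

# Example: raw values extracted from a document, regrouped into one object.
raw = {"street": "123 Main St", "city": "Springfield", "state": "IL", "zip": "62704"}
print(build_address(raw))
```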

Consider the information extracted from a hypothetical shipping manifest:

[Diagram: fields extracted from a hypothetical shipping manifest, including addresses, phone numbers, and IDs for an American and a French recipient]

One way to extract data like this is to use clues from layout and text labels. But that approach misses much of the domain knowledge humans would use to extract these fields and assemble them into the right complex objects.

As humans, we know certain things about how object properties relate to each other. In these addresses:

  • The US has a country code of +1 and France has a country code of +33. We intuitively group the phone number starting with +1 with the American recipient.
  • French national IDs are 13 digits and US Social Security Numbers are 9 digits. We intuitively know that the ID that looks like a Social Security Number belongs to the American recipient.

Knowledge such as this is called "domain constraints." Applying domain constraints is a powerful technique in rule-based systems, both to find complex objects with internally consistent properties and to select the best value among candidates for any one particular property.
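As a rough illustration, here is a rule-based sketch that applies the two constraints above to attach candidate phone numbers and IDs to the right recipient. The constraint tables, recipient names, and helper functions are hypothetical and simplified, not drawn from any real extraction system.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical constraint tables: a phone number must start with the
# recipient's country calling code, and a national ID must match the
# format used in that country (9-digit SSN in the US, 13-digit ID in France).
PHONE_PREFIX = {"US": "+1", "FR": "+33"}
ID_PATTERN = {"US": re.compile(r"^\d{9}$"), "FR": re.compile(r"^\d{13}$")}

@dataclass
class Recipient:
    name: str
    country: str                      # country code extracted earlier
    phone: Optional[str] = None
    national_id: Optional[str] = None

def phone_fits(recipient: Recipient, phone: str) -> bool:
    return phone.startswith(PHONE_PREFIX[recipient.country])

def id_fits(recipient: Recipient, candidate: str) -> bool:
    return bool(ID_PATTERN[recipient.country].match(candidate))

def assign(recipients: list, phones: list, ids: list) -> None:
    """Greedily attach each candidate value to the first recipient whose
    domain constraints it satisfies and whose slot is still empty."""
    for phone in phones:
        for r in recipients:
            if r.phone is None and phone_fits(r, phone):
                r.phone = phone
                break
    for candidate in ids:
        for r in recipients:
            if r.national_id is None and id_fits(r, candidate):
                r.national_id = candidate
                break

recipients = [Recipient("Alice", "US"), Recipient("Bernard", "FR")]
assign(recipients,
       phones=["+33 6 12 34 56 78", "+1 555 0100"],
       ids=["1234567890123", "123456789"])
for r in recipients:
    print(r)
```

A production system would typically score candidates against all constraints and pick the highest-scoring assignment rather than taking the first match, but this greedy version is enough to show how domain knowledge drives the grouping.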

Document extraction systems will likely begin to incorporate these kinds of constraints. Models will learn them, and extraction IDEs will let humans view and manipulate them. Teams will then be able to guide the behavior of models not just at the atomic level of extraction, but with meaningful knowledge about the domains related to the documents being processed.