Layout and Image Understanding

The explosion of deep learning work in the NLU space has so far focused mostly on text in its most basic form: unstyled characters and whitespace. But the documents that matter to business carry information in many visual forms.

In some cases, this visual information is an extraction target itself. Checkboxes, signature blocks, multiple choice questions, company logos, and identity photographs are all values companies need to extract along with text.

In many other cases, visual layout is a powerful clue about the semantics of the text presented on a page:

  • In a traditional business letter, the sender's address appears at the top right, while the recipient's address appears below it on the left. The fact that one address belongs to the sender and the other to the recipient is communicated entirely by layout convention.
  • On a page with multiple addresses, the fact that a zip code belongs to one address rather than another is conveyed entirely by layout: a set of address fields is grouped into a single address by their collective proximity.
  • On a government form with hundreds of fields, the association between label and value can be crystal clear with layout information but confusing without it. If field labels are printed above (not beside) the spaces for answers, an OCR engine will render the text unhelpfully as Label1 Label2 Label3 Label4 Value1 Value2 Value3 Value4, which is difficult to interpret without knowing that Value1 sat beneath Label1 (see the sketch after this list).
  • On a copy-protected identity card, background images and holograms can hinder an OCR engine's ability to read text. But with knowledge of the layout and visual design of that card, the OCR model can apply specialized processing to extract the text amid the visual noise.
  • In tables, layout semantics are the prime conveyor of the relations between different regions of text; a naive coordinate-based reconstruction is sketched below. Without a model that understands layout at a fundamental level, it is extremely difficult to reliably extract tabular data at scale.
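
To make the label/value problem concrete, here is a minimal sketch of how layout resolves it, assuming OCR output in the common form of (text, bounding box) pairs. The Word structure and the nearest-label-above heuristic are illustrative assumptions, not a production algorithm:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge of the bounding box
    y: float  # top edge of the bounding box

def pair_labels_with_values(labels, values):
    """Assign each value to the nearest label printed above it.

    Toy heuristic: a value belongs to the label whose box sits
    above it (smaller y) at the closest horizontal position.
    """
    pairs = {}
    for value in values:
        candidates = [lab for lab in labels if lab.y < value.y]
        best = min(candidates, key=lambda lab: abs(lab.x - value.x))
        pairs[best.text] = value.text
    return pairs

# Four labels printed in a row, with the answer spaces beneath them.
labels = [Word("Name", 0, 0), Word("DOB", 100, 0),
          Word("Phone", 200, 0), Word("Email", 300, 0)]
values = [Word("Ada", 0, 40), Word("1990-01-01", 100, 40),
          Word("555-0100", 200, 40), Word("ada@example.com", 300, 40)]

print(pair_labels_with_values(labels, values))
# {'Name': 'Ada', 'DOB': '1990-01-01', 'Phone': '555-0100',
#  'Email': 'ada@example.com'}
```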

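A first approximation of table structure can likewise be recovered from coordinates alone: cluster words into rows by vertical position, then order each row left to right. This sketch assumes clean, axis-aligned boxes and unwrapped cells; real tables (merged cells, wrapped text, ruled lines) demand far more:

```python
def words_to_rows(words, row_tolerance=10):
    """Group (text, x, y) word tuples into table rows.

    Words whose y-coordinates fall within `row_tolerance` of their
    neighbors are treated as one row; each row is then sorted left
    to right.
    """
    rows = []
    for word in sorted(words, key=lambda w: w[2]):  # top to bottom
        if rows and abs(rows[-1][-1][2] - word[2]) <= row_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w[0] for w in sorted(row, key=lambda w: w[1])] for row in rows]

ocr_words = [("Qty", 0, 0), ("Item", 80, 2), ("Price", 200, 1),
             ("2", 0, 30), ("Widget", 80, 31), ("9.99", 200, 29)]
print(words_to_rows(ocr_words))
# [['Qty', 'Item', 'Price'], ['2', 'Widget', '9.99']]
```
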
Expect to see NLP models "graduate" from processing text alone and begin processing text, layout, and image information at the same time. These joint models should dramatically improve accuracy while also being easier to retarget: the grammar of document layout is somewhat universal, so a pre-trained model that encodes it can reuse that knowledge across document types.
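
One concrete form such joint models take, in published work like the LayoutLM family, is to add embeddings of each token's bounding-box coordinates to its ordinary token embedding before the transformer layers. The sketch below illustrates that input construction only; the dimensions, vocabulary, and coordinate grid are invented for the example:

```python
import torch
import torch.nn as nn

class TextAndLayoutEmbedding(nn.Module):
    """Toy joint embedding: token identity plus 2D position.

    Illustrative only; sizes and vocabulary are invented.
    """
    def __init__(self, vocab_size=30000, hidden=128, max_coord=1000):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        # Separate embeddings for the box edges, summed into one vector.
        self.x_embed = nn.Embedding(max_coord, hidden)
        self.y_embed = nn.Embedding(max_coord, hidden)

    def forward(self, token_ids, boxes):
        # boxes: (batch, seq, 4) integer (x0, y0, x1, y1) on a 0-999 grid
        return (self.token(token_ids)
                + self.x_embed(boxes[..., 0]) + self.y_embed(boxes[..., 1])
                + self.x_embed(boxes[..., 2]) + self.y_embed(boxes[..., 3]))

ids = torch.tensor([[101, 2054, 102]])          # a 3-token sequence
boxes = torch.tensor([[[0, 0, 50, 20],
                       [60, 0, 140, 20],
                       [150, 0, 200, 20]]])
print(TextAndLayoutEmbedding()(ids, boxes).shape)  # torch.Size([1, 3, 128])
```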