Extraction Pipelines
Maybe your document extraction needs are served by a simple API call: perfect, well-defined data goes in and extraction results come out. But many companies will find themselves in a situation where things aren't so simple.
- The incoming data has a lot of variety. Maybe you've just got a lot of it, or maybe it's user-submitted, and you can't be sure what they'll send.
- The incoming data isn't in the format you want. Maybe it's cell phone photos, PDFs, or the results of a web scraper.
- You're routing data between many different extraction jobs. Sometimes even a single business task involves extracting data from many different documents.
You thought document extraction was all about AI models, but in most cases, before you even get to that point, you have to think about something else: the pipeline. This chapter provides an overview of extraction pipelines, walks you through the steps you might need in yours, and teaches you the questions you should be asking your team.
Grandparents clipping newspapers, a million times a second
Imagine your grandparents are on a screened-in porch in Florida drinking iced tea and clipping newspaper articles to send you. Articles about Elon Musk go in one envelope. Articles about the 49ers in another envelope. Grandma highlights sentences she thinks are important and writes a note in the margin.
Suspecting there’s a business in sending these newspaper clippings, your grandparents scale up. They rent a bingo hall and set up a full-scale operation.
- The Mailroom flips through incoming magazines and newspapers, pulling out each section: sports, business, and so on.
- The Clippers specialize in sections. They find articles worth clipping and cut them out with scissors.
- The Highlighters give each clipped article a deep read, highlighting sentences and writing grandmotherly comments in the margins.
At its heart, this newspaper clipping operation is 80% of a modern-day document extraction team. (If you find these grandparents, you should hire them immediately.) Document extraction teams would divide this newspaper operation into two main pieces:
- The Pipeline is the bingo hall operation: the mailroom and the clippers. It sorts through document bundles, clips out articles of interest, and routes them to the deep readers.
- The Model is the deep reading team. It highlights sentences of note and writes extra notes in the margins.
While exciting articles in the tech press tend to focus on the model, in practice the pipeline is just as important. A great model fed by an awful pipeline makes for a worthless overall system.
The structure of an extraction pipeline
It's useful to think of an extraction pipeline as having different steps that (1) prepare and route the document so that it's ready for the right model, and then (2) prepare and package the model results so that they're ready for use.
There's no one common pipeline design. Every company's pipeline will look a bit different depending on its own needs. But the picture below contains steps that you might encounter in yours.
These steps are:
- Format, in which you convert documents to the necessary medium for processing
- Segment, in which you extract just the piece(s) of each document you intend to process
- Route, in which you match segments of the document to the models appropriate for them
- Model, in which you perform extraction
- QA, in which you give humans a chance to check the model's work
- Package, in which you bundle the results of the pipeline for consumption and archival
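To make the six steps above concrete, here is a minimal Python sketch of how they might chain together. Everything in it is an illustrative assumption rather than a real library or a recommended design: the names (`Segment`, `format_document`, `headline_model`, `run_pipeline`), the toy "label: text" document layout, and the stand-in "model" that just grabs the first sentence. A real pipeline would swap each function for something far more capable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Segment:
    kind: str  # hypothetical section label, e.g. "sports" or "business"
    text: str


@dataclass
class Extraction:
    segment: Segment
    fields: Dict[str, str]
    approved: bool = False


def format_document(raw: bytes) -> str:
    """Format: convert incoming bytes to the medium we process (plain text here)."""
    return raw.decode("utf-8", errors="replace")


def segment_document(text: str) -> List[Segment]:
    """Segment: keep only the pieces we intend to process.

    Assumes a toy layout: blocks separated by blank lines, each starting
    with a 'label: ' prefix. Blocks without a label are discarded.
    """
    segments = []
    for block in text.split("\n\n"):
        kind, sep, body = block.partition(": ")
        if sep:
            segments.append(Segment(kind.strip().lower(), body.strip()))
    return segments


def headline_model(segment: Segment) -> Dict[str, str]:
    """Model: a stand-in extractor that treats the first sentence as the headline."""
    return {"headline": segment.text.split(".")[0].strip()}


def route(segment: Segment, models: Dict[str, Callable]) -> Optional[Callable]:
    """Route: match a segment to the model registered for its kind, if any."""
    return models.get(segment.kind)


def package(results: List[Extraction]) -> List[dict]:
    """Package: bundle results into plain dicts for consumption and archival."""
    return [
        {"kind": r.segment.kind, "fields": r.fields, "approved": r.approved}
        for r in results
    ]


def run_pipeline(
    raw: bytes,
    models: Dict[str, Callable],
    reviewer: Callable[[Extraction], bool],
) -> List[dict]:
    results = []
    for segment in segment_document(format_document(raw)):
        model = route(segment, models)
        if model is None:
            continue  # no model wants this segment, so drop it
        extraction = Extraction(segment, model(segment))
        extraction.approved = reviewer(extraction)  # QA: a human (or proxy) signs off
        results.append(extraction)
    return package(results)
```

In this sketch the reviewer is just a callback; in practice the QA step is usually a queue that people work through, which is one reason it deserves its own place in the design rather than being folded into the model step.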
Read on for a breakdown of each of these steps and the questions that will help you think clearly about it if you encounter it in your own business.