Vetting the documents available for automation

The first vetting step defined a target: some valuable business function with a well-defined set of inputs. This second vetting step looks at the raw materials that provide those inputs: our documents.

Our goal here is twofold:

  1. Investigate whether each data field really does exist inside the available documents, not somewhere else like a database or external source
  2. Investigate the breadth of expression used to describe the data in the documents

For each field you listed in Step 1, grab a new piece of paper and make a chart that looks something like this. Think of this as a Pinterest mood board for your data. You're going to fill it with examples you find.

Diagram

Gather a pile of input documents that you believe contain the data for this field. Go through them and manually find the data yourself. Don't worry about formally labeling data yet. Right now you're just clipping examples to put on your mood board. Your goal isn't to find great examples but rather good illustrations of the full variance you see.

Arrange the data clippings you find on the two axes you drew. This will start developing — and documenting! — a sense of what data in the real world is available to service this required input to your business process.

Diagram

Here's how to organize your data clippings along the two axes:

  • Example Simplicity
    "Simple" is your estimate of how easy a job you think the computer will have extracting the data. There are no wrong answers here: the important thing is that you create a spectrum. For a currency, "$65,000 USD" might be the simplest, "one thousand dollars per month" might be messier, and "no less than the maximum fine as prescribed by state law" might be the most complicated. Or perhaps some values are typed in a field (simple), whereas others are handwritten (messier), and still others are crossed out and corrected in the margin (messiest).
  • Example Frequency
    Estimate how often each type of data instance occurs. Messy examples are a lot less concerning if they are rare, so you'll want these estimates when deciding whether to proceed. Don't worry yet about sample sizes large enough to please a statistician: your goal here is to make ballpark estimates, based on what you actually see in the data rather than what you hope is there.
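
If you'd rather keep your mood board in a file than on paper, the two axes can be sketched in a few lines of Python. This is only an illustration: the `Clipping` class, the 1-to-5 simplicity scale, and the counts are all made up for the example, so substitute whatever you actually observe in your documents.

```python
from dataclasses import dataclass


@dataclass
class Clipping:
    text: str        # the example as it appears in the document
    simplicity: int  # hypothetical scale: 1 = simplest ... 5 = messiest
    count: int       # how many times you saw this kind of example


def frequency_shares(clippings):
    """Return each clipping's share of all examples observed."""
    total = sum(c.count for c in clippings)
    if total == 0:
        return {}
    return {c.text: c.count / total for c in clippings}


# Illustrative counts only; replace with tallies from your own pile.
board = [
    Clipping("$65,000 USD", simplicity=1, count=14),
    Clipping("one thousand dollars per month", simplicity=3, count=5),
    Clipping("no less than the maximum fine as prescribed by state law",
             simplicity=5, count=1),
]

shares = frequency_shares(board)

# Print the board ordered from simplest to messiest.
for c in sorted(board, key=lambda c: c.simplicity):
    print(f"simplicity {c.simplicity}  {shares[c.text]:.0%}  {c.text}")
```

Sorting by simplicity and printing the share next to each example gives you the same picture as the paper chart: which kinds of expression exist, and how much of the pile each one accounts for.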

Finally, go back and revise the data type of your field. Originally you might have listed it as a "Currency Amount," but in light of your examples it's clear that this is the idealized type, not the actual type. The actual type might vary: it could be stated directly or indirectly, or it could be a checkbox or a selection circled in pen, and so on.

What you now know

When you're done, you have a document that answers several critical questions about your automation project:

  • Were you able to find the information you needed in each document? If not, document automation alone might not automate this business process.
  • Is the information recorded in the way you thought it was? If not, more data review will be useful before pursuing automation.
  • Is the information recorded in a consistent way? In a handful of consistent ways? With no consistency whatsoever? This is valuable information to provide your engineering team.
  • Is there a clear "80% case" for each field, or will an 80% solution require addressing a long tail of different situations? This will help your product management team decide how to plan for inevitable errors that will occur in production.
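
One rough way to check for an "80% case" is to tally how often each kind of expression appeared and count how many kinds you need before you cover 80% of what you saw. The sketch below assumes you have such a tally; the category names and counts are hypothetical, and `types_needed_for_coverage` is just a helper name invented for this example.

```python
def types_needed_for_coverage(counts, target=0.80):
    """Return the most common expression types, in order, until their
    combined share of all observed examples reaches the target."""
    total = sum(counts.values())
    if total == 0:
        return []
    covered = 0
    chosen = []
    # Walk from the most frequent type to the least frequent.
    for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(name)
        covered += n
        if covered >= target * total:
            break
    return chosen


# Illustrative tally for one field; use your own numbers.
counts = {
    "typed amount": 14,
    "amount in words": 5,
    "reference to statute": 1,
}

needed = types_needed_for_coverage(counts)
print(f"{len(needed)} of {len(counts)} types reach 80%: {needed}")
```

If one type gets you there, you have a clear 80% case. If you need most of the list, you are looking at a long tail.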

Armed with this data you can begin making estimates about your overall strategy and return on investment before even bringing AI into the mix. It may be that something is a clear home run — or a clear dud. Or you may now know to anticipate a bifurcated solution in which computers handle certain common cases while humans handle the rest.
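
To get a feel for what a bifurcated solution might look like, you can split your examples at some simplicity cutoff and see how much of the work lands on each side. This is a back-of-the-envelope sketch, not a forecast: the cutoff, the simplicity scores, and the counts below are assumptions made up for illustration.

```python
def split_workload(examples, max_auto_simplicity):
    """Split examples into a computer share and a human share.

    examples: list of (simplicity, count) pairs, where a lower
    simplicity score means an easier example.
    """
    total = sum(count for _, count in examples)
    if total == 0:
        return 0.0, 0.0
    auto = sum(count for s, count in examples if s <= max_auto_simplicity)
    return auto / total, (total - auto) / total


# Illustrative (simplicity, count) pairs; replace with your own.
examples = [(1, 14), (3, 5), (5, 1)]

# Assume, for the sake of the estimate, that only the simplest
# examples are handled automatically.
auto_share, human_share = split_workload(examples, max_auto_simplicity=1)
print(f"computer: {auto_share:.0%}, human: {human_share:.0%}")
```

Moving the cutoff up or down shows how much human review you would save by tackling the messier cases, which feeds directly into the return-on-investment discussion.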

This will also become a critical document when discussing solutions with your engineering team or vendors. With it in hand, vendors can give you fairly high-quality feasibility estimates even without seeing the real dataset.