Simulating Extraction

You've now got the outputs of the last two steps in hand.

Let's simulate the extraction itself so that we're not just cherry-picking individual examples. We want to force ourselves to test whether this information really is enough, and to learn how much post-processing will be needed to interpret it. If the last two steps were a horizontal survey of the use case, then this step is a vertical survey.

Select as many documents as you can; a good sweet spot is between five and ten. More is always better, but it's also important not to put this off for later. Five to ten documents is few enough to get through in an afternoon but numerous enough for you to start forming an opinion.

Create a spreadsheet. Your first column is the document's filename and the remaining columns are the fields you need to collect.
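If you prefer to start from a file rather than a blank spreadsheet, a minimal sketch like the following generates the template described above. The field names here are hypothetical placeholders; substitute the fields from your own use case.

```python
import csv

# Hypothetical field names -- replace these with the fields you need to collect.
FIELDS = ["invoice_number", "issue_date", "vendor_name", "total_amount"]

with open("extraction_groundtruth.csv", "w", newline="") as f:
    writer = csv.writer(f)
    # First column is the document's filename; the rest are your fields.
    writer.writerow(["filename"] + FIELDS)
```

Open the resulting CSV in any spreadsheet tool and fill in one row per document.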

Go through the documents and find the values yourself, copying the data exactly as it appears in each document.

  • If there is a typo, don't fix it
  • If there is a strange data format, don't convert it
  • If a piece of the data spans two pages and, as a result, crosses a page footer or header, include the page footer and header in the value
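If you have the raw text of a document available (e.g. from a PDF-to-text tool), a quick sanity check can catch accidental "fixes": every value you typed should appear verbatim in the raw text. This sketch and its sample data are hypothetical; the field names are not from any real schema.

```python
def find_non_verbatim(raw_text: str, row: dict) -> list:
    """Return the field names whose values don't appear verbatim in raw_text."""
    return [field for field, value in row.items()
            if value and value not in raw_text]

# Hypothetical example: the document really contains the typo "Recieved"
# and a European-style amount, so those must be copied exactly as-is.
raw = "Invoice 0042\nRecieved by J. Smith on 31/2/2024\nTotal: 1.234,56 EUR"
row = {"received_by": "J. Smith", "date": "31/2/2024", "total": "1,234.56"}
print(find_non_verbatim(raw, row))  # ['total'] -- the amount was reformatted
```

A non-empty result means you normalized something while copying and should go back to the document.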

Your goal isn't to extract perfect data. Quite the opposite: your goal is to understand what the data really looks like from the computer's perspective, with no extra processing applied to it.

You're going to be tempted to quit after three files. This is boring work, and after three files you've probably spent at least an hour on it. "Shouldn't my tech team handle this?" Trust me: they're going to be doing far more of this sort of work, and you can make it all go more smoothly by starting off with the examples you're creating now. Keep going. And think about all the boring work this automation project will save you down the road.