Quality Assurance and Feedback Loops

The final major pipeline component is where humans have the opportunity to comment on and correct the model's performance. You may think this isn't necessary if your model performs well on initial tests. That assumption is dangerous: "model drift" is a well-known phenomenon in which a model's performance decays over time as subtle environmental factors change.

Regardless of how manual or automated your human review process is, here are some of the questions you will need to ask.

When should you look at the data?

The three meaningful points in time are:

  • Before anything uses the data
    Humans become a real-time piece of the pipeline, having the opportunity to review some or all data before it continues on to whatever process consumes it.
  • Post-hoc, within a window of easy correction
    A comparison here would be the owner of a restaurant comparing dining tickets to credit receipts at the end of a long day. The credit network already has the charges, but they won't be fully processed for another day or two, leaving a window within which any errors can be silently corrected.
  • Post-hoc, as an audit
    At whatever time scale meets your needs, spot-check records from the last tick of the clock.

What's important to note here is that "straight-through processing," the holy grail of automation, can still be achieved with humans in the loop, using this second style. Computers can have the license to act fast, while humans review the books at intervals regular enough to reverse any bad decisions after the fact. Large enterprise teams will inevitably contain one stakeholder insisting on straight-through processing and another stakeholder insisting that automation is too risky. This is a good compromise to explore.
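
To make the three timings concrete, here is a minimal routing sketch in Python. The queue names, the Decision shape, and the audit rate are illustrative assumptions, not features of any particular platform.

```python
import queue
import random
from dataclasses import dataclass
from enum import Enum, auto

class ReviewTiming(Enum):
    PRE_COMMIT = auto()         # humans review before anything uses the data
    CORRECTION_WINDOW = auto()  # act immediately, review before final settlement
    AUDIT = auto()              # act immediately, spot-check a sample later

@dataclass
class Decision:
    record_id: str
    payload: dict

# Hypothetical queues standing in for the systems around the model.
review_queue = queue.Queue()
reconciliation_queue = queue.Queue()
audit_queue = queue.Queue()
downstream = queue.Queue()

def route(decision: Decision, timing: ReviewTiming, audit_rate: float = 0.05) -> None:
    """Route one model decision according to the chosen review timing."""
    if timing is ReviewTiming.PRE_COMMIT:
        review_queue.put(decision)          # consumers wait until a human clears it
    elif timing is ReviewTiming.CORRECTION_WINDOW:
        downstream.put(decision)            # straight-through processing...
        reconciliation_queue.put(decision)  # ...with a window to quietly reverse errors
    else:
        downstream.put(decision)
        if random.random() < audit_rate:
            audit_queue.put(decision)       # spot-check after the fact
```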

What data should you look at?

There are many ways you can select what data gets reviewed, and you can mix and match them at will. This is another topic in which stakeholders concerned with different forms of risk can find much to agree on when they realize how much flexibility of process they have.

  • Full Coverage
    Don't jump to the conclusion that having to review all data makes automation "not worth it." The effort to check a decision is often dramatically less than the effort to make the decision in the first place. Even with humans checking 100% of an automated pipeline's decisions, the cost savings of automation can still be immense.
  • Random Sampling
    A good idea to make sure no stone is left unturned. Randomly sample data points, or blocks of time, and manually review all data in that sample.
  • Biased Sampling
    A good idea to test for systemic failures. Select all data points with a known trait (documents from the Singapore office, high-income account openings, etc.) and review everything in that sample. If you identify a major systemic failing of your model, you can continue sampling inputs from that category to (1) resolve the issue using humans, and (2) generate training data such that the issue can be resolved automatically in the future.
  • High-stakes Sampling
    A variant of biased sampling in which you define the properties of a data point too expensive to be wrong on, and always put human eyes on it before a downstream system can use the data. Keep in mind that this introduces a structural bias in your system: you would be committing to more errors on what you define as "low-stakes" data points.
  • Validation Sampling
    Define validation rules for the data exiting your model and send any data points that violate them to a human review process. See the section below on writing good validation rules.
  • Confidence-based Sampling (aka Active Learning)
    Many models can output a confidence value, and your platform should automatically route low confidence results to a human-review queue. This will give humans an understanding of where the model perceives itself to be confused, which can inform error handling. The best platforms will also let humans correct or verify answers on the spot and incorporate that data into model retraining. This creates an active learning feedback loop in which the model is always guiding humans to provide more data for the areas in which it struggles the most.
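
Here is a minimal sketch of that confidence-based routing, assuming the model exposes a per-prediction confidence score. The threshold, queue, and commit hand-off are placeholders for whatever your platform actually provides.

```python
import queue

CONFIDENCE_THRESHOLD = 0.85   # illustrative cutoff; tune it to your model and error costs

review_queue = queue.Queue()  # predictions awaiting human verification or correction
training_examples = []        # human-verified pairs accumulated for the next retraining run

def commit(record, label):
    """Placeholder for whatever downstream system consumes the finished data point."""
    pass

def handle_prediction(record, label, confidence):
    """Route low-confidence predictions to humans; let the rest flow straight through."""
    if confidence < CONFIDENCE_THRESHOLD:
        review_queue.put((record, label, confidence))
    else:
        commit(record, label)

def apply_human_correction(record, corrected_label):
    """A human verdict both fixes the record and grows the training set."""
    commit(record, corrected_label)
    training_examples.append((record, corrected_label))  # closes the active learning loop
```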

What kinds of validation rules should you write?

Validation rules are so important, they deserve a religion. If you talk to any serious engineer, they will tell you that at least half of any real production code is playing defense: testing for things that might have gone wrong and handling them. The beauty of computers is that you can do thousands of these tests on every data point, every time. For all the fears that computers will make rookie mistakes compared to their human counterparts, well-written validation rules can go a long way toward catching them.

If validation rules had a religion, its adherents would chant the mantra: "It is always a good idea to add another validation rule." Why? Because it is always a good idea to add another validation rule. As long as it's a good one.

How do you write good validation rules?

  1. You check the data from many different angles. Errors can be made in subtle ways, sometimes ways that appear right. So you need to come at the validations from different angles in case an error is subtle enough to slip by some of them.
  2. You remember that the point of a validation isn't to mark an item right or wrong; it's to get a human involved. This gives you wiggle room: you can write validations that sometimes capture data that isn't wrong so that you get higher coverage over the data that is.

Here are some examples:

  • Syntactic Validation
    Is the data of a well-formed type? Does the currency value have letters in it? Does the name value have a dollar sign in it? Is the social security number the correct number of digits?
  • Semantic Validation
    Is the data a coherent instance of its type? Is the amount field of a paycheck negative? Is the age field of an employee record greater than two hundred?
  • Outlier Validation
    Is the data a likely instance of its type? Does the employee's trip receipt contain seven thousand different items? Is someone's hourly wage thousands of dollars?
  • Internal Consistency
    Can values in the data item be recomputed and validated using other values also in that data item? For example, an employee paystub might contain gross pay, tax rate, and tax withholding. The first two can validate the internal consistency of the third.
  • External Consistency
    Can values in the data item be validated against third-party data? For example, an employee ID number could be validated as existing in the employee database.
  • Layout Validation
    If your system reports where in a document a value was extracted, the regularity of some document types will allow you to use that location as a validation feature. For example, the total amount field in an invoice might always appear below the itemized breakdown.
  • Design Validation
    If your system reports features such as font size, font weight, and logo, you can use these as basic authenticity validators. If the total amount field in your company invoice is always in boldface, a validation fault should be thrown if it ever isn't.
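
To make a few of these categories concrete, here is a minimal sketch of validation rules for a hypothetical paystub record. The field names, formats, and thresholds are illustrative, not a standard schema; the point is checking the same record from several angles and returning reasons to involve a human.

```python
import re

def validate_paystub(stub: dict) -> list[str]:
    """Return the reasons, if any, to send this paystub to human review."""
    problems = []

    # Syntactic: is each field a well-formed instance of its type?
    if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", stub.get("ssn", "")):
        problems.append("SSN is not in the expected nine-digit format")

    # Semantic: is the value coherent for its type?
    if stub.get("gross_pay", 0) < 0:
        problems.append("gross pay is negative")

    # Outlier: is the value plausible, even if technically coherent?
    if stub.get("hourly_rate", 0) > 1_000:
        problems.append("hourly rate is implausibly high")

    # Internal consistency: recompute one field from the others.
    expected_withholding = stub.get("gross_pay", 0) * stub.get("tax_rate", 0)
    if abs(stub.get("tax_withholding", 0) - expected_withholding) > 1.00:
        problems.append("tax withholding does not match gross pay times tax rate")

    return problems

# Any non-empty result routes the record to the human review queue.
print(validate_paystub({"ssn": "123-45-6789", "gross_pay": 2400.0,
                        "tax_rate": 0.20, "tax_withholding": 480.0,
                        "hourly_rate": 60.0}))   # -> []
```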

Writing validation rules is a bore. But every rule your team writes is a rule that pays you back on every data point, every time. Remember the mantra: "it is always a good idea to add another validation rule."

How should you benefit from human corrections?

Ideally, the software or processes you employ to perform human corrections will deliver six main benefits.

  • Confidence
    Of the humans working together with computers toward a common goal
  • Predictability
    Of how often your system will be right and wrong
  • Dashboarding
    To understand how the model is performing with respect to human judgement
  • Data Correction
    Of errors made by the pipeline or model
  • Audit Logs
    To understand the model's performance drift over time and inspect special cases
  • Re-training
    To add examples to your model's training set so that future iterations become more performant
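
As a sketch of how these outcomes connect, the hypothetical record below captures one human verdict on one model output. Stored as written it is an audit log entry, aggregated it feeds a dashboard, and the corrected values double as retraining examples. The class and field names are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CorrectionRecord:
    """One human verdict on one model output."""
    record_id: str
    model_output: dict
    human_output: dict
    model_confidence: float
    reviewed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def model_was_right(self) -> bool:
        return self.model_output == self.human_output

def agreement_rate(corrections: list[CorrectionRecord]) -> float:
    """Dashboard metric: how often the model matched human judgement."""
    if not corrections:
        return 0.0
    return sum(c.model_was_right for c in corrections) / len(corrections)
```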