Lower training data requirements

Prediction: The need for large training corpora will become a thing of the past for many problems because of pre-trained base models. A private market for licensing and fine-tuning pre-trained models will soon exist, and this will form the backbone of most applied AI development.

Deep Learning has the odd distinction of requiring both more training data than ever and also less training data than ever. Let me explain.

If you want to train a deep learning model from scratch, you'll usually need an extremely large dataset. But it turns out that the layer-based representational structure of deep learning models makes them remarkably easy to retarget. (See the conversations on multilingual models and one-size-fits-all models for related discussion of this property.) This has resulted in the following two-step approach that's become common across the industry.

  1. Train a single general-purpose model at great expense
    Train an enormous general-purpose model on as much data as you can possibly amass (Google-scale, if possible). The internal representation of this model will gravitate toward the "common sense" present in that large dataset: important concepts and relationships, and how they combine to form new concepts and relationships.
  2. Fine-tune it into many special-purpose models at small expense
    Using a small set of use-case-specific documents, re-train just the final few layers of your general-purpose model. Because most of the neural network stays frozen, you re-use all of that general-purpose "common sense" knowledge learned on the large dataset. Fine-tuning the final few layers teaches the network to apply that knowledge to your documents in particular. (A sketch of this step follows the list.)
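
To make step 2 concrete, here is a minimal sketch using PyTorch and the Hugging Face transformers library. The model name, the two-label setup, the learning rate, and the example sentences are all illustrative assumptions, not something from the original text; the point is just the freeze-then-retrain pattern.

```python
# A minimal sketch of step 2: freeze the pre-trained base, train only the head.
# Assumes PyTorch + Hugging Face transformers; model name and labels are illustrative.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "bert-base-uncased"  # stand-in for the expensive general-purpose model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Freeze the pre-trained encoder so its "common sense" weights stay fixed.
for param in model.base_model.parameters():
    param.requires_grad = False

# Only the small, newly initialized classification head remains trainable.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=5e-4
)

# One illustrative training step on a tiny, use-case-specific batch.
batch = tokenizer(
    ["the invoice total is wrong", "payment received, thank you"],
    padding=True,
    return_tensors="pt",
)
labels = torch.tensor([0, 1])
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```

Because the frozen encoder holds the vast majority of the parameters, each fine-tuned variant costs a small fraction of the original training run, which is what makes the capability and business-model ramifications below plausible.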

The ramifications of this approach are hard to overstate — both in terms of capability and business model.

  • Use cases with very scarce training data may become viable candidates for learned systems. Before, these use cases could only be addressed by rule systems and heuristics.
  • Businesses will likely begin licensing general-purpose models trained at great expense by machine learning experts, along with the right to fine-tune the final layers. These general-purpose models may focus on particular languages, topics (product reviews, finance, news articles), or document formats (forms, government filings, invoices). New standard licensing and IP agreements will be needed to define what it means to be a "derivative work."