"The Model"
The model is where your documents transform into the input data for your business process. Before the model, the data in your pipeline is still fundamentally human-centric: raw text, layout, and maybe page images. After the model, your data is a structured object: keys and values or rows in a database.
This section focuses on the model from the perspective of your data pipeline — how it interacts with the external infrastructure connected to it. There are enough discussions in this space to fill a weighty book, so consider this a summary of the main topics you'll need to navigate.
Cloud readiness and scalability are major differentiators
Cloud readiness remains a major differentiator in the model world. Fifteen years into the cloud hosting era, we're used to the idea that software can be deployed and scaled with little thought to the hardware that runs it. Microservices have taken this to the extreme of allowing virtualization and scaling at the function level.
Hosting models is conceptually similar to hosting microservices in some respects (both are functions that take a single input and produce a single output), but the complexity of running models places them all over the map in terms of infrastructure requirements.
Some models can deploy to your cloud as easily as a web page. Others require custom engineering and extraordinary computing resources. Merely running (not training) OpenAI's famous GPT-3 model is estimated to require a computer that would cost upwards of one hundred thousand dollars per year in cloud fees.
The easiest way to explain this incredible variance is to provide the answers you might hear to the question, "Why can't my model fit inside a microservice?" Some answers are:
- Some can!
But with caveats. Depending on the memory footprint of your model, your microservice cluster might take minutes rather than seconds to scale in response to demand spikes. You'll want to plan your minimum cluster size accordingly.
- Because of the memory footprint
The performance of deep learning models comes at the cost of enormously large parameterization. GPT-2, for example, contains over 1.5 billion parameters taking up over 5 gigabytes of memory. This exceeds the design parameters for many microservices clusters.
- Because of the GPU requirement
All models can be run on regular CPUs, but many deep learning models are so large that GPUs are required for acceptable performance in production. This, too, exceeds the design parameters for many microservices clusters.
- Because of the "cold start problem"
In order to load balance, microservices often spin down, removing themselves from system memory and causing a "cold start" to be required upon the next invocation. If that cold start includes loading a large model into memory, taking seconds or even minutes, your model consumers will experience occasional latency spikes. The serving sketch after this list shows where both the memory footprint and the cold start costs surface.
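To make the microservice analogy concrete, here is a minimal serving sketch, assuming FastAPI and a Hugging Face transformers pipeline (the endpoint path and model checkpoint are illustrative, not a recommendation). The single load at process startup is exactly where the memory footprint and cold start costs appear.

```python
# Minimal model-serving sketch (FastAPI and transformers assumed;
# the checkpoint and endpoint names are illustrative).
from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Loading the checkpoint happens once, at process startup. For a large
# model this can take seconds to minutes and several gigabytes of RAM --
# the memory footprint and cold start costs in a single line.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

@app.post("/predict")
def predict(payload: dict):
    # One input in, one structured output out -- the sense in which a
    # model behaves like any other microservice function.
    return classifier(payload["text"])[0]
```

If the platform scales this service by adding replicas, every new replica pays that startup load before it can answer its first request, which is why demand spikes can take minutes rather than seconds to absorb.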
Different styles of plug-and-play
Offering "plug-and-play" model support is a major architectural undertaking not to be undervalued in a platform design. If you are building your own platform, it's something you'll need. And if you're evaluating someone else's platform, it's something you will want to test.
Many existing platforms will take one of three plug-and-play strategies:
- Container / Virtual Machine Level
These platforms give you the most flexibility, imposing no particular requirements on your model. As long as you provide your model wrapped in a container or virtual machine, they will run it on hardware with the features you need. The downside is overhead: the platform can't make any optimizations on your behalf, and you're left configuring and running an entire virtual machine for even simple models. Google's Cloud Run service is an example of this architecture.
- Programming Framework Level
These platforms offer automated model deployment and scaling for any model that conforms to a particular software framework, such as TensorFlow, PyTorch, MXNet, or HuggingFace. This compromise allows platforms to provide specialized hosting and scaling on proprietary hardware (like Google's TPUs or Amazon's Inferentia) and automated integration with model-specific logging and monitoring systems (see the packaging sketch after this list).
- Model Family Level
These platforms offer retraining and fine-tuning of a pre-selected set of pre-trained models packaged with the platform. Users don't implement and train their own models; they customize pre-existing models by annotating their own data and tuning parameters.
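What "conforms to a particular software framework" means in practice is that your model can be serialized into the framework's standard format, so the hosting platform can load and scale it without any of your application code. Here is a minimal sketch, assuming PyTorch and TorchScript; the model class and output path are illustrative stand-ins.

```python
# Sketch of framework-level packaging (PyTorch/TorchScript assumed; the
# model architecture and output path are illustrative).
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A stand-in for whatever model your team has trained."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
example_input = torch.randn(1, 128)

# Tracing produces a self-contained artifact that a framework-level
# platform can serve on whatever hardware it manages (CPU, GPU, or a
# proprietary accelerator), without importing this file.
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")
```

Container-level platforms skip this step and accept whatever image you hand them; model-family platforms skip it in the other direction, because you never touch the model code at all.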
If your company has a sizable machine learning team and wants to be in the business of developing, training, and hosting your own models, you will probably want a pipeline that offers model plug-and-play at the container or programming framework level. These two levels are targeted at machine learning practitioners who build their own models in-house.
If your company wants to leverage the machine learning smarts of a third party, a service that offers plug-and-play at the model family level is your best choice. You can identify these companies because they will advertise a model that solves a category of problem. It might be form extraction, or table extraction, or legal document extraction, or identity card extraction. You should expect them to be able to tell you precisely where it will work, where it won't, and what kind of data you will need to provide so that they can fine-tune their model on your dataset specifically.
Latency is a major differentiator
A new technology is constrained by different things during different stages of its growth. At present, deep learning appears to be very hardware-constrained. Put another way: the software of deep learning models is currently capable of things the hardware just can't do yet.
When you read stories about the incredible performance of a model like GPT-3, that performance is as much a feat of infrastructure engineering as it is model design. These models are squeezing every last electron of performance from the silicon that powers them.
This means that a model's speed is still very much a differentiator.
Pick any category of model on the market and you'll find the same tradeoff: accuracy versus model size, where model size correlates with execution speed. When a programmer downloads an image classification model or text translation model, the first choice they have to make is which size variant to download. When an iOS developer asks the iPhone to interpret the text in an image, they have to explicitly tell the phone whether to optimize for speed or accuracy.
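That choice is easier to make with numbers in hand. Below is a rough sketch of measuring per-request latency for two candidate variants, assuming Hugging Face transformers; the checkpoint names are illustrative stand-ins for whatever variants you are actually comparing.

```python
# Rough latency check (transformers assumed; checkpoints are illustrative).
# Measuring per-request latency turns the speed/accuracy tradeoff into a
# number you can weigh against accuracy on your own validation set.
import time
from transformers import pipeline

variants = {
    "small": "distilbert-base-uncased-finetuned-sst-2-english",
    "base": "textattack/bert-base-uncased-SST-2",
}

sample = "The attached invoice covers services rendered in the third quarter."

for name, checkpoint in variants.items():
    clf = pipeline("text-classification", model=checkpoint)
    clf(sample)  # warm-up call so model loading isn't counted as latency
    start = time.perf_counter()
    for _ in range(20):
        clf(sample)
    per_request = (time.perf_counter() - start) / 20
    print(f"{name}: {per_request * 1000:.1f} ms per request")
```

Run the same loop on the hardware you actually plan to deploy on; a variant that is comfortably fast on a GPU development box may be unusable on the CPU nodes of a production cluster.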
What this means for a project manager is that you can't separate speed from accuracy. You need to consider them jointly.