Should I build or buy?
Should you build in-house document extraction capabilities or buy software to handle it for you? If the track record of the software industry is a guide, then it's likely you will be using plenty of third-party software either way. Few companies today would consider building their own email client, database, or CRM software, unless their corporate strategy specifically called for it. The document extraction space will be no different.
The real question, then, isn't whether to build or buy. It's what kind of software to buy and what kind of internal team to build to manage it. A good way to approach it is to ask yourself two questions:
- How bespoke are your inputs and outputs?
- How much do you need to modify the underlying model and pipeline?
A grid like the one below translates your answers to these questions into a purchasing strategy.
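Read as code, the grid is a two-input lookup. The sketch below is a minimal illustration rather than a definitive rule. The names `Strategy` and `purchasing_strategy` are hypothetical, and the fourth combination (standard inputs and outputs, but a need to modify the pipeline) is not discussed in this section, so it is marked as not covered rather than guessed.

```python
from enum import Enum


class Strategy(Enum):
    """Hypothetical labels for the cases described in this section."""
    API = "api"
    PLATFORM = "platform"
    IN_BETWEEN = "in-between"
    NOT_COVERED = "not covered"  # assumption: this section gives no advice here


def purchasing_strategy(bespoke_io: bool, custom_pipeline: bool) -> Strategy:
    """Map the answers to the two questions onto a purchasing strategy."""
    if bespoke_io and custom_pipeline:
        # Bespoke documents and a need for custom models and pipelines.
        return Strategy.PLATFORM
    if bespoke_io:
        # Bespoke documents, but otherwise standard needs.
        return Strategy.IN_BETWEEN
    if custom_pipeline:
        # Standard documents with a custom pipeline: not addressed in the text.
        return Strategy.NOT_COVERED
    # Standard documents and no interest in how the pipeline works.
    return Strategy.API
```

For example, the liquor store scanning ID cards would call `purchasing_strategy(bespoke_io=False, custom_pipeline=False)` and land in the API case.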
The API Case
If your inputs and outputs are universal standards, and you don't care about modifying your data pipeline, you need an API. This is the liquor store category: customers have ID cards, you need to scan them, and you don't really care how it all happens. You just need to know if they are of legal drinking age.
If that category sounds like you, then a good place to start your search is by looking for companies that provide APIs for exactly what you need.
- If your data can touch the cloud, this might be a cloud-based SaaS offering
- If your data must live on-premises, this might be a licensable Docker image
- If you must integrate with an existing codebase, this might be a licensable development library or open source project
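The three delivery options above can be sketched as a simple selection function. This is an illustrative assumption, not a prescribed procedure: the names `Delivery` and `delivery_form` are hypothetical, and the precedence (codebase integration first, then cloud, then on-premises) is one reasonable ordering that the text itself does not specify.

```python
from enum import Enum


class Delivery(Enum):
    """Hypothetical labels for the three delivery forms listed above."""
    CLOUD_SAAS = "cloud-based SaaS offering"
    CONTAINER = "licensable Docker image"
    LIBRARY = "licensable development library or open source project"


def delivery_form(data_may_touch_cloud: bool,
                  must_integrate_with_codebase: bool) -> Delivery:
    """Pick a delivery form to look for when shopping for an extraction API."""
    if must_integrate_with_codebase:
        # Assumption: an integration requirement outranks the hosting question.
        return Delivery.LIBRARY
    if data_may_touch_cloud:
        return Delivery.CLOUD_SAAS
    # Data has to stay inside your own infrastructure.
    return Delivery.CONTAINER
```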
Because the NLU industry is just getting started, there's no guarantee that "API-style companies" will exist, even for fairly standard documents. But they're growing increasingly numerous and worth considering if they suffice for your needs.
The Platform Case
If your inputs and outputs are bespoke, and you need access to custom models and pipelines, you need a platform. This is the Fortune 500 category: multinational businesses with every data environment imaginable and very specific requirements about how automation decisions get made and recorded.
Teams in this category will want to search for companies that provide the infrastructure to perform the full lifecycle of document extraction: curating documents, labeling examples, training models, hosting models, routing documents to the right models, and recording extensive metadata about performance and corrections.
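To make the last two lifecycle stages concrete, here is a minimal sketch of routing documents to the right model and recording metadata about each extraction. Everything in it is hypothetical: `Router`, `ExtractionRecord`, and `toy_invoice_model` are illustrative names, and the "model" is a string-matching stand-in for a trained model, not any vendor's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A model is anything that turns document text into named fields.
Extractor = Callable[[str], Dict[str, str]]


@dataclass
class ExtractionRecord:
    """Metadata kept for every extraction, including later human corrections."""
    doc_type: str
    model_name: str
    fields: Dict[str, str]
    corrections: Dict[str, str] = field(default_factory=dict)


@dataclass
class Router:
    """Sends each document type to its registered model and logs the result."""
    models: Dict[str, Tuple[str, Extractor]] = field(default_factory=dict)
    log: List[ExtractionRecord] = field(default_factory=list)

    def register(self, doc_type: str, model_name: str, model: Extractor) -> None:
        self.models[doc_type] = (model_name, model)

    def extract(self, doc_type: str, text: str) -> ExtractionRecord:
        # Raises KeyError if no model has been registered for this type.
        model_name, model = self.models[doc_type]
        record = ExtractionRecord(doc_type, model_name, model(text))
        self.log.append(record)
        return record


def toy_invoice_model(text: str) -> Dict[str, str]:
    # Stand-in for a trained model: returns whatever follows "Total:".
    _, _, tail = text.partition("Total:")
    return {"total": tail.strip()}
```

A caller would register one model per document type, for example `router.register("invoice", "invoice-v1", toy_invoice_model)`, and then inspect `router.log` to audit which model handled which document and what was corrected afterward.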
The tools in your platform need not come from a single provider. The same tradeoffs apply here as they would in any other space. Sourcing all of your NLU platform from a single provider will give you interoperability benefits, but no one provider is the best at everything, so you will forgo best-in-class capabilities for at least some areas.
The Extraction Pipelines and Extraction Models sections of this book assume you've taken this route and offer ways to think about the decisions you'll have to make.
The In-Between Case
The space in between these two clear choices is the "bespoke, but not exotic" situation. You might want to automate the handling of documents specific to your business, but your needs otherwise look completely standard. In this situation you'll have to decide how to proceed based on how technical you'd like your internal teams to get.
If you build in-house, you will want to search for companies that highlight general-purpose models that can be retargeted easily for your specific use case. In general, these companies will emphasize that your role isn't to build or host a model, but rather to gather and annotate a training dataset. Even if the company pitches its solution as completely self-service, you might interact with its engineers at some point, perhaps to tune your model or discuss what kinds of training examples you need to collect.
If you decide to outsource, you will want to search for companies that emphasize their sales engineering and customer success engineering teams. These may be the same companies you'd use to build a model yourself. The only question is who is using their product: you or them. You may still be expected to do all of the data gathering and labeling work.