Multilingual by default

Prediction: Multilingual extraction models will soon have a cost-to-value ratio that outperforms monolingual models.

Production NLP systems have traditionally been monolingual. Supporting multiple languages in an OCR engine came at the cost of speed and accuracy, and supporting multiple languages in an extraction model came at the cost of considerable development expense.

But neural models trained on multilingual datasets are increasingly common, and their accuracy is now close enough to that of monolingual models that they may be the better tradeoff for global companies. The results from these new models are, frankly, stunning. Consider the following data points:

  • Facebook's fastText library offers both an English-only variant and a 157-language multilingual variant. The latter enables engineers to embed concepts in a translingual space, meaning a rule system with the rule "Return the phrase after Employee Name" could match paper forms in all 157 languages (a code sketch of this idea follows the image below).
  • OpenAI's GPT-3 is a predictive language model trained on many languages simultaneously (how many is unknown). This enables it to, for example, answer questions in English about documents written in Chinese with no modification whatsoever. The image below is from an experiment I conducted using Chinese language tutor listings.

Diagram: GPT-3 answering questions in English about Chinese-language tutor listings
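To make the fastText point concrete, here is a minimal sketch of that kind of cross-lingual rule matching, assuming word vectors that have already been aligned into a shared space. The file names and the example phrases are placeholders of my own, not artifacts from the original experiment.

```python
# Sketch: matching an English rule label against a non-English form using
# cross-lingual word vectors aligned into a shared space. File names and the
# toy phrases are placeholders; any set of aligned vectors would do.
import numpy as np

def load_vectors(path, limit=50000):
    """Load word vectors from a standard .vec text file (word + floats per line)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # skip the "count dimension" header line
        for i, line in enumerate(f):
            if i >= limit:
                break
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def embed(phrase, vectors):
    """Average the vectors of the words in a phrase, ignoring unknown words."""
    vecs = [vectors[w] for w in phrase.lower().split() if w in vectors]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical aligned embeddings for English and Spanish.
en = load_vectors("wiki.multi.en.vec")
es = load_vectors("wiki.multi.es.vec")

rule_label = embed("employee name", en)        # the rule is written in English
form_field = embed("nombre del empleado", es)  # the form arrives in Spanish

# A high similarity score lets the same English rule fire on the Spanish form.
print(cosine(rule_label, form_field))
```

The key design point is that the rule is authored once, against the shared space, rather than once per language.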

It's reasonable to expect these improvements will quickly make their way into production extraction pipelines, offering easy handling of many languages at once.

How it works

Deep learning works by passing an input through a cascade of "neuron layers." The final layer represents the output of the network as a whole. Researchers have discovered that as these networks learn, the layers tend to specialize in identifying increasingly complex concepts.

In an image processing network, early layers might learn to detect patterns of color, or sharp lines and edges, or blurry patches. Downstream layers might use that information to begin specializing in shape detection. Later layers might use that shape information to specialize in higher-level objects such as eyes and mouths. And final layers might use these higher-level objects to identify people.
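As a concrete picture of that cascade, here is a minimal sketch of a stacked network in NumPy. The layer sizes are arbitrary and the weights are random; the comments map each stage onto the kind of specialization described above, without claiming this is how any particular production model is built.

```python
# A toy "cascade of neuron layers": each layer is a linear map followed by a
# nonlinearity, and the output of one layer feeds the next.
import numpy as np

rng = np.random.default_rng(0)

def layer(inputs, weights, biases):
    """One neuron layer: weighted sum of inputs, then a ReLU nonlinearity."""
    return np.maximum(0.0, inputs @ weights + biases)

# Randomly initialized weights for a 3-layer network: 64 -> 32 -> 16 -> 4.
w1, b1 = rng.normal(size=(64, 32)), np.zeros(32)
w2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
w3, b3 = rng.normal(size=(16, 4)), np.zeros(4)

x = rng.normal(size=(1, 64))   # a single input example

h1 = layer(x, w1, b1)          # early layer: in vision, low-level patterns
h2 = layer(h1, w2, b2)         # middle layer: combinations such as shapes
output = h2 @ w3 + b3          # final layer: the network's overall output

print(output.shape)            # (1, 4)
```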

In a language extraction model, it appears that training on multilingual datasets pushes the internal layers to find language-independent abstractions. The layers on either end of the network then translate words from a target language in and out of this translingual representation.

Here is an illustration using a pretend multilingual neural network that classifies nouns into the food category most likely to fit them:

Diagram: a pretend multilingual network mapping nouns from several languages into a shared space and onto food categories
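The same idea can be sketched in code. The embeddings and classifier weights below are invented purely for illustration; the point is that a single classifier head sits on top of a shared translingual space, so one model handles "apple", "pomme", and "manzana" identically.

```python
# A toy version of the pretend network: nouns from several languages are
# mapped into one shared (translingual) vector space, and a single classifier
# head assigns a food category. Vectors and weights are hand-picked, not learned.
import numpy as np

# Pretend translingual embeddings: words that mean the same thing land near
# each other regardless of language.
embeddings = {
    "apple":   np.array([0.90, 0.10]),  # English
    "pomme":   np.array([0.88, 0.12]),  # French
    "manzana": np.array([0.91, 0.09]),  # Spanish
    "bread":   np.array([0.10, 0.90]),  # English
    "pain":    np.array([0.12, 0.88]),  # French
    "pan":     np.array([0.09, 0.91]),  # Spanish
}

categories = ["fruit", "baked goods"]

# One classifier head shared by every language: a linear map from the
# translingual space to category scores.
head = np.array([[1.0, 0.0],
                 [0.0, 1.0]])

def classify(word):
    scores = embeddings[word] @ head
    return categories[int(np.argmax(scores))]

for word in ["apple", "pomme", "manzana", "pain"]:
    print(word, "->", classify(word))
# apple -> fruit, pomme -> fruit, manzana -> fruit, pain -> baked goods
```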

Multilingual models come at a cost: they are larger and slower than their monolingual counterparts. But they are catching up quickly, to the point where the difference in performance may be eclipsed by the utility of maintaining a single model for all regions of the globe.