Learning representations and solutions

You're going to have to form opinions about modeling approaches, and it will be tempting to lean on the crutch of "newer is better." To some extent this is true, but it's far more useful to understand how the capabilities of models have evolved over time. Through that lens, you might realize that older is better for your needs: it all depends on what you want humans to be responsible for and what you want the model to be responsible for.

Let's get abstract for a moment and ask the question "how is a problem solved with computers?"

In general, to solve a problem, you've got to do two things:

Represent the problem in some way that's useful for finding a solution.
Solve the problem with a procedure that manipulates your representation until it becomes the solution.

If it were chess, the representation would be the chessboard and the rules of the game of chess. The solution would be a winning strategy for some particular game of chess that adhered to the rules of the representation.

All document extraction (and artificial intelligence) models ever created can be described in terms of the evolution of who performs the two tasks of representation and solution: the human or the computer.

Diagram

Era 1: Rule Systems
In the early days of AI, humans handled both representation and solution. They created representations such as rule systems that allowed for various logical axioms to be stated, and then they used those systems to build rule sets and knowledge bases that would produce the solution to some problem. Today we call these "rule systems" or "expert systems."
Era 2: Machine Learning
"Machine learning" peaked in the 1990s and 2000s. In this era, humans represented a problem using tunable parts, and then the computer solved the problem by analyzing data to find the best tuning for those parts. Think of this era as linear regression on steroids. The human would assert that y = mx + b was the best representation for some problem; they just didn't know what m and b were. Then the computer would crunch numbers in a spreadsheet to determine the best values of m and b.
Era 3: Deep Learning
In the 2010s and beyond, computers have begun making decisions about both the problem representation and the solutions within that representation. The computer, rather than the human, decides how raw data should be summarized into higher-level concepts, how those higher-level concepts relate to each other, and what kinds of manipulations can be performed to generate a solution.
Era 4: Meta Learning (prediction)
This is a prediction of what comes after Deep Learning. The representations and solutions deep learning models produce are still constrained by network architecture and training regime. Not every architecture is able to learn every class of problem. A learning method that was able to move fluidly from architecture to architecture within different regions of its world representation — and choose to do so while learning — would have vastly generalized abilities over today's deep learning models. So much so that this might suffice for building an artificial general intelligence (AGI).
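To make the Era 2 division of labor concrete, here is a minimal sketch in Python. The human's contribution is the decision that y = mx + b is the right shape for the problem; the computer's contribution is finding m and b from data using ordinary least squares. The function name and the pages-versus-minutes data are hypothetical and exist only for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for the human-chosen representation y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized,
    # because the 1/n factors cancel in the ratio below).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    m = cov_xy / var_x
    b = mean_y - m * mean_x
    return m, b


# Hypothetical data: pages in a document vs. minutes needed to review it.
pages = [1, 2, 3, 4, 5]
minutes = [5, 7, 9, 11, 13]

m, b = fit_line(pages, minutes)
print(m, b)  # 2.0 3.0
```

Notice what the computer cannot do here: if the true relationship is not a straight line, no choice of m and b will fix that. The representation was fixed by the human before any data was seen.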

A quick note for ML Pros

This way of carving up the AI world, by who handles "representations" and who handles "solutions," isn't perfect, of course. Representation learning didn't start with deep learning (e.g. principal component analysis), and deep learning's representational choices are still constrained by humans (e.g. neural architecture selection). But this framing is mostly accurate and, more importantly, useful as a trend line for understanding how different forms of AI interact with the business and engineering teams that use them.

Why representation learning is a big deal

The nature of your representation matters tremendously. A good representation can render problems easy to solve. And a bad representation limits the effectiveness of any algorithm that you write with it. Imagine trying to extract fields from a PDF form with a problem representation that didn't preserve the spatial relationships between words. That might sound like a crazy representation but it's a common and effective one for document classification (called "bag of words")!
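A small Python sketch shows why bag of words is a poor fit for field extraction. It assumes a deliberately simple tokenizer (lowercase, split on whitespace), and the two form snippets are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Represent text as word counts, discarding order and position."""
    return Counter(text.lower().split())


# Two hypothetical form snippets with the same words in different positions.
form_a = bag_of_words("Name: Alice Date: Monday")
form_b = bag_of_words("Date: Alice Name: Monday")

print(form_a == form_b)  # True
```

Under this representation the two snippets are indistinguishable, so no algorithm built on top of it can tell which value belongs to which field. The same property is harmless for classification, where the question is only what kind of document this is.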

Problem representation is hard. Here is just a small set of questions someone writing a heuristic-based document extraction program might have to ask:

Is a document an unsorted bag of the words within it? A collection of sentences? Paragraphs? Pages?
Is a capitalized word the same as its lowercased variant?
Do yet-unseen words (e.g., from a foreign language) convey meaning or are they ignored?
Is the location of a word on the page meaningful? Its location relative to other words?
Is boldface text different from regular text?
Do words have categories, such as parts of speech? Do these categories impact decision making?
Can two words mean the same thing? Can two names refer to the same entity?
If two words are 90% similar, can they be considered the same word?
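Each of these questions eventually turns into a concrete decision in code. As a rough sketch, the Python below turns two of them (case folding and the "90% similar" threshold) into explicit parameters. The function names are hypothetical, and using the similarity ratio from the standard library's difflib is an assumption; it is only one of many possible definitions of "similar."

```python
from difflib import SequenceMatcher


def normalize(word, fold_case=True):
    """Decide whether a capitalized word equals its lowercased variant."""
    return word.lower() if fold_case else word


def same_word(a, b, fold_case=True, min_similarity=0.9):
    """Treat two words as the same if their similarity ratio meets the threshold."""
    a, b = normalize(a, fold_case), normalize(b, fold_case)
    return SequenceMatcher(None, a, b).ratio() >= min_similarity


print(same_word("Invoice", "invoice"))                   # True
print(same_word("Invoice", "invoice", fold_case=False))  # False
print(same_word("organization", "organisation"))         # True
```

Every default value here is a representational choice made by a human, and every solution built on top of these functions inherits that choice.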

When humans represent problems

When humans are in charge of problem representation, they can remain in complete control of the nature of solutions that are possible to even express:

They can ensure that any solution found will be fully explainable in human terms. "This loan was rejected because the applicant has four loans outstanding, and we matched their name against our database using these matching criteria."
They can provide domain-specific languages for easy addition of new rules to the system. Just about anyone can learn to write rules like this: "On ACCOUNT_OPEN if ACCOUNT.PARTNER is in PREFERRED_PARTNERS set ACCOUNT.PRIORITY = 100"
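To illustrate, the example rule above could sit on top of machinery as small as the following Python sketch. This is not a real rule engine or any particular product's API: the Account class, the partner names, and the fire function are all hypothetical stand-ins.

```python
from dataclasses import dataclass

# Hypothetical set of preferred partners, maintained by the business.
PREFERRED_PARTNERS = {"acme", "globex"}


@dataclass
class Account:
    partner: str
    priority: int = 0


def prefer_partner_accounts(account):
    """On ACCOUNT_OPEN if ACCOUNT.PARTNER is in PREFERRED_PARTNERS set ACCOUNT.PRIORITY = 100"""
    if account.partner in PREFERRED_PARTNERS:
        account.priority = 100


# Rules are grouped by the event that triggers them.
RULES = {"ACCOUNT_OPEN": [prefer_partner_accounts]}


def fire(event, account):
    """Apply every rule registered for the event, in order."""
    for rule in RULES.get(event, []):
        rule(account)
    return account


print(fire("ACCOUNT_OPEN", Account("acme")).priority)     # 100
print(fire("ACCOUNT_OPEN", Account("initech")).priority)  # 0
```

Because each rule is a named, readable unit, the explanation for any outcome is simply the list of rules that fired.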

But humans struggle to represent problems at scale:

Their representation is limited by their own imagination and stamina. There might be seven hundred different factors that actually impact an automated decision, but the human might only know about (or have time for) one hundred of them. This means any solution is structurally prevented from incorporating the other six hundred.
Their representation is limited by their ability to describe it to the computer. This becomes especially difficult when processing images: you might know a "signature" is a valuable representational element when processing contracts, but how do you describe what it means to be a signature to the computer?

When computers represent problems

Putting the computer in control of problem representation has its own set of tradeoffs.

The computer can find a problem representation directly from the data, with no need for domain experts. This changes the type of team you need to automate a problem.
Computers can explore units of meaning far more abstract and numerous than humans are capable of. These are called embeddings in the deep learning world. A human signature might be represented as the conjunction of a hundred abstract calculations on the alternating colors in a region of space; a human could never have devised these calculations.

But computers have their own faults:

Interpretability of the representation is not guaranteed. The computer might end up with a representation that leads it to the right answers for the wrong reasons. You won't know until an unnoticed environmental factor changes and things go haywire in production.
Interoperability with human systems and processes is not guaranteed. Valuable domain knowledge that should be added to the model might be difficult to incorporate because it comes from an alien representational world.

Nevertheless, much of the buzz around deep learning relates to its ability to devise problem representations on the fly. This has allowed it to solve harder problems than rule systems are capable of and to do so with less human effort to boot.

Automating Paperwork

A practical overview for enterprise
