How to reason about agent performance

It feels like half of Silicon Valley is working on Agents¹ right now, and that's caused a lot of "will it work?" debate:

  • Are Agents demoware, unable to work in the real world?
  • Are Agents Skynet, destined to take over the world?

This is a post for folks outside the agent community. The goal is to show you a simple, quantifiable way to think about agents and their future capabilities.

You'll be able to use the framework below to argue both sides of the demoware-Skynet debate, but with numbers.

The core Agent challenge is being right many times in a row

Chatbots have constant user feedback, the error tolerance of friendly chit-chat, and are often evaluated at the level of isolated "request-response" prompts.

Agents are different.

They work in relative isolation to take a high-level task request \(T\) ("Book me a ticket to London"), break it down into subtasks ("Visit united.com", "Click on Flights", ..), and then execute those subtasks... all without a human stepping in.

The output of each subtask \(t_i\) feeds into input (or context) of the next subtask \(t_{i+1}\).

$$T = t_1 \rightarrow t_2 \rightarrow \dots \rightarrow t_n$$

That means if even one subtask is chosen poorly, or executed poorly, the failure cascades down the chain. No human is there to correct it.

  • If the agent bakes a cake, but adds salt instead of sugar, it doesn't matter how good the icing is. The cake is going to be terrible.
  • If the agent visits a travel website, but clicks Book Hotel instead of Book Flight, it doesn't matter how good it is at filling in dates. The reservation won't be for air travel.

Failures in task chains turn brains into flunkies

If you know the success rates of each subtask \(t_i\), you can approximate² the overall success rate by multiplying them together.

$$P(T) = \prod_i P(t_i)$$

This is equivalent to saying: "for the overall task to be successful, each subtask has to be successful".

The thing about multiplying probabilities is that they only get smaller. The more subtasks you have, the worse your overall task success will be.
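Here's that napkin math as a few lines of Python (a minimal sketch; the subtask rates below are made up for illustration):

```python
import math

# Hypothetical per-subtask success rates for a four-step task chain.
subtask_rates = [0.99, 0.95, 0.90, 0.97]

# P(T) = product of P(t_i), assuming subtasks succeed or fail independently.
p_task = math.prod(subtask_rates)
print(round(p_task, 3))  # 0.821 -- lower than any individual subtask rate
```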

You can watch how fast things go from good to bad with your calculator. Try this experiment:

Let's say CousinVinnyLLM is the hot new Legal LLM on the block. It makes legal decisions with 90% accuracy. In the US, that's an A on a test -- pretty good!

But watch what happens when CousinVinnyLLM has to make five decisions in a row:

  • 1 Decision: \(0.9\)
  • 2 Decisions: \(0.9 * 0.9 = 0.81\)
  • 3 Decisions: \(0.9 * 0.9 * 0.9 = 0.73\)
  • 4 Decisions: \(0.9 * 0.9 * 0.9 * 0.9 = 0.66\)
  • 5 Decisions: \(0.9 * 0.9 * 0.9 * 0.9 * 0.9 = 0.59\)

The odds that all five decisions are correct are 0.59. CousinVinnyLLM just went from an A student to an F student.
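Or, if your calculator batteries are dead, the same decay in two lines of Python:

```python
# Compounding 90% per-decision accuracy over a chain of 1-5 decisions.
for n in range(1, 6):
    print(n, round(0.9 ** n, 2))  # 1 0.9, 2 0.81, 3 0.73, 4 0.66, 5 0.59
```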

So even if CousinVinnyLLM scores an A on the LSAT, it's not necessarily an A-grade lawyer. In fact, it's probably a much worse lawyer.

Performing a chain of tasks correctly is significantly harder than performing a single task correctly.

And most interesting work, whether buying a plane ticket or taking over the world, requires significantly more than five steps.

Side note: Your brain is damned near perfect at what it does

Making a peanut butter and jelly sandwich involves at least a hundred discrete actions.

You have to pick up the peanut butter. Turn the cap.. no -- the other direction! Stop turning the cap. Remove the cap. Pick up the knife.. no -- put the cap down first! Try picking up the knife again.. no -- from the handle!

The fact that you can do this, every time, without stuffing the bread in your pants or stabbing your pet parakeet means your brain is operating with more 9s of reliability than the Golden Gate fog.

Now you can quantify if Agents can do any task

..or rather: you can estimate what would have to be true for an agent to do a task of known complexity.

You just need two inputs and a rewritten version of the equation above.

  • \(R(T)\) - The success rate of Task \(T\) you require
  • \(N\) - The number of decisions / steps involved in doing Task \(T\)

For an Agent to successfully do Task \(T\), the success rate must be at least your required success rate.

$$P(T) \geq R(T)$$

With the simplification above, that means the product of subtask rates must be at least the required success rate.

$$\prod_{i = 1}^{N} P(t_i) \geq R(T)$$

We can make equation-rewriting easier by substituting the average subtask success rate in for the actual components³. This lets us write the product as an exponent.

$$ P(t)^N \geq R(T)$$

Now we just have a single variable to solve for -- the average subtask success rate:

$$ P(t) \geq R(T)^{\frac{1}{N}}$$

So for an Agent to succeed at rate \(R(T)\) on task \(T\) with \(N\) steps, you need the average subtask success rate to be greater than or equal to

$$R(T)^{\frac{1}{N}}$$
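In code, that's a one-liner (a minimal sketch of the formula above):

```python
def required_subtask_rate(overall_rate: float, n_steps: int) -> float:
    """Average per-subtask success rate needed to hit `overall_rate`
    across `n_steps` steps, under the independence-and-averaging
    simplifications above."""
    return overall_rate ** (1.0 / n_steps)
```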

Let's check our work

Let's plug back in the CousinVinnyLLM numbers to make sure we got that right.

  • \(R(T)\) = 0.59 was CousinVinnyLLM's overall success rate
  • \(N\) = 5 was the number of steps in CousinVinnyLLM's legal filing

If we did that right, we can just plug these two numbers in to get \(0.9\) back out: the per-subtask success rate.

$$ \begin{align} P(t) &\geq {R(T)}^{\frac{1}{N}} \\ P(t) &\geq {0.59}^{\frac{1}{5}} \\ P(t) &\geq 0.9 \end{align} $$
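Or, using the sketch from above:

```python
# Recover the per-subtask rate from CousinVinnyLLM's overall numbers.
print(round(required_subtask_rate(0.59, 5), 2))  # 0.9
```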

Cool.

So what would have to be true for AI to book me a flight?

Now you're armed with a fairly justifiable way to quantify questions like "are Agents demoware?" or "are Agents Skynet?"

Just pick a specific task that represents the scenarios you're interested in. Here are four examples:

  • Monitor and reserve camp sites near my city before they book up
  • Identify and execute arbitrage opportunities between Amazon and eBay
Build and maintain a Shopify store with products from a popular subreddit
  • Respond to FTC comment requests from the point of view of a parakeet

Next, decompose that task into a set of actions. These actions have to be very fine-grained -- at the level of "click on this," "type that," "decide what to type" -- the way a computer needs to do it. The number of these actions is your \(N\) value from above.

Then, select the overall level of accuracy that represents your threshold for either demoware or skynet. This is your \(R(T)\) value.

Now plug it in and calculate what the average per-subtask accuracy would need to be:

$$ P(t) \geq R(T)^{\frac{1}{N}}$$

The final step -- to keep yourself honest -- is to find the hardest subtask in your task decomposition. Take a look at that side by side with the average subtask success rate required. Is it realistic?

  • Until you can hit that success rate, it's going to be hard to accomplish the overall goal - Demoware
  • When you can hit that success rate, it'll be easy - Skynet
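For concreteness, here's the flight-booking arithmetic using the sketch from earlier. Both numbers below are illustrative assumptions, not measurements:

```python
# Assumed: ~40 fine-grained actions (click, type, decide) to book one flight.
# Assumed: 95% overall success is your personal "not demoware" threshold.
print(round(required_subtask_rate(0.95, 40), 4))  # 0.9987 per click/type/decide, on average
```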

The N-fold hat trick metric

We're already testing LLMs with LSAT-style tests, where each question is independent.

To track AI's progress toward building Agents, I think a useful synthetic metric atop these tests would be the odds that \(N\) random questions selected from a test were all answered correctly.

You could call it an "N-fold hat trick" metric.

It's a crude metric. And it will always be lower than the score on the test from which it was derived. But what's nice is you can derive it directly from existing benchmark data. No new tests required.

And it gives a clear, quantified way to think about the question: "How well can agents solve task chains of length \(N\) in this topic area?"
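Here's one way you might compute it from existing per-question results (a sketch; the `answers` list is a stand-in for real benchmark data):

```python
import random

# Stand-in for per-question correctness from an existing benchmark run:
# 90 right, 10 wrong -- i.e. a 90% test score.
answers = [True] * 90 + [False] * 10

def n_fold_hat_trick(answers, n, trials=100_000):
    """Estimate the odds that n randomly selected questions were all answered correctly."""
    return sum(all(random.sample(answers, n)) for _ in range(trials)) / trials

print(round(n_fold_hat_trick(answers, 5), 2))  # ~0.58, just under 0.9 ** 5
```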

Final thought: LLMs don't need to improve for Agents to improve

Think about how your brain achieves that near-perfection it has.

Imagine yourself standing on one leg.

You sway, back and forth, self-correcting.

Any one subtask \(t_i\) doesn't need to be a single shot on goal. It can be 100 shots on goal, by different LLMs, whose results are ranked and evaluated by an ensemble of another 25 LLMs. A dynamic system that sways -- and self-corrects.

This "mixture of experts" style is what companies like OpenAI and Apple are already doing to create systems that perform better than any one of their individual parts.

Final thought 2: And that means the "Agents will work" scenario is rosier than I've implied

I went out of my way to paint a pretty bleak picture of how probabilities multiply to crater overall performance.

Using the model above, getting 90% accuracy on a 100-step task requires 99.9% average per-subtask accuracy. And I think it's important to sit back and say, "oh no.. GPT is good, but is it 99.9% good?"

But real world systems aren't a group of disconnected, one-shot subtasks:

  • Mixtures of experts can boost accuracy beyond per-LLM accuracy
  • Errors can be detected, and recovered from, after they've happened
  • Problem spots can be resolved over time with active learning
  • Humans can be tactically leveraged for specific, hard, subtasks

What that means is 99.9% doesn't need to be the average accuracy of any one LLM. It just has to be the average subtask accuracy of the system as a whole. That's still hard, but more within reach than running the numbers alone might imply.
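One way to make that concrete: if the system can detect a failed subtask and retry it, the effective per-subtask rate compounds in the right direction for once. (A toy model: it assumes failures are always detected and retries are independent.)

```python
def rate_with_retries(p: float, retries: int) -> float:
    """Effective subtask success rate with up to `retries` extra attempts,
    assuming failures are always detected and attempts are independent."""
    return 1 - (1 - p) ** (retries + 1)

print(rate_with_retries(0.97, 0))            # 0.97 -- one shot
print(round(rate_with_retries(0.97, 2), 5))  # 0.99997 -- three shots clears 99.9%
```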

Thanks

Thanks to Dave Kolas and James Murdza for feedback on this!

Footnotes

  1. "Agents" are software that use AI to make plans and carry them out in the real world. They're one of the holy grails of automation. You give them a high-level task like "plan a party", and they choose a date, find a spot, invite your friends, order pizza, and choose a playlist -- all without any guidance. That's the goal, anyway.

  2. This assumes the subtasks fail or succeed independently. For more realism, model the dependencies between subtasks, but then the math wouldn't fit on a napkin.

  3. Standard "beware averages" disclaimers apply. E.g. \([0.5, 0.5]\) is usually a terrible approximation of \([0.0, 1.0]\). For more realism, reduce the sub-tasks to clusters or cliques, not a single average. But then you wouldn't have a single-point answer to solve for.