Statistics and Compute
The key question for whether LLM capability plateaus
Look. The first thing I’m going to do is hook you. I’m going to tell you this is about AI and LLMs. You have to believe me that we’ll get there. If, along the way, this starts to look like it’s about time series regressions (“boring shit”), just trust me. Believe me. Keep the faith.
Computation and Statistics
Suppose we have a time-series regression (remember: trust me) where we don’t know the lag structure.
We find the “best” model for the time series in terms of out-of-sample prediction error by using the following computational procedure:
Fix a maximum lag L and require that no lag beyond it enters the model, i.e., for every p greater than L, the coefficient r_p = 0.
For each of the 2^L candidate models (every subset of lags 1 through L), compute the cross-validation score by leaving out future data and evaluating how well the model predicts it.
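Here is a minimal sketch of that search, with made-up names and a toy OLS autoregression standing in for whatever model you’d actually fit; nothing below is from anyone’s real pipeline, it just makes the procedure concrete.

```python
# A minimal sketch of the procedure above: exhaustively score every nonempty
# subset of lags {1, ..., L} with a forward-looking (leave-out-the-future) loop.
# The AR-by-OLS model, the names, and the scoring details are all illustrative.
from itertools import combinations

import numpy as np


def cv_score(y, lags, n_test=20):
    """Mean squared one-step-ahead error for the model that uses `lags`."""
    max_lag = max(lags)
    errors = []
    for t in range(len(y) - n_test, len(y)):
        # Fit on everything strictly before time t, then predict y[t].
        X_train = np.array([[y[s - p] for p in lags] for s in range(max_lag, t)])
        y_train = np.asarray(y[max_lag:t])
        beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
        y_hat = np.array([y[t - p] for p in lags]) @ beta
        errors.append((y[t] - y_hat) ** 2)
    return float(np.mean(errors))


def best_lag_subset(y, L):
    """Search all 2^L - 1 nonempty subsets of lags {1, ..., L}."""
    candidates = [c for k in range(1, L + 1) for c in combinations(range(1, L + 1), k)]
    return min(candidates, key=lambda lags: cv_score(y, lags))
```

(The series has to be long enough that there is training data before the held-out block; the empty model is skipped for brevity.)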
The reason we fix an L is that 2^L gets awfully large awfully fast. Computationally, we just can’t evaluate that many models. L models our computational constraint. We have two possible scenarios:
Scenario 1: The optimal model selected via this criterion has max{p} = L.
Scenario 2: The optimal model selected via this criterion has max{p} < L.
In scenario 1, we are computationally constrained. If we could increase L and make our model more flexible, our predictions would improve. We just need to call up NVIDIA and get more, better chips.
In scenario 2, we are statistically constrained. The problem isn’t that we don’t have enough computational power. The problem is that, given the data available to us, the best models we can construct are less flexible. We need more data before we can improve the model. NVIDIA can’t help us. If we spend billions of dollars on compute, the return on it will be zip, zilch, nada.
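In terms of the sketch above, the diagnostic for which regime we’re in is just whether the chosen model is pinned at the budget (again, purely illustrative):

```python
# Continuing the hypothetical sketch above: which constraint binds?
chosen = best_lag_subset(y, L=8)   # `y` and L=8 are placeholders
if max(chosen) == 8:
    print("Scenario 1: pinned at the computational budget; more compute should help.")
else:
    print("Scenario 2: the budget isn't binding; more compute won't move the needle.")
```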
LLMs
Whether core AI capability improves much over the next few years comes down to whether we are computationally or statistically constrained.
Until now, LLMs have been primarily computationally constrained. When we got better chips, when folks spent more on compute, the result was better models. But I think anyone who’s used the tech over the past few years has noticed the core technology leveling off.
The difference between GPT 3.5 and 4.0 was enormous. I recall asking 3.5 to explain statistical significance in a stakeholder-friendly way, and it responded with gibberish that was so bad it didn’t even rise to the level of being wrong—and to a simple, textbook question! 4.0 was so much better that it boggled the mind. But 5.x… It’s fine. I’m sure it’s better, but I hardly notice.
I have no inside info, but it does make me wonder whether we aren’t moving into the statistically constrained regime—or, at least, will be in the next couple of cycles. And if we are… it’ll be very difficult to improve the core model. Throwing more parameters and more compute into the mix won’t change the mathematics.
Unlike many other technologies, LLMs don’t see increasing returns to more data. The returns fall off rapidly: you need exponential increases in data to generate sublinear improvements in performance.
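To put a stylized number on that (my own toy assumption, not a figure from anywhere): if prediction error fell as a power law in dataset size, each doubling of the data would only shave off a small, fixed fraction of the remaining error.

```python
# Stylized illustration only: assume error ~ D**(-0.1), a made-up exponent.
per_doubling = 1 - 2 ** (-0.1)
print(f"{per_doubling:.1%} error reduction per doubling of data")           # ~6.7%
print(f"{1000 ** (-0.1):.0%} of the original error remains at 1000x data")  # ~50%
```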
I think we’ll run into the statistical constraint sooner than the market’s priced in. Word prediction is a hard game. The difference between the right word and the wrong word is a big thing.
The difference between the almost right word and the right word is really a large matter—'tis the difference between the lightning-bug and the lightning.
Mark Twain
In math, the loss function is steep. Small errors are big errors. A steep loss function, combined with a very high dimensional, nonsmooth problem, implies we’ll need a lot of data. We’ll have to get it from somewhere.
I think I can see how it happens.
Skills
We can’t generate vast amounts of additional real data. The AI companies are already using as much as is plausibly available. So, we need to make better data, not just more. Structured data. We have to add single observations that are worth billions of unstructured rows. We need to generate data for LLMs rather than rely on data originally generated for other purposes.
This is where folks are going with “skills” and related ideas. Provide tight context in a structured format to give LLMs data that is worth a vast amount of the incidental data they’re otherwise trained on.
Going back to our time series example: suppose we knew that there was a seasonal component that hit every four time units. Then all the models that can’t capture that structure are right out. That’s information that might take a long time series history to discover from the data alone.
In other words, skills are parameter restrictions. If the restrictions are true, they dramatically improve model performance.
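Concretely, in the earlier toy search, the seasonal “skill” is just a filter on the candidate set: keep only models that include lag 4 and let cross-validation sort out the rest (same caveats as before; this reuses the hypothetical cv_score from the earlier sketch).

```python
from itertools import combinations

# A "skill" as a parameter restriction in the toy setup: we know the seasonal
# effect recurs every 4 time units, so only candidates containing lag 4 survive.
L = 8  # same illustrative computational budget as before
candidates = [c for k in range(1, L + 1) for c in combinations(range(1, L + 1), k)]
restricted = [lags for lags in candidates if 4 in lags]   # roughly halves the search
best = min(restricted, key=lambda lags: cv_score(y, lags))  # cv_score from the sketch above
```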
In five years, I think the vast improvement in perceived LLM performance will come from these kinds of purpose-built skills, not from core model performance. The core model will soon run into statistical constraints, if it hasn’t already, but there will be skills that help it write better code, better emails, etc., by applying rules and logic that humans already know in a structured way.
Like in classical statistics, the way to learn more from a fixed data set is to make more credible assumptions about the data-generating process. Give the problem structure by applying what you know about how the world works, and stop requiring the data to learn everything.
The trick is to tell the data what to think when you know what it ought to think and to let it tell you what to think when you don’t.
Anyway, I think that’s why it’s going where it’s going.
Fewer chips, more context.
Thanks for reading!
Zach
Connect at: https://linkedin.com/in/zlflynn
Look at my website: https://zflynn.com

