New Things With Pseudo-True Estimators
Reading Club: Optimally-Transported GMM
Reading Club! April Edition!
I’m starting a new thing. Once a month, I’m going to write about a recent paper that strikes my fancy. More as a commitment device for myself to keep up with the lit than anything. I won’t just dive into the paper. I’ll talk about the relevant literature beforehand so the post is more or less self-contained.
This month’s paper is from a little-known, small journal from the fringes of Economics academia: “Econometrica.” It’s about figuring out how best to use the data to inform model estimates when we’ve got enough evidence to know the model’s “wrong.”
It’s one of those excuses you’ll hear. “I know the model’s wrong, but all models are wrong,” and you, the listener, who is both smarter and better-looking, think, “Well, yes, but, like, clearly there’s a limit to that. If it’s really wrong about important things, that matters!” Wouldn’t you know it? There is a whole literature about estimation when we know the model is sort of wrong. Here’s part of that literature, links included!
The Paper: March 2026, Econometrica: https://www.econometricsociety.org/publications/econometrica/2026/03/01/Optimally-Transported-Generalized-Method-of-Moments
Free ArXiv Link: https://arxiv.org/pdf/2511.05712
Generalized Method of Moments
To grok the paper, we have to start with the kind of estimation problem we’re looking at. It’s a Generalized Method of Moments problem. GMM is a very common method in economics, but the thing about statistics is that many fields teach it and they all do it a little differently, mostly by historical accident. I think I’d like this framework even if I hadn’t been brainwashed into liking it by my education, but I guess that counterfactual is unobserved… It’s a nice, flexible way to look at the empirical implications of a model.
Suppose we have a vector of parameters b. The Generalized Method of Moments (GMM) thinks about problems where the parameters are identified by “moment restrictions” like this:

E[g(X, b)] = 0

where g is a vector of moment functions, possibly with more components than b. You can write a slew of common estimators like this (a few are sketched in code after the list).
OLS: E[X(Y - X’b)] = 0
IV: E[Z(Y - X’b)] = 0
Maximum Likelihood: E[ (dlog f/db)(X, b)] = 0
Quantile regression: E[X (p - 1(Y <= X’b))] = 0
Etc.
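To make the pattern concrete, here’s a minimal sketch of a few of these moment functions in Python (numpy only). The function names and array conventions are mine, not anything standard:

```python
import numpy as np

# Each function returns an n x m matrix of per-observation moment
# contributions g(X(i), b); the moment condition says the column
# means are zero at the true parameter.

def g_ols(Y, X, b):
    # OLS: E[X(Y - X'b)] = 0
    return X * (Y - X @ b)[:, None]

def g_iv(Y, X, Z, b):
    # IV: E[Z(Y - X'b)] = 0; overidentified if Z has more columns than X
    return Z * (Y - X @ b)[:, None]

def g_quantile(Y, X, b, p=0.5):
    # Quantile regression: E[X(p - 1(Y <= X'b))] = 0
    return X * (p - (Y <= X @ b))[:, None]

# Sanity check: at the OLS closed form, the sample moments are ~zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Y = X @ np.array([1.0, -2.0]) + rng.normal(size=500)
b_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(g_ols(Y, X, b_ols).mean(axis=0))  # ~[0, 0]
```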
The idea is that models produce moments, so by finding the model parameters that satisfy the moment conditions, we identify the version of our model that matches the data.
For another example, in an older post, I used GMM to build this estimator that reduces the variance of quantile treatment effects in experiments.
Usually, we estimate the model by solving some version of this problem:

min over b: gbar(b)'W gbar(b), where gbar(b) = (1/n) sum_i g(X(i), b)

for some weighting matrix W. The estimator is consistent for any positive definite W, so the choice comes down to efficiency; the classic efficient choice is the inverse of the variance matrix of the moments.
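In code, the standard two-step version of that recipe looks something like this minimal sketch (generic scipy optimizer, helper names mine): step one uses the identity weighting to get a consistent first pass, step two reweights by the estimated inverse variance of the moments.

```python
import numpy as np
from scipy.optimize import minimize

def two_step_gmm(g, b0):
    """g(b) -> n x m matrix of moment contributions; returns the estimate."""
    def objective(b, W):
        gbar = g(b).mean(axis=0)
        return gbar @ W @ gbar

    m = g(b0).shape[1]
    # Step 1: identity weighting gives a consistent (if inefficient) estimate.
    b1 = minimize(objective, b0, args=(np.eye(m),), method="Nelder-Mead").x
    # Step 2: reweight by the inverse variance of the moments at step 1.
    S = np.cov(g(b1).T)
    b2 = minimize(objective, b1, args=(np.linalg.inv(S),), method="Nelder-Mead").x
    return b2

# Example: one endogenous regressor, three instruments (overidentified).
rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=(n, 3))
V = rng.normal(size=n)
X = Z @ np.array([1.0, 0.5, 0.5]) + V
Y = 2.0 * X + V + rng.normal(size=n)  # X is endogenous through V
b_hat = two_step_gmm(lambda b: Z * (Y - X * b[0])[:, None], np.array([0.0]))
```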
Overidentification
But sometimes models produce more moments than we need to estimate the parameters. This “overidentification” is useful because it lets us use the extra moments to test whether the model is true. Intuition: a subset of the moment conditions pins down the model parameters, and we can plug the fitted parameters into the leftover conditions to see whether all the restrictions can hold simultaneously (the standard version of this test is sketched below).
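The classic implementation of that test is Hansen’s J statistic: at the efficient-GMM estimate, n times the weighted norm of the sample moments is asymptotically chi-squared with (number of moments minus number of parameters) degrees of freedom. A minimal sketch, assuming g_i is the n x m matrix of moment contributions evaluated at the estimate (e.g., from the GMM sketch above):

```python
import numpy as np
from scipy import stats

def hansen_j(g_i, n_params):
    """Hansen's J test: g_i is the n x m moment contributions at the estimate."""
    n, m = g_i.shape
    gbar = g_i.mean(axis=0)
    S = np.cov(g_i.T)                        # variance of the moments
    J = n * gbar @ np.linalg.solve(S, gbar)  # n * gbar' S^{-1} gbar
    return J, stats.chi2.sf(J, df=m - n_params)  # statistic, p-value
```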
But what if the model fails the test? What if it’s rejected by the data? It’s not terribly surprising that a model doesn’t hold exactly. So, how can we decide whether the model is wrong in the sense that it’s a simplification of a messy reality or if it’s wrong in the sense that we’re missing something critical?
One idea that emerged in the literature was: what if we could minimize the distance between our modeling assumptions and the data? We could capture some “pseudo-true” parameter, i.e., the most realistic parameter for the model given the data.
The parameter estimated by the GMM procedure above doesn’t really reflect that goal. It finds the parameter that minimizes the weighted norm of the moment conditions, which has strong efficiency properties if the model is correctly specified. But if the model is wrong? Then the GMM estimator doesn’t connect well with any intuition about minimizing the distance between the data and what the model requires. It’s just minimizing a norm of sample moments, and the pseudo-true parameter it lands on depends on the (somewhat arbitrary) choice of W.
Empirical Likelihood
One idea in this vein is known as Empirical Likelihood, or EL. The empirical distribution assigns probability 1/n to each X(i), and (in the overidentified case) this distribution won’t satisfy the moment conditions for any parameter b. But what if we keep the mass points of the empirical distribution and tilt the probabilities on them until the moment conditions are satisfied for some b? What if we estimated the distribution of the data, subject to the constraint that it satisfies our model?
The empirical likelihood estimator does exactly that. It maximizes the likelihood of the data subject to the moment conditions:

max over (b, p(1), ..., p(n)): sum_i log p(i)
subject to: p(i) >= 0, sum_i p(i) = 1, sum_i p(i) g(X(i), b) = 0
Note that without the moment conditions, the solution to this problem is p(i) = 1/n, i.e., the empirical distribution. The non-negativity constraints can’t bind because the derivative of log approaches infinity as its argument approaches 0, so the first order condition is:
1/p(i) = lambda for every i => all the p(i)’s are equal, so p(i) = 1/n.
That means that if there were a b parameter that satisfied the moment conditions, we would just set p(i) = 1/n and not make any adjustments to the data. We’re effectively maximizing the log likelihood of the empirical distribution subject to the constraints of our model.
Even though there are n + dim(b) parameters, there are nice tricks for computing the estimator from the first order conditions. With a multiplier lambda on the adding-up constraint and multipliers n*mu on the moment conditions, the first order condition for p(i) is:

1/p(i) = lambda + n mu'g(X(i), b)

So,

p(i) = 1 / (lambda + n mu'g(X(i), b))

Multiplying the first order condition by p(i) and summing across i gives lambda = n (from sum p(i) = 1 and sum p(i) g(X(i), b) = 0), so:

p(i) = 1 / (n (1 + mu'g(X(i), b)))
For any given “b”, we can then solve for mu via the equations:

sum_i g(X(i), b) / (1 + mu'g(X(i), b)) = 0

(these are just the constraints sum_i p(i) g(X(i), b) = 0 written in terms of mu). Define the solution as mu(b). Plugging p(i) = 1/(n(1 + mu'g(X(i), b))) back into the log likelihood gives sum_i log p(i) = -n log n - sum_i log(1 + mu(b)'g(X(i), b)). So we have this lower-dimensional estimation problem:

b-hat = argmin over b of: sum_i log(1 + mu(b)'g(X(i), b))
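Here’s a minimal sketch of that profiled problem, again with generic scipy optimizers and nothing tuned; real EL implementations are much more careful about the inner problem and the 1 + mu'g(X(i), b) > 0 boundary, which is handled crudely here.

```python
import numpy as np
from scipy.optimize import minimize

def el_profile(g, b):
    """Inner problem: max over mu of sum_i log(1 + mu'g(X(i), b))."""
    G = g(b)                     # n x m moment contributions at b
    def neg_inner(mu):
        t = 1.0 + G @ mu
        if np.any(t <= 1e-10):   # crude barrier: keep the probabilities positive
            return np.inf
        return -np.sum(np.log(t))
    res = minimize(neg_inner, np.zeros(G.shape[1]), method="Nelder-Mead")
    return -res.fun              # = sum_i log(1 + mu(b)'g(X(i), b))

def el_estimate(g, b0):
    """Outer problem: minimize the profiled objective over b."""
    return minimize(lambda b: el_profile(g, b), b0, method="Nelder-Mead").x
```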
Optimally-Transported GMM
EL is what maximum likelihood estimation of the empirical distribution looks like once we add our model’s constraints, but it isn’t the only way to get at this problem. What if, instead of looking for the closest distribution to the empirical distribution by adjusting probabilities, we changed the mass points?
So, now we’re keeping the empirical distribution’s even 1/n weighting but perturbing the data points to make the moment conditions hold, minimizing the distance we travel to “transport” X to Z:

min over (b, Z(1), ..., Z(n)): (1/n) sum_i ||X(i) - Z(i)||^2
subject to: (1/n) sum_i g(Z(i), b) = 0
Like EL, this problem also has a convenient solution despite being a high-dimensional optimization problem, this time based on successive linear approximation (see Algorithm 2.1 in the paper).
The estimator is intuitive. It finds a distribution of the data that satisfies the moment conditions while minimizing change, as measured by a transparent metric. You get numbers for Z(i), so you can start asking questions like: if X(i) had to change that much, do I think I’m really on the right path in modeling this? Or: yeah, that seems like it’s just a little off; fine. It’s a nice way to quantify how bad a problem the misspecification is.
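Here’s the most literal translation of that problem into code. To be clear, this is not the paper’s Algorithm 2.1, just a naive sketch that hands the constrained problem to scipy’s SLSQP, which is only sensible for small n since there are n x dim(X) variables:

```python
import numpy as np
from scipy.optimize import minimize

def ot_perturb(X, g, b):
    """min over Z of (1/n) sum ||X(i) - Z(i)||^2 s.t. (1/n) sum g(Z(i), b) = 0."""
    n, d = X.shape
    objective = lambda z: np.mean(np.sum((X - z.reshape(n, d)) ** 2, axis=1))
    moments = lambda z: g(z.reshape(n, d), b).mean(axis=0)  # should be zero
    res = minimize(objective, X.ravel(), method="SLSQP",
                   constraints=[{"type": "eq", "fun": moments}])
    return res.x.reshape(n, d)  # the transported points Z(i)
```

The row-by-row differences X - ot_perturb(X, g, b) are exactly the “how much did each observation have to move” diagnostic described above.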
Empirical Likelihood can do something similar, but speaking for myself, I have a lot more intuition about variable-space than about probability weights; your mileage may vary. “You’re telling me I have to make revenue numbers twice as large… weird. I don’t know if I believe this. No, I don’t think I do.”
One downside: I’m not sure E[||X - Z||^2] is really the loss function we have in mind when we’re thinking about minimizing how much we change covariates. I don’t think it’s crazy or anything, but we’re usually thinking about something a little less… “cardinal,” maybe, is the word I’m looking for. There are some variables that I can’t imagine being mismeasured, so I’m suspicious of moving them around. The method doesn’t really depend on this particular loss function, but once you’re out of the least-squares-ish world, you lose the nice computational trick, and you’ve got an n x dim(X) parameter to optimize over…
One idea I had here was to make the problem a little more specific and, based on that, identify more intuitive restrictions for the particular problem.
For example, the overidentified linear IV model:

Y = X'b + U, E[WU] = 0

where W is a vector of instruments with more components than b. And the misspecification we’re worried about is that some of the IV’s are invalid. So, write the sample moment conditions:

(1/n) sum_i W(i)(Y(i) - X(i)'b) = 0

And so, we’ve got more equations than unknowns, and no b will solve them all if some instruments are correlated with the structural error. Maybe the misspecification problem is that U = E + U*, where E is correlated with W but U* isn’t, so if instead the world had produced U* as the structural residual, we wouldn’t have any misspecification. We could solve for the minimum “E” that solves the problem, like so:

min over (b, E(1), ..., E(n)): (1/n) sum_i E(i)^2
subject to: (1/n) sum_i W(i)(Y(i) - X(i)'b - E(i)) = 0
So now we have a structural interpretation of the error. If I had to give any modeling advice… knowing what the error is is the first step toward wisdom…
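For what it’s worth, this version has a closed form for fixed b, since minimizing the squared norm of E subject to the linear moment constraint is a least-norm problem: the minimal E is the projection of the residual U = Y - Xb onto the column space of the instruments. A sketch (function name mine):

```python
import numpy as np

def minimal_e(Y, X, W, b):
    """Smallest E (in squared norm) with (1/n) sum W(i)(Y(i) - X(i)'b - E(i)) = 0."""
    U = Y - X @ b                                 # residual at the candidate b
    return W @ np.linalg.solve(W.T @ W, W.T @ U)  # projection of U onto col(W)
```

A fun consequence, if I have the algebra right: plugging that minimal E back in and then minimizing (1/n) sum_i E(i)^2 over b means minimizing the squared norm of the projected residual, which is exactly the two-stage least squares objective. So under this particular loss, the pseudo-true b is just 2SLS.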
Anyway, I think there are many other interesting applications of this idea and Empirical Likelihood, but I don’t think I’ve ever seen it used in industry, despite seeing many examples of overidentified IV models rejected by the data. If you’ve got a GMM problem and the time, it’s worth taking a look at these methods to get a sense of the size of the misspecification problem, and, if it’s small, using pseudo-true estimates that balance the model and the data.
Thanks for reading!
Zach
Connect at: https://linkedin.com/in/zlflynn


