Estimating Logistic Regressions with Linear Regressions
In 2017 — how did that become 7 years ago?! — I wrote up this little estimator for logistic regression that is very stable, guaranteed to converge, and easy to extend to non-standard setups. I never really put it out anywhere, so I decided to add it to the blog!
When you run logistic regressions (especially with lots of covariates), you can run into numerical problems: even though the log-likelihood is concave, it is nearly flat in some regions of the parameter space, and IRLS or other methods can stall there, stopping far away from the true estimate.
This method doesn’t have that problem because it is numerically very stable. And because it is built on a sequence of linear probability models, it also works as a flexible platform for attaching extra assumptions to a logit model.
Repeated Linear Regression Estimator
Let y be an outcome variable with values {0,1} and suppose:
pr[y = 1 | x] = 1 / (1 + exp(-r(x)’b))
We want to estimate b (r is a vector of known transformations of x).
Define the parameter c as:
y = r(x)’c + e, E[r(x)e] = 0
i.e. c are the coefficients of the linear probability model using the same controls as in the logistic regression model.
Define the related parameter c(b) for a given parameter vector b as:
1 / (1 + exp(-r(x)’b)) = r(x)’c(b) + e(b), E[r(x)e(b)] = 0
The Repeated Linear Regression algorithm does the following (a NumPy sketch follows the list):

1. Estimate c by OLS.
2. Set k = 0 and let b(0) be some initial guess for the logistic parameters.
3. Estimate c(b(k)) by OLS.
4. Set b(k+1) = b(k) + c - c(b(k)) and k := k + 1.
5. Repeat steps 3–4 until ||b(k+1) - b(k)|| is small.
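Here is a minimal NumPy sketch of the algorithm. The function name, the tolerance, and the simulated-data demo at the bottom are mine, not part of the original writeup; it assumes R is an n-by-k matrix whose rows are r(x_i) and y is a vector of 0/1 outcomes.

```python
import numpy as np

def repeated_linear_regression(R, y, b0=None, tol=1e-8, max_iter=1000):
    """Logistic regression via repeated OLS (steps 1-5 above); a sketch."""
    b = np.zeros(R.shape[1]) if b0 is None else np.asarray(b0, dtype=float)
    # Step 1: c = OLS coefficients of y on R (lstsq is a numerically stable solver).
    c = np.linalg.lstsq(R, y, rcond=None)[0]
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-R @ b))           # fitted logistic probabilities
        cb = np.linalg.lstsq(R, p, rcond=None)[0]  # Step 3: c(b) = OLS of p on R
        b_new = b + c - cb                         # Step 4: fixed-point update
        if np.linalg.norm(b_new - b) < tol:        # Step 5: stop when the update is small
            return b_new
        b = b_new
    return b

# Demo on simulated data (b_true is made up for illustration).
rng = np.random.default_rng(0)
n = 5000
R = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
b_true = np.array([0.5, 1.0, -1.0, 0.25])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-R @ b_true))).astype(float)
b_hat = repeated_linear_regression(R, y)
print(b_hat)  # should land close to b_true
```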
So, we can already see some nice computational advantages to the estimator. OLS has numerically stable solvers (QR, SVD), so we don’t need to worry as much about wandering into some odd part of the parameter space and having our solver get stuck.
Why does this work?
Define p(b,x) = 1/(1+exp(-r(x)’b)) and let B be the parameter space:
We need the following technical conditions (in practice, ignoring these is usually fine):

1. min { p(b,x) : x, b in B } ≥ p’ > 0
2. max { p(b,x) : x, b in B } ≤ p’’ < 1
Define T(b) = b + c - c(b).
First, at the true value b* we know that c = c(b*) and therefore, b* is a fixed point of T. To see this, apply the law of iterated expectations:
c = E[rr’]^-1 E[ry] = E[rr’]^-1 E[r E[y | x]] = E[rr’]^-1 E[r p(b*, x)] = c(b*)
So we know b* is a fixed point of T, but is it the only one?
Since the derivative of the logistic function is dp/db = p(1-p) r(x), we have Dc(b) = E[rr’]^-1 E[rr’ p(1-p)], and a little bit of algebra shows:

DT(b) = I - Dc(b) = (E[rr’]^-1) E[rr’ (1 - p(1-p))]
Suppose m is an eigenvalue of DT(b). Then for some v != 0 :
(E[rr’]^-1) E[rr’ (1 - p(1-p))] v = m v
E[rr’ (1 - p(1-p))] v = m E[rr’] v
E[rr’ (1 - p(1-p) - m)] v = 0
Suppose m ≥ 1. Then (1 - p(1-p) - m) < 0 at all x, so E[(m + p(1-p) - 1) rr’] is strictly positive definite (as long as r is not perfectly collinear) and therefore invertible. So v = 0, a contradiction.
For a similar reason, suppose m ≤ 0. Then (1 - p(1-p) - m) > 0 at all x, so the matrix is again invertible and v = 0, a contradiction.
So, all eigenvalues of DT lie in (0,1), which means the spectral radius of DT is less than 1 (and since DT is similar to a symmetric matrix, this gives a norm in which T shrinks distances). Conditions 1–2 above keep p away from 0 and 1, so p(1-p) is bounded away from zero and the eigenvalues cannot get arbitrarily close to 1.
Therefore, T is a contraction mapping, which implies that T has a unique fixed point. Since we know that b* is a fixed point, it must be the unique one.
Because T is a contraction mapping, the Banach fixed-point theorem also tells us how to find the fixed point: iterate T from any starting value. That iteration is exactly the Repeated Linear Regression algorithm above.
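As a quick numerical sanity check of the eigenvalue argument, here is a short snippet (the variable names are mine; it reuses R and b_true from the sketch above) that builds the sample analogue of DT(b) and verifies that its eigenvalues fall in (0, 1):

```python
# Sample analogue of DT(b) = E[rr']^-1 E[rr'(1 - p(1-p))], evaluated at b_true.
p = 1.0 / (1.0 + np.exp(-R @ b_true))
Rw = R * (1.0 - p * (1.0 - p))[:, None]  # rows of R scaled by 1 - p(1-p)
DT = np.linalg.solve(R.T @ R, R.T @ Rw)  # (R'R)^-1 (R'Rw)
print(np.linalg.eigvals(DT).real)        # every entry should lie in (0, 1)
```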
How the estimator differs from Maximum Likelihood
The method is actually a method of moments estimator in disguise. It is equivalent to the following moment conditions:
E[r(x) (y - 1/(1+exp(-r(x)’b)))] = 0
So, for standard errors, we can just use standard formulas from the GMM literature based on these moment conditions.
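As a concrete sketch (the function name is mine; it reuses R, y, and b_hat from the demo above): with exactly as many moments as parameters, the asymptotic variance is G^-1 Omega (G^-1)’ / n, where G is the Jacobian of the sample moments and Omega is their variance.

```python
def rlr_standard_errors(R, y, b):
    """Sandwich standard errors based on E[r(x)(y - p(b,x))] = 0; a sketch."""
    n = R.shape[0]
    p = 1.0 / (1.0 + np.exp(-R @ b))
    u = y - p                                      # moment residuals
    G = -(R * (p * (1.0 - p))[:, None]).T @ R / n  # Jacobian: -E[rr' p(1-p)]
    Omega = (R * (u ** 2)[:, None]).T @ R / n      # variance: E[rr' (y-p)^2]
    Ginv = np.linalg.inv(G)
    V = Ginv @ Omega @ Ginv.T / n                  # sandwich variance of b_hat
    return np.sqrt(np.diag(V))

print(rlr_standard_errors(R, y, b_hat))
```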
The estimator is statistically less efficient than maximum likelihood, but it has attractive computational advantages whenever peak efficiency is less of a concern (e.g., with a relatively large sample size).
Endogeneity
A nice thing about this setup is that it is easy to extend to modifications of basic logistic regression, because the whole linear-regression toolkit comes along for free. For example…
When the parameters b have some structural meaning, the regressors may be endogenous: factors outside x that affect the probability that y = 1 are correlated with the x’s.
Here’s what an instrumental variable approach might look like in this environment:
pr(y = 1 | z) = E[1 / (1 + exp(-r(x)’b)) | z]
Where z is some vector of instruments. Let w(z) be transformations of these instruments such that the following assumption holds:
Tilting Rank Condition: for any scalar-valued function u(*) that is strictly positive at all values of its arguments, E[u(x) w(z) r(x)’] is invertible. Note that this condition holds whenever w = r and z = x (which is why we didn’t need it in the exogenous case above).
If the Tilting Rank Condition holds, then we can use the exact same algorithm to estimate this model, replacing the OLS steps with, naturally, the two-stage least squares estimator using w(z) as the instruments.
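Here is what that swap might look like in code, assuming a matrix W whose rows are w(z_i); the helper names are mine and this is only a sketch of the idea:

```python
import numpy as np

def tsls(R, W, target):
    """Two-stage least squares of `target` on R with instruments W."""
    R_hat = W @ np.linalg.lstsq(W, R, rcond=None)[0]      # first stage: project R on W
    return np.linalg.lstsq(R_hat, target, rcond=None)[0]  # second stage

def repeated_tsls(R, W, y, b0=None, tol=1e-8, max_iter=1000):
    """IV version of the algorithm: the same iteration with 2SLS in place of OLS."""
    b = np.zeros(R.shape[1]) if b0 is None else np.asarray(b0, dtype=float)
    c = tsls(R, W, y)                    # 2SLS analogue of the linear probability model
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-R @ b))
        b_new = b + c - tsls(R, W, p)    # same fixed-point update as before
        if np.linalg.norm(b_new - b) < tol:
            return b_new
        b = b_new
    return b
```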
Use Cases
So, what can we use this for? It’s a neat estimator that is useful in the following scenarios:
- You’ve got a constant feed of arbitrary data and you retrain the model automatically. You need that process to never fail or stall at a non-optimal point, because manually checking runs to get them unstuck isn’t realistic.
- You want convenient starting values for maximum likelihood so that you’re less likely to run into convergence issues (see the snippet after this list).
- You have a structural binary choice model and you need an easy way to use instrumental variable methods.
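For the warm-start scenario, the output drops straight into a standard maximum-likelihood routine; for instance with statsmodels (assuming it is installed; start_params is its usual warm-start argument, and R, y, and repeated_linear_regression come from the sketches above):

```python
import statsmodels.api as sm

b_start = repeated_linear_regression(R, y)  # stable, guaranteed-to-converge warm start
logit_fit = sm.Logit(y, R).fit(start_params=b_start, disp=0)
print(logit_fit.params)                     # maximum-likelihood estimates
```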
Enjoy!
Zach
Connect at: https://linkedin.com/in/zlflynn/!
If you want my help with any Experimentation, Analytics, etc. problem, click here.