A workhorse method for measuring censored relationships
Without making distributional assumptions!
We’re going to talk mainly about the general problem of understanding censored relationships, but specifics make generality easier to grok…
Suppose we want to understand how the time until a Free/Basic Plan customer signs up for the Premium/Ultra/Mega Package varies with different actions or incentives the customer takes or receives. For customers that do eventually sign up for Premium, we observe the time it took them to do so from when they first signed up for the Free Plan. But for customers that haven’t signed up for Premium, the only thing we know about their Time to Premium is that it is greater than however long it’s been so far, i.e. we only have a lower bound on their Time to Premium. We’re missing data.
Most of causal inference is about making assumptions to fill in missing data. We don’t know what so-and-so would have done if they hadn’t been treated, so we find people who weren’t treated and say they’re kind of like so-and-so and fill in the treated subject’s missing outcome with theirs. In censored problems, like this one, we come up with an assumption about what the censored data would look like.
The classical methods (recall that “classical” is math-speak for “lame”) use parametric assumptions—impose a functional form on the censored variable’s distribution—to make the censored variable informative about the full distribution.
The classical example (the Tobit model) looks like this:

y = xb + e, e ~ Normal(0, s²)

where we only observe y* = max{y, 0} and y is unobserved. b and s are the parameters to estimate.
The parametric assumption, that e (and hence y given x) is normally distributed, allows the uncensored data to be informative about the distribution of the censored data because we know:

Pr[y ≤ 0 | x] = Φ(−xb / s)

where Φ is the standard normal CDF.
And we can identify the parameters of this distribution just from data on (y*, x). The normality of e allows us to connect the error from the uncensored data to what the error would be on the censored data. But this is a very strong assumption and we’re essentially letting functional form tell us about the extent of the censoring problem. Not very satisfying!
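To make the classical approach concrete, here's a quick sketch of Tobit-style maximum likelihood on simulated data (the linear specification, sample size, and parameter values are all hypothetical):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 2, n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # latent y = xb + e, e ~ N(0, s^2), with b = (1, 2), s = 1
ystar = np.maximum(y, 0.0)               # we only observe y* = max{y, 0}
censored = ystar == 0.0

def neg_loglik(theta):
    b0, b1, log_s = theta
    s = np.exp(log_s)                    # parameterize by log(s) to keep s > 0
    mu = b0 + b1 * x
    # uncensored observations: normal density of y at the observed y*
    ll_unc = norm.logpdf(ystar[~censored], mu[~censored], s)
    # censored observations: Pr[y <= 0 | x] = Phi(-mu / s)
    ll_cen = norm.logcdf(-mu[censored] / s)
    return -(ll_unc.sum() + ll_cen.sum())

fit = minimize(neg_loglik, x0=np.zeros(3), method="BFGS")
b0_hat, b1_hat, s_hat = fit.x[0], fit.x[1], np.exp(fit.x[2])
```

The censored observations contribute Pr[y ≤ 0 | x] to the likelihood, which is exactly where the normality assumption does its work.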
What is the most we can identify without this kind of functional form assumption?
What can we learn about how the distribution of a latent, censored variable changes with a covariate without making distributional assumptions about what we can’t see?
Fundamentally, the question of how the distribution of y changes with x is a question about how quantiles change. So, write:

Q(t | x) = f(x, t)

f is the conditional quantile function of y given x, evaluated at quantile level t.
Let v be a uniform random variable drawn on (0, 1). Then, here’s a random variable that has the same distribution as y given x:

f(x, v)
So, if we can identify this quantile function, we’ll know a lot about how x affects y.
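To see why the quantile function is enough, here's a small simulation (the normal location model is just a hypothetical choice of f): plugging uniform draws into f reproduces the conditional distribution of y.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = 1.5                                  # a fixed covariate value
t_levels = np.array([0.1, 0.5, 0.9])

# hypothetical quantile function: y | x ~ N(x, 1), so f(x, t) = x + Phi^{-1}(t)
def f(x, t):
    return x + norm.ppf(t)

v = rng.uniform(size=200_000)            # v ~ Uniform(0, 1)
y_sim = f(x, v)                          # f(x, v) has the distribution of y given x

qs_sim = np.quantile(y_sim, t_levels)    # sample quantiles of the simulated draws
qs_true = f(x, t_levels)                 # the quantile function itself: they agree
```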
The problem is that we don’t observe y, so we can’t just estimate the model using our favorite quantile regression method.
Instead, we observe y* = max{y, 0}.
What we’re going to do is show that for a certain set of (x, t) we can identify f(x, t). Maybe that set is good enough that we don’t need to make any other assumptions: we can answer whatever question we’re asking of the data just from the identified points of the quantile function. If we need more, then we can extrapolate, i.e. make the dirty parametric assumptions, but we can start from points that are identified without having to make such compromises.
Whenever I’m thinking about how to identify something, I start with the distribution function. The reason to start there is that it fully characterizes the information available in the data. So it gives you the most identifying power to play with.
The data we have available to us is (y*, x). So let’s do a little algebra and see what the distribution of y* given x tells us. Start with its CDF at a point u:

Pr[y* ≤ u | x]
Let p(x) = Pr[y ≥ 0 | x] be the probability of not being censored, conditional on x. The nice thing here is that we can estimate this function from the data because we know whether each observation is censored.
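Here's a hypothetical sketch of estimating p(x) when x is discrete — it's just the share of uncensored observations at each value of x:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 120_000
x = rng.integers(0, 3, size=n)           # hypothetical discrete covariate, x in {0, 1, 2}
y = x + rng.normal(size=n)               # latent outcome: y | x ~ N(x, 1)
ystar = np.maximum(y, 0.0)               # we observe only the censored y*
uncensored = ystar > 0                   # censoring status is observable

# p(x) = Pr[not censored | x], estimated by the share uncensored at each x
p_hat = {v: uncensored[x == v].mean() for v in (0, 1, 2)}
```

With a continuous x you'd replace the cell means with your favorite binary-outcome regression of the censoring indicator on x.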
Let C be the event of an observation being censored. Then we can write the above as:

Pr[y* ≤ u | x] = Pr[y* ≤ u | x, C] (1 − p(x)) + Pr[y* ≤ u | x, not C] p(x)
At all points u on y*’s support, u is nonnegative because y* is nonnegative, so Pr[y* ≤ u | x, C] = 1 (a censored observation has y* = 0 ≤ u), and y* = y on the uncensored observations. Therefore:

Pr[y* ≤ u | x] = (1 − p(x)) + p(x) Pr[y ≤ u | x, not C]
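A quick simulated sanity check of this decomposition (the latent normal model is hypothetical): Pr[y* ≤ u | x] = (1 − p(x)) + p(x)·Pr[y ≤ u | x, not censored] for any u ≥ 0.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000
y = 0.5 + rng.normal(size=n)             # hypothetical latent model: y | x ~ N(0.5, 1)
ystar = np.maximum(y, 0.0)
censored = y < 0

u = 1.2                                  # any point u >= 0 on y*'s support
p = 1.0 - censored.mean()                # p(x): probability of not being censored
lhs = (ystar <= u).mean()                # Pr[y* <= u | x]
rhs = (1.0 - p) + p * (y[~censored] <= u).mean()
```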
Let’s substitute in the true quantile function of y for u here:

Pr[y* ≤ f(x, t) | x] = (1 − p(x)) + p(x) Pr[y ≤ f(x, t) | x, not C]
Because f is the conditional quantile function, we know:

Pr[y ≤ f(x, t) | x] = t

and, splitting that probability over censored and uncensored observations:

p(x) Pr[y ≤ f(x, t) | x, not C] = t − (1 − p(x)) Pr[y ≤ f(x, t) | x, C]
So substitute for that term in the conditional CDF of y* expression:

Pr[y* ≤ f(x, t) | x] = t + (1 − p(x)) (1 − Pr[y ≤ f(x, t) | x, C])
Now, we’re cooking. We’ve got the true quantile that we want (t) plus a bias term introduced by the censoring. It turns out that we can actually identify the set of (x,t) where the bias term is 0.
Suppose that for a certain (x,t): p(x) = 1 - t + c where c > 0.
Because:

Pr[y < 0 | x] = 1 − p(x) = t − c < t = Pr[y ≤ f(x, t) | x]

we must have that f(x, t) > 0 because c > 0: less than a fraction t of the distribution of y given x lies below zero, so the t-th quantile sits strictly above the censoring point.
If an observation is censored, then y < 0. Because f(x, t) > 0, the probability of y being less than f(x, t), given that y is censored, is 100%. So the bias term vanishes:

Pr[y* ≤ f(x, t) | x] = t

That is, f(x, t) is also the t-th conditional quantile of the observed y*.
So, for any (x, t) with a sufficiently high probability of not being censored, i.e. p(x) > 1 − t, we can identify the conditional quantile of y given x just from the conditional quantile of y* given x.
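Here's a hypothetical simulation of the punchline: at quantile levels with p(x) > 1 − t, the observed y* has the same t-th quantile as the latent y, while lower quantiles are destroyed by the censoring.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
y = 0.5 + rng.normal(size=n)             # hypothetical latent model: y | x ~ N(0.5, 1)
ystar = np.maximum(y, 0.0)

p = (y >= 0).mean()                      # p(x), roughly 0.69 here

t_ok = 0.5                               # p > 1 - t_ok, so this quantile is identified
t_bad = 0.1                              # p < 1 - t_bad, so this one is not

q_ok_obs = np.quantile(ystar, t_ok)      # quantile of the *observed* y*
q_ok_lat = np.quantile(y, t_ok)          # quantile of the *latent* y: they match
q_bad_obs = np.quantile(ystar, t_bad)    # piles up at the censoring point, 0
q_bad_lat = np.quantile(y, t_bad)        # negative, not recoverable from y*
```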
Cool. What does that buy us? This tells us the set of quantiles and covariate variables for which we do not face a real censoring problem and can identify the quantile function directly from data on (y*, x).
Unfortunately, this will usually not be enough. We’ll need to know how the quantile function responds at other (x, t) besides the ones that are identified.
My preferred method for doing this is to parameterize f(x, t; r) with a vector of parameters r. This form of parametric assumption is more natural and easier to motivate than assumptions about the functional forms of residual densities: you’re choosing how a response function varies as the quantile increases and as the covariates change, and the shape is strongly informed by the points that are actually identified instead of by a functional form assumption without strong justification. In a simplified example, suppose you had the following point-identified conditional medians and one non-identified point (hypothetical numbers):

x = 1: f(1, 0.5) = 1
x = 2: f(2, 0.5) = 2
x = 3: f(3, 0.5) = 3
x = 4: not identified
It seems sort of reasonable to think it’s roughly linear with x in this example so that x=4 should have a median around 4. The point is that you can use the nonparametrically-identified shapes to inform what parametric restrictions you put on the problem to identify the rest.
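To make the extrapolation step concrete, suppose (hypothetically) the identified conditional medians were 1, 2, 3 at x = 1, 2, 3. Fitting a linear f(x, 0.5; r) through them pins down the non-identified point at x = 4:

```python
import numpy as np

# hypothetical point-identified conditional medians f(x, 0.5) at x = 1, 2, 3
x_id = np.array([1.0, 2.0, 3.0])
med_id = np.array([1.0, 2.0, 3.0])

# parameterize f(x, 0.5; r) as linear in x, r = (slope, intercept),
# and fit it to the identified points by least squares
slope, intercept = np.polyfit(x_id, med_id, deg=1)

# extrapolate to the non-identified point x = 4
med_at_4 = slope * 4.0 + intercept
```

The same idea scales up: fit the parameterized quantile function only on the identified (x, t) region, then read off the rest from the fitted parameters.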
Anyway, this is just a tool that I think is less well-known than it deserves to be. So, I’m sharing it. Enjoy!
Thanks for reading!
Zach
Connect at: https://linkedin.com/in/zlflynn

