Causal Analysis in Theory and Practice

July 16, 2009

Resolution is Fast Approaching: Discussion with Andrew Gelman

Filed under: Discussion,Propensity Score — judea @ 2:00 pm

Judea Pearl's exchange with Andrew Gelman about Donald Rubin's approach to propensity scores.

Dear Andrew,

In your last posting you have resolved the clash between Rubin and Pearl — many thanks.

You have concluded: "Conditioning depends on how the model is set up," which is exactly what I have been arguing in the last five postings.

But I am not asking for credit. I would like only to repeat it to all the dozens of propensity-score practitioners who are under the impression that Rubin, Rosenbaum and other leaders are strongly in favor of including as many variables as possible in the propensity score function, especially if they are good predictors of the treatment assignment.

Let me quote you again, in case it did not reach some of those practitioners:

"They (Rubin, Angrist, Imbens) don't suggest including an intermediate variable as a regression predictor or as a predictor in a propensity score matching routine, and they don't suggest including an instrument as a predictor in a propensity score model." (Gelman posting 2009)

Our conclusion is illuminating and compelling:

When Rosenbaum wrote: "there is little or no reason to avoid adjustment for a true covariate, a variable describing subjects before treatment" (Rosenbaum, 2002, p. 76), he really meant to exclude instrumental variables, colliders and perhaps other nasty variables from his statement.

And when Rubin wrote (2009):

"to avoid conditioning on some observed covariates,… is nonscientific ad hockery."

he really did not mean it in the context of propensity-score matching (which was the topic of his article).

And when Gelman wrote (in your first posting of this discussion):

"For example, we [Gelman and Hill] recommend that your model should, if possible, include all variables that affect the treatment assignment" he (Gelman) really meant to exclude variables that affect the treatment assignment if they act like instruments.

(which, if we look carefully at the reason for this exclusion, really means to exclude almost ALL variables that affect treatment assignment). And when Rubin changed the definition of "ignorability" (2009) "to be defined conditional on all observed covariates," he really meant to exclude colliders, instrumental variables and other trouble makers; he simply did not bother to tell us (1) that some variables are trouble makers, and (2) how to spot them.

If you and I accept these qualifications, and if you help me get the word out to those poor practitioners, I don't mind it if you tell them that all these exceptions and qualifications are well known in the potential-outcome subculture and that these prove that Pearl's approach was wrong all along. But please get the word out to those poor propensity-score practitioners, because they are conditioning on everything they can get their hands on.

I have spoken to many of them, and they are not even aware of the problem.

They follow Rubin's advice, and they are scared to be called "unprincipled" — I am not.

Andrew Gelman replies:

I agree with you that the term "unprincipled" is unfortunate, and I hope that all of us will try to use less negative terms when describing inferential approaches other than ours.

Regarding your main point above, it does seem that we are close to agreement. Graphical modeling is a way to understand the posited relations between variables in a statistical model. The graphical modeling framework does not magically create the model (and I'm not claiming that you ever said it did), but it can be a way for the user to understand his or her model and to communicate it more easily to others.

I think you're still confused on one point, though. It is not true that Rubin and Rosenbaum "really meant to exclude colliders, instrumental variables and other trouble makers." Rubin specifically wants to include instrumental variables and other "trouble makers" in his models; see his 1996 paper with Angrist and Imbens. He includes them, just not as regression predictors.

I agree with you that Rubin would not include an instrument or an intermediate outcome in the propensity score function, and it is unfortunate if people are doing this. But he definitely recommends including instruments and intermediate outcomes in the model in an appropriate way (where "appropriate" is defined based on the model itself, whether set up graphically (as you prefer) or algebraically (as Rubin prefers)).

Judea Pearl replies:

We are indeed close to a resolution.

Let us agree to separate the resolution into three parts:

  1. Propensity score matching
  2. Bayes analysis
  3. Other techniques

1. Propensity score matching. Here we agree (practitioners, please listen) that one should screen variables before including them in the propensity-score function, because some of them can be trouble makers, namely, capable of increasing bias over and above what it would be without their inclusion, and some are guaranteed to increase that bias.

1.1 Who those trouble makers are, and how to spot them, is a separate question that is a matter of taste. Pearl prefers to identify them from the graph and Rubin prefers to identify them from the probability distribution P(W|X, Y0,Y1), which he calls "the science".

1.2 We agree that once those trouble makers are identified, they should be excluded (repeat: excluded) from entering the propensity score function, regardless of how people interpreted previous statements by Rubin (2007, 2009), Rosenbaum (2002) or other analysts.

2. Bayes analysis. We agree that, if one manages to formulate the "science" behind the problem in the form of constraints over the distribution P(W|X, Y0,Y1) and load it with appropriate priors, then one need not exclude trouble makers in advance; sequential updating will properly narrow the posteriors to reflect both the science and the data. One such exercise is demonstrated in Section 8.5 of Pearl's book Causality, which purposefully includes an instrumental variable to deal with Bayes estimates of causal effects in clinical trials with non-compliance. (Mentioned here to allay any fears that Pearl is "confused" about this point, or is unaware of what can be done with Bayesian methods.)

2.1 Still, if the "science" proclaims certain covariates to be "irrelevant", there is no harm in excluding them EVEN FROM A BAYESIAN ANALYSIS, and this is true whether the "science" is expressed as a distribution over counterfactuals (as in the case of Rubin) or as a graph, based directly on the subjective judgments that are encoded in the "science". There might actually be benefits to excluding them, even when measurement cost is zero.

2.2 Such irrelevant variables are, for example, colliders, and certain variables affected by the treatment, e.g., Cost <— Treatment —> Outcome.

2.3 The status of intermediate variables (and M-bias) is still open. We are waiting for detailed analysis of examples such as Smoking —> Tar —> Cancer, with and without the Tar. There might be some computational advantage to including Tar in the analysis, although the target causal effect (of Smoking on Cancer) is insensitive to Tar if Smoking is randomized.

3. Other methods. Instrumental variables, intermediate variables and confounders can be identified, and harnessed to facilitate effective causal inference, using methods that involve neither propensity score matching nor Bayes analysis. The measurement of Tar, for example (see the example above), can be shown to enable a consistent estimate of the causal effect (of Smoking on Cancer) even in the presence of confounding factors affecting both Smoking and Cancer (pages 81-84 of Causality).
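The Tar example lends itself to a quick numerical check. The sketch below is a linear-Gaussian version with made-up coefficients (not the binary computation on pages 81-84 of Causality): the naive regression of Cancer on Smoking is confounded by an unobserved U, while the two-stage "front-door" product of coefficients recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Front-door graph: U -> Smoking, U -> Cancer (unobserved confounding),
# Smoking -> Tar -> Cancer, no direct Smoking -> Cancer edge.
# True effect of Smoking on Cancer = 0.7 * 0.5 = 0.35 (illustrative numbers).
u = rng.normal(size=n)                    # unobserved confounder
s = u + rng.normal(size=n)                # Smoking
t = 0.7 * s + rng.normal(size=n)          # Tar
c = 0.5 * t + u + rng.normal(size=n)      # Cancer

# Naive regression of Cancer on Smoking: biased by U.
naive = np.polyfit(s, c, 1)[0]

# Front-door estimate: (effect of Smoking on Tar) x
# (effect of Tar on Cancer, holding Smoking fixed).
a_hat = np.polyfit(s, t, 1)[0]            # Smoking -> Tar edge is unconfounded
design = np.column_stack([t, s, np.ones(n)])
b_hat = np.linalg.lstsq(design, c, rcond=None)[0][0]
front_door = a_hat * b_hat

print(f"naive:      {naive:.3f}")
print(f"front-door: {front_door:.3f}")    # close to the true 0.35
```

The second regression adjusts for Smoking when estimating the Tar-to-Cancer edge, which blocks the back-door path Tar <— Smoking <— U —> Cancer.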

Shall we both sign on to this resolution?

December 6, 2008

Remarks on the Method of Propensity Score

Filed under: Discussion,Opinion,Propensity Score — judea @ 5:00 pm

A letter from Judea Pearl to the Editor of Statistics in Medicine:

Dear Editor,

I read with great interest Donald Rubin’s paper "The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials" (2007, 26(1):20-36) [1], as well as the Letter To The Editor by Ian Shrier (2008, 27(14):2740-2741) [2], and Author Reply by Don Rubin (2008, 27(14):2741-2742) [3].

Shrier’s Letter posed an important question which remains unanswered in Rubin’s reply. I here venture to answer this question and to clarify related issues concerning the interpretation of propensity scores (PS) and their role in causal inference.

Shrier’s question was whether, asymptotically, the use of PS methods as described by Rubin may actually increase, not decrease, bias over and above a crude, unadjusted comparison between treated and untreated subjects. The answer is: Yes, and the M-graph cited by Shrier (see also [4, 5]) provides an extreme such example; the crude estimate is bias-free, while PS methods introduce new bias.

This occurs when treatment is strongly ignorable to begin with and becomes non-ignorable at some levels of e. In other words, although treated and untreated units are balanced in each stratum of e, the balance only holds relative to the covariates measured; unobserved confounders may be highly unbalanced in each stratum of e, capable of producing significant bias. Moreover, such imbalance may be dormant in the crude estimate and awakened through the use of PS methods.
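The mechanism is easy to reproduce numerically. In the linear sketch below (illustrative coefficients; U1 and U2 play the role of the two unobserved parents in the M-graph), the crude regression recovers the true effect, while adjusting for the collider M (which is what a PS routine built on M would do asymptotically) introduces bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# M-graph: U1 -> X, U1 -> M <- U2, U2 -> Y, and X -> Y with true effect 1.0.
# U1 does not touch Y and U2 does not touch X, so X -> Y is unconfounded.
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
m = u1 + u2 + rng.normal(size=n)          # collider of U1 and U2
x = u1 + rng.normal(size=n)               # treatment (continuous, for simplicity)
y = 1.0 * x + u2 + rng.normal(size=n)     # outcome; true causal effect is 1.0

# Crude (unadjusted) regression of Y on X: unbiased in this graph.
crude = np.polyfit(x, y, 1)[0]

# "Adjusted" regression that conditions on the collider M:
# this opens the path X <- U1 -> M <- U2 -> Y and introduces bias.
design = np.column_stack([x, m, np.ones(n)])
adjusted = np.linalg.lstsq(design, y, rcond=None)[0][0]

print(f"crude slope:    {crude:.3f}")     # close to the true 1.0
print(f"adjusted slope: {adjusted:.3f}")  # biased away from 1.0
```

Here conditioning moves the estimate from the true value to roughly 0.8: the "adjustment" manufactures bias where none existed.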

There are other features of PS methods that are worth emphasizing.

First, the propensity score e is a probabilistic, not a causal, concept. Therefore, in the limit of a very large sample, PS methods are bound to produce the same bias as straight stratification on the same set of measured covariates. They merely offer an effective way of approaching the asymptotic estimate which, due to the high dimensionality of X, is practically unattainable with straight stratification. Still, the asymptotic estimate is the same in both cases, and may or may not be biased, depending on the set of covariates chosen.
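This asymptotic equivalence is easy to verify numerically. In the sketch below (two binary covariates and illustrative coefficients), stratifying on the true propensity score e merges the four covariate cells into three, yet the adjusted estimate is the same as under straight stratification on (Z1, Z2).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Two binary covariates confound a binary treatment X and an outcome Y.
z1 = rng.integers(0, 2, size=n)
z2 = rng.integers(0, 2, size=n)
e = 1 / (1 + np.exp(-(z1 + z2 - 1)))      # true propensity score P(X=1 | Z1, Z2)
x = rng.random(n) < e
y = 2.0 * x + 1.5 * z1 + 0.5 * z2 + rng.normal(size=n)  # true effect = 2.0

def stratified_effect(strata):
    """Within-stratum treated-vs-untreated difference, weighted by stratum size."""
    est, total = 0.0, 0
    for s in np.unique(strata):
        idx = strata == s
        diff = y[idx & x].mean() - y[idx & ~x].mean()
        est += diff * idx.sum()
        total += idx.sum()
    return est / total

by_covariates = stratified_effect(z1 * 2 + z2)      # 4 strata on (Z1, Z2)
by_propensity = stratified_effect(np.round(e, 6))   # 3 strata on e

print(f"stratify on (Z1, Z2): {by_covariates:.3f}")
print(f"stratify on e:        {by_propensity:.3f}")  # same asymptotic value
```

Because X is balanced within each level of e, collapsing covariate cells that share the same propensity changes nothing asymptotically; had an unobserved confounder been present, both estimators would share the same bias as well.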

Second, the task of choosing a sufficient (i.e., bias-eliminating) set of covariates for PS analysis requires qualitative knowledge of the causal relationships among both observed and unobserved covariates. Given such knowledge, finding a sufficient set of covariates or deciding whether a sufficient set exists are two problems that can readily be solved by graphical methods [6, 7, 4].

Finally, experimental assessments of the bias-reducing potential of PS methods (such as those described in Rubin, 2007 [1]) can only be generalized to cases where the causal relationships among covariates, treatment, outcome and unobserved confounders are the same as in the experimental study. Thus, a study that proves bias reduction through the use of covariate set X does not justify the use of X in problems where the influence of unobserved confounders may be different.

In summary, the effectiveness of PS methods rests critically on the choice of covariates, X, and that choice cannot be left to guesswork; it requires that we understand, at least figuratively, what relationships may exist between observed and unobserved covariates and how the choice of the former can bring about strong ignorability or a reasonable approximation thereof.

Judea Pearl


  1. Rubin D. The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine 2007; 26:20–36.
  2. Shrier I. Letter to the editor. Statistics in Medicine 2008; 27:2740–2741.
  3. Rubin D. Author’s reply (to Ian Shrier’s Letter to the Editor). Statistics in Medicine 2008; 27:2741–2742.
  4. Greenland S, Pearl J, Robins J. Causal diagrams for epidemiologic research. Epidemiology 1999; 10(1):37–48.
  5. Greenland S. Quantifying biases in causal models: Classical confounding vs. collider-stratification bias. Epidemiology 2003; 14:300–306.
  6. Pearl J. Comment: Graphical models, causality, and intervention. Statistical Science 1993; 8(3):266–269.
  7. Pearl J. Causality: Models, Reasoning, and Inference. Cambridge University Press: New York, 2000.