On Myth, Confusion, and Science in Causal Analysis
Andrew Gelman (Columbia) recently wrote a blog post motivated by Judea Pearl's paper, "Myth, Confusion, and Science in Causal Analysis." In response, Pearl writes:
Dear Andrew,
Thank you for your blog post dated July 5. I appreciate your genuine and respectful quest to explore the differences between the approaches that I and Don Rubin are taking to causal inference.
In general, I would be the first to rally behind your call for theoretical pluralism (e.g., "It makes sense that other theoretical perspectives such as Pearl's could be useful too"). We know that one can prove a theorem in geometry by either geometrical or algebraic methods, depending on the problem and the perspective one prefers to take; only the very dogmatic would label one of the methods "unprincipled".
My article, "Myth, confusion and Science in Causal Analysis", is written with this dual perspective in mind, fully accommodating the graphical and potential-outcome conceptualizations as interchangeable, "A theorem in one approach is a theorem in another," I wrote.
However, when adherents of the one-perspective approach make claims that mathematically contradict those derived from the dual-perspective approach, one begins to wonder whether there is something more fundamental at play here.
In our case, the claim we hear from two adherents of the graph-less one-perspective school is: "there is no reason to avoid adjustment for a variable describing subjects before treatment." And from three adherents of the graph-assisted dual-perspective school we hear: "Adjustment for a variable describing subjects before treatment may be harmful."
This is a blatant contradiction that affects every observational study and therefore deserves to be discussed, even if we believe in "let one thousand roses bloom."
One may be tempted to resolve the contradiction by appealing to practical expediencies. For example,
- Nothing is black and white.
- Perhaps adjustment may be harmful in theory, but it is very rare in practice,
- Perhaps the harm is really very small, or
- We do not really know in practice if it is harmful or not, so why worry?
This line of defense would be agreeable, were it not accompanied by profound philosophical claims that the dual-perspective approach is in some way "unprincipled" and standing (God forbid) "contrary to Bayesianism."
The point is that we DO KNOW in practice when harm is likely to occur through improper adjustments. The same subjective knowledge that tells us that seat-belt usage does not cause smoking or lung disease also tells us that adjustment for seat-belt usage is likely to introduce bias.
Moreover, one can derive this warning in the graph-less notation of potential outcomes. So the question remains: why haven't potential-outcome scholars been issuing that warning to their students?
The conjecture I made should concern every Bayesian and every educator, for it points beyond M-bias and covariate selection. The conjecture is that the language of "potential outcome" and "ignorability" discourages investigators from articulating and using valuable knowledge which they possess, for example, that seat-belt usage does not cause smoking. Do you know of any study where such a piece of knowledge was used in determining whether treatment assignment is "ignorable" or not? My conjecture is confirmed by potential-outcome practitioners who admit to using "ignorability" invariably to justify their favorite method of analysis, never as an object to be justified by appeal to causal knowledge.
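To make the warning concrete, here is a minimal simulation of the kind of M-structure the seat-belt example stands for (all variable names and coefficients are invented for illustration, not taken from the letter): the exposure has no effect on the outcome and no confounding is present, yet adjusting for the pre-treatment covariate C manufactures an association.

```python
# Hypothetical M-bias setup: U1 and U2 are unobserved, C is a pre-treatment
# covariate that both of them influence (a collider), E is the exposure and
# Y the outcome.  The true effect of E on Y is zero and no back-door path
# is open, so the unadjusted estimate is unbiased; adjusting for C opens
# the path E <- U1 -> C <- U2 -> Y and introduces bias.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

U1 = rng.normal(size=n)              # unobserved cause of C and E
U2 = rng.normal(size=n)              # unobserved cause of C and Y
C = U1 + U2 + rng.normal(size=n)     # observed pre-treatment collider
E = U1 + rng.normal(size=n)          # exposure; has no effect on Y
Y = U2 + rng.normal(size=n)          # outcome; does not depend on E

def ols_coef(y, cols):
    """Least-squares coefficients of y on the given columns (plus intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("unadjusted estimate of E -> Y:", round(ols_coef(Y, [E])[1], 3))      # ~ 0.0
print("estimate adjusting for C:     ", round(ols_coef(Y, [E, C])[1], 3))   # ~ -0.2, spurious
```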
As to indiscriminate conditioning in Bayesian philosophy, the example of controlling for an intermediate variable (between treatment and outcome) should illuminate our discussion. (I do not buy your statement that bias is "tricky to define." It is extremely easy to define, even in Rubin's notation: "bias" is what you get if you adjust for Z when treatment assignment is not ignorable conditioned on Z. This suffices for our purposes.)
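Spelled out in symbols (a sketch only; I take X to be a binary treatment and Z the adjustment variable, neither symbol choice being fixed in the letter), the adjusted contrast minus the causal contrast is

```latex
\mathrm{Bias}(Z) \;=\; \sum_{z} \Big( E[\,Y \mid X{=}1,\, Z{=}z\,] - E[\,Y \mid X{=}0,\, Z{=}z\,] \Big)\, P(Z{=}z)
\;-\; \big( E[\,Y_1\,] - E[\,Y_0\,] \big),
```

which vanishes when treatment assignment is ignorable given Z, i.e. when (Y_0, Y_1) is independent of X given Z, but not in general.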
You say:
- A Bayesian analysis can control for intermediate outcomes–that's okay–but then…
- Jennifer and I recommend not controlling for intermediate outcomes.
- You can control for anything, you just then should suitably post-process…
- I heard Don Rubin make a similar point… Fisher made this mistake.
Andrew, I know you did not mean it to sound so indecisive, but it does. Surely, one can always add 17.5 to any number, as long as one remembers to "post-process" and correct the mistake later on. But we are not dealing here with children's arithmetic. Why not say it upfront: "You can't arbitrarily add 17.5 to a number and hope you did not do any harm." Even the Mullahs of arithmetic addition would forgive us for saying it that way.
If you incorporate an intermediate variable M as a predictor in your propensity score and continue to do matching as if it is just another evidentiary predictor, no post-processing will ever help you, except of course redoing the estimation afresh, with M removed. It will not fix itself by taking more samples. Is Bayesianism so dogmatic as to forbid us from speaking plainly and just saying: "Don't condition"? (No wonder I once wrote "Why I am only a half-Bayesian".)
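Here is a minimal sketch of that failure mode (the linear model and coefficients are made up for illustration): E is randomized, it affects Y partly through M, and including M in the adjustment removes exactly the part of the effect that flows through it, no matter how large the sample.

```python
# Hypothetical model: randomized binary treatment E, intermediate variable M
# on the path E -> M -> Y.  The causal (total) effect of E on Y is
# 2.0 * 1.0 + 0.5 = 2.5; adjusting for M reports only the direct part, 0.5,
# and more samples do not repair the discrepancy.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

E = rng.integers(0, 2, size=n)               # randomized treatment
M = 1.0 * E + rng.normal(size=n)             # intermediate outcome
Y = 2.0 * M + 0.5 * E + rng.normal(size=n)   # outcome

def ols_coef(y, cols):
    """Least-squares coefficients of y on the given columns (plus intercept)."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("regression of Y on E:      ", round(ols_coef(Y, [E])[1], 3))     # ~ 2.5, the causal effect
print("regression of Y on E and M:", round(ols_coef(Y, [E, M])[1], 3))  # ~ 0.5, not the causal effect
```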
True, the great R. A. Fisher made a similar mistake. But it happened in the context of estimating "direct effects," where one wants to control for the intermediary variable, not in the context of "causal effects," where one wants the intermediaries to vary freely. Incidentally, the repair that Don Rubin offered in the Fisher lecture made things even worse. For example, the direct effect according to Rubin's definition (using principal stratification) is definable only in units in which indirect effects are absent. This means that a grandfather would be deemed to have no direct effect on his grandson's behavior in families where he has some effect on the father. In linear systems, to take a sharper example, the direct effect would be undefined whenever indirect paths exist from the cause to its effect.
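For reference, the textbook linear-system decomposition at issue here (a sketch, not a quotation from the letter): with structural equations

```latex
M = \alpha E + \varepsilon_M, \qquad Y = \beta M + \gamma E + \varepsilon_Y,
```

the total effect of E on Y is gamma + beta*alpha and the direct effect is gamma, both well defined however large the indirect contribution; under the principal-stratification definition, by contrast, the direct effect loses its footing as soon as the indirect path E -> M -> Y is active.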
Such paradoxical conclusions emanating from a one-perspective culture underscore the wisdom, if not the necessity, of a dual-perspective analysis, in which the counterfactual notation Y_x(u) is governed by the formal semantics of graphs, structural equations and open-mindedness.
I just saw Larry Wasserman's comment. Larry is right, I do not operate in an "entirely different conceptual framework." I call the [X × Y] → [0,1] function P(Y_x = y) the "causal effect", leaving it up to the investigator to form differences P(Y_1 = y) − P(Y_0 = y), P(Y_8 = 3) − P(Y_5 = 3), or ratios P(Y_1 = y) / P(Y_0 = y), or any other comparison that fits fashion and dogma.
This does not make for a different conceptual framework; it is the common engineering practice of not wasting precious symbols on trivialities. What does call for a possible realignment of conceptual frameworks is what you tell your students about adjustment for intermediaries, and whether big-brother Bayes approves.
Try it.
Best,
Judea
Update 1: Please note that a parallel discussion is also underway on Gelman's blog. You may read the comments by clicking here.
Having just read through this fascinating interchange, I confess to finding Shrier and Pearl's examples and arguments more convincing than Rubin's. At the risk of adding to the confusion, but also in the hope of helping at least some others, let me briefly describe yet another way (related to Pearl's, but with significant differences) of formulating and thinking about the problem. For those who, like me, may be concerned about the need to consider the probabilistic behaviour of counterfactual variables, on the one hand, or deterministic relationships encoded graphically, on the other, this provides an observable-focused, fully stochastic alternative. A full presentation of the essential ideas can be found in Chapters 9 (Confounding and Sufficient Covariates) and 10 (Reduction of Sufficient Covariate) of my online document "Principles of Statistical Causality" <http://www.ucl.ac.uk/Stats/research/reports/psfiles/rr279.pdf>.

Like Pearl, I like to think of "causal inference" as the task of inferring what would happen under a hypothetical intervention, say FE = e, that sets the value of the exposure E at e, when the data available are collected, not under the target "interventional regime", but under some different "observational regime". We could code this regime as FE = idle. We can think of the non-stochastic variable FE as a parameter, indexing the joint distribution of all the variables in the problem, under the regime indicated by its value. It should be obvious that, even to begin to think about the task of using data collected under one regime to infer about the properties of another, we need to make (and should attempt to justify!) assumptions as to how the regimes are related.

Suppose the response of interest is Y, and we also measure additional variables X (all symbols may represent collections of variables). We call X a sufficient covariate when we can assume that the following two conditions hold:

1. X ind FE
2. Y ind FE | (X, E)

Here (in the absence of the special symbol for independence) A ind B | C denotes that A is independent of B given C, or, equivalently, that p(a | b, c) does not depend on the value b of B (for given a, c). Note that this makes sense even if (as in 1 and 2) B is a parameter variable rather than a random variable. We can handle such "extended conditional independence" (ECI) properties using exactly the same algebraic rules as for regular probabilistic conditional independence (CI). And, if desired, we can use graphical representations (which explicitly include parameter variables along with random variables) to represent and manipulate ECI properties, exactly as for CI. The graph representing 1 and 2 would have arrows from FE to E, from X to E and to Y, and from E to Y.

Assumption 1 says that the distribution of X is the same in all regimes, be they interventional or observational: this may well be reasonable if X is a "pre-treatment" variable. More important, Assumption 2 says that the distribution of Y, given both E and X, is the same in all regimes: that is to say, we do not need to know whether the value of E arose by intervention or "naturally": this conditional distribution is a stable "modular component" that can be learned from the observational regime (so long as we can observe X as well as E and Y), and then transferred to the interventional regime. Even if we restrict to pre-treatment variables, this is a strong additional condition, which may hold for some (non-unique) choices and fail for others.
In particular, if we do have a sufficient covariate, there is no reason that this property should be preserved when we add or subtract components of X. Bayesianism has nothing to do with it. When — and to a large extent only when — X is a sufficient covariate in the above sense does it make causal sense to "adjust for" X (e.g. by applying Pearl's "back-door" formula).

An interesting question is "When can we reduce X?", i.e. find a non-trivial function V of X that is itself a sufficient covariate, so simplifying the adjustment task. One easily verified case is when V is the propensity score based on X, in which case E ind X | (V, FE) (though of course if X is NOT itself initially sufficient, then neither, typically, will V be). Another is when the (modular) distribution of Y given X and E in fact only depends on V and E.

There are some parallels between the concept of covariate sufficiency and Fisher's concept of a sufficient statistic, but also important differences. In particular, if we have identified two different sufficient covariates, V and W, there need be no way to combine them: neither their union (V, W), nor the information Z common to both of them, need be a sufficient covariate. The required properties simply do not follow from Assumptions 1 and 2, and counterexamples are readily provided.

To turn to Shrier's "M-bias" example, we can turn Figure 1 of his original letter (doi: 10.1002/sim.3172) into a graphical representation of ECI properties simply by adding an additional parameter node FE and an arrow from FE to E. The graph then encodes, by d-separation, Assumptions 1 and 2, where Y is "outcome", and X is, alternatively, either U1 or U2. Thus each of U1 and U2 is a sufficient covariate (as, in this special case, is the information common to them both — which is null). But although Assumption 1 holds for X = C, Assumption 2 for X = C is NOT a consequence of d-separation, and does not follow from the assumptions made: so there is no reason to expect C to be a sufficient covariate — and it typically will not be. In the absence of sufficiency, we can expect adjustment to lead to a mismatch between the quantity estimated in the observational regime and the target causal quantity of the interventional regime — which is my interpretation of the term "bias".
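A minimal numerical sketch of the adjustment those two assumptions license (all names, coefficients and regimes below are invented for illustration): X is a sufficient covariate, p(y | x, e) is the stable modular component, and averaging the X-specific contrasts over p(x) recovers the interventional effect that the raw observational contrast misses.

```python
# Hypothetical observational regime (FE = idle): a binary covariate X
# influences both the exposure E and the outcome Y.  Because X is a
# sufficient covariate, averaging the X-specific contrasts over p(x)
# (the back-door formula) recovers the interventional effect of E on Y,
# while the raw contrast is confounded.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

X = rng.integers(0, 2, size=n)                   # sufficient covariate
E = rng.binomial(1, np.where(X == 1, 0.8, 0.2))  # exposure depends on X
Y = 1.0 * E + 2.0 * X + rng.normal(size=n)       # true effect of E on Y is 1.0

naive = Y[E == 1].mean() - Y[E == 0].mean()      # confounded contrast, ~ 2.2

adjusted = sum(                                  # back-door adjustment, ~ 1.0
    (Y[(E == 1) & (X == x)].mean() - Y[(E == 0) & (X == x)].mean()) * (X == x).mean()
    for x in (0, 1)
)

print("raw observational contrast: ", round(naive, 3))
print("back-door adjusted contrast:", round(adjusted, 3))
```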
Comment by Philip Dawid — July 7, 2009 @ 6:32 am
Judea, I agree with you but would like to offer a minor correction: Your letter quotes Rosenbaum as saying “there is no reason to avoid adjustment for a variable describing subjects before treatment”. Rosenbaum's exact words were: "In principle, there is little or no reason to avoid adjustment for a true covariate, a variable describing subjects before treatment." I think you really need to restore "little or".
Comment by L — July 11, 2009 @ 1:23 am