Causal Analysis in Theory and Practice

July 9, 2016

The Three Layer Causal Hierarchy

Filed under: Causal Effect,Counterfactual,Discussion,structural equations — bryantc @ 8:57 pm

Recent discussions concerning causal mediation gave me the impression that many researchers in the field are not familiar with the ramifications of the Causal Hierarchy, as articulated in Chapter 1 of Causality (2000, 2009). This note presents the Causal Hierarchy in table form (Fig. 1) and discusses the distinctions between its three layers: 1. Association, 2. Intervention, 3. Counterfactuals.


June 28, 2016

On the Classification and Subsumption of Causal Models

Filed under: Causal Effect,Counterfactual,structural equations — bryantc @ 5:32 pm

From Christos Dimitrakakis:

>> To be honest, there is such a plethora of causal models, that it is not entirely clear what subsumes what, and which one is equivalent to what. Is there a simple taxonomy somewhere? I thought that influence diagrams were sufficient for all causal questions, for example, but one of Pearl’s papers asserts that this is not the case.

Reply from J. Pearl:

Dear Christos,

From my perspective, I do not see a plethora of causal models at all, so it is hard for me to answer your question in specific terms. What I do see is a symbiosis of all causal models in one framework, called Structural Causal Model (SCM) which unifies structural equations, potential outcomes, and graphical models. So, for me, the world appears simple, well organized, and smiling. Perhaps you can tell us what models lured your attention and caused you to see a plethora of models lacking subsumption taxonomy.

The taxonomy that has helped me immensely is the three-level hierarchy described in chapter 1 of my book Causality: 1. association, 2. intervention, and 3. counterfactuals. It is a useful hierarchy because it has an objective criterion for the classification: You cannot answer questions at level i unless you have assumptions from level i or higher.

As to influence diagrams, the relation between them and SCM is discussed in Section 11.6 of my book Causality (2009). Influence diagrams belong to the 2nd layer of the causal hierarchy, together with Causal Bayesian Networks. They lack, however, two facilities:

1. The ability to process counterfactuals.
2. The ability to handle novel actions.

To elaborate,

1. Counterfactual sentences (e.g., Given what I see, I should have acted differently) require functional models. Influence diagrams are built on conditional and interventional probabilities, that is, p(y|x) or p(y|do(x)). There is no interpretation of E(Y_x| x’) in this framework.

2. The probabilities that annotate links emanating from Action Nodes are of the interventional type, p(y|do(x)), and must be assessed judgmentally by the user. No facility is provided for deriving these probabilities from data together with the structure of the graph. Such a derivation is developed in chapter 3 of Causality, in the context of Causal Bayes Networks where every node can turn into an action node.
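
To make item 1 concrete, here is a minimal sketch of why counterfactual sentences need functional models. It is not taken from Causality; the two toy models, their equations, and all function names are illustrative assumptions. Both models yield identical conditional and interventional probabilities, so nothing at layers 1 or 2 can tell them apart, yet they give opposite answers to a layer-3 query.

```python
from fractions import Fraction
from itertools import product

# Two hypothetical structural models over binary X and Y. In both, X = U_X,
# and U_X, U_Y are independent fair coins. Only the equation for Y differs.
def f_a(x, u_y):
    """Model A: X has no effect on Y for any unit."""
    return u_y

def f_b(x, u_y):
    """Model B: X flips Y for every unit."""
    return x ^ u_y

# Each unit is a pair (u_x, u_y); all four units are equally likely.
UNITS = list(product([0, 1], repeat=2))

def p_obs(f, x, y=1):
    """Layer 1: P(Y = y | X = x)."""
    rows = [(u_x, u_y) for u_x, u_y in UNITS if u_x == x]
    hits = sum(1 for u_x, u_y in rows if f(u_x, u_y) == y)
    return Fraction(hits, len(rows))

def p_do(f, x, y=1):
    """Layer 2: P(Y = y | do(X = x)), i.e. X is set to x for every unit."""
    hits = sum(1 for _, u_y in UNITS if f(x, u_y) == y)
    return Fraction(hits, len(UNITS))

def p_counterfactual(f, x_obs, y_obs, x_new, y_new=1):
    """Layer 3: P(Y_{x_new} = y_new | X = x_obs, Y = y_obs)."""
    rows = [(u_x, u_y) for u_x, u_y in UNITS
            if u_x == x_obs and f(u_x, u_y) == y_obs]
    hits = sum(1 for _, u_y in rows if f(x_new, u_y) == y_new)
    return Fraction(hits, len(rows))

for x in (0, 1):
    print(x, p_obs(f_a, x), p_obs(f_b, x), p_do(f_a, x), p_do(f_b, x))

# "Given that I saw X = 0 and Y = 0, would Y have been 1 had X been 1?"
print(p_counterfactual(f_a, 0, 0, 1), p_counterfactual(f_b, 0, 0, 1))
```

In this sketch every layer-1 and layer-2 quantity equals 1/2 in both models, while the counterfactual probability is 0 in Model A and 1 in Model B. A representation that stores only p(y|x) and p(y|do(x)) therefore has no way of choosing between the two answers.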

Using the causal hierarchy, the 1st Law of Counterfactuals and the unification provided by SCM, the space of causal models should shine in clarity and simplicity. Try it, and let us know of any questions remaining.


June 20, 2016

Recollections from the WCE conference at Stanford

Filed under: Counterfactual,General,Mediated Effects,structural equations — bryantc @ 7:45 am

On May 21, Kosuke Imai and I participated in a panel on Mediation at the annual meeting of the West Coast Experiment Conference, organized by Stanford Graduate School of Business. The following are some of my recollections from that panel.

We began the discussion by reviewing causal mediation analysis and summarizing the exchange we had on the pages of Psychological Methods (2014).

My slides for the panel can be viewed here:

We ended with a consensus regarding the importance of causal mediation and the conditions for identifying Natural Direct and Indirect Effects, from randomized as well as observational studies.

We proceeded to discuss the symbiosis between the structural and the counterfactual languages. Here I focused on slides 4-6 (page 3), and remarked that only those who are willing to solve a toy problem from beginning to end, using both potential outcomes and DAGs, can understand the tradeoff between the two. Such a toy problem (and its solution) was presented in slide 5 (page 3), titled “Formulating a problem in Three Languages,” and the questions that I asked the audience are still ringing in my ears. Please have a good look at these two sets of assumptions and ask yourself:

a. Have we forgotten any assumption?
b. Are these assumptions consistent?
c. Is any of the assumptions redundant (i.e. does it follow logically from the others)?
d. Do they have testable implications?
e. Do these assumptions permit the identification of causal effects?
f. Are these assumptions plausible in the context of the scenario given?

As I was discussing these questions over slide 5, the audience seemed to be in general agreement with the conclusion that, despite their logical equivalence, the graphical language enables us to answer these questions immediately while the potential outcome language remains silent on all.

I consider this example to be pivotal to the comparison of the two frameworks. I hope that questions a,b,c,d,e,f will be remembered, and speakers from both camps will be asked to address them squarely and explicitly.

The fact that graduate students made up the majority of the participants gives me the hope that questions a,b,c,d,e,f will finally receive the attention they deserve.

As we discussed the virtues of graphs, I found it necessary to reiterate the observation that DAGs are more than just a “natural and convenient way to express assumptions about causal structures” (Imbens and Rubin, 2013, p. 25). Praising their transparency while ignoring their inferential power misses the main role that graphs play in causal analysis. The power of graphs lies in computing complex implications of causal assumptions (i.e., the “science”) no matter in what language they are expressed. Typical implications are: conditional independencies among variables and counterfactuals, what covariates need be controlled to remove confounding or selection bias, whether effects can be identified, and more. These implications could, in principle, be derived from any equivalent representation of the causal assumptions, not necessarily graphical, but not before incurring a prohibitive computational cost. See, for example, what happens when economists try to replace d-separation with graphoid axioms.

Following the discussion of representations, we addressed questions posed to us by the audience, in particular, five questions submitted by Professor Jon Krosnick (Political Science, Stanford).

I summarize them in the following slide:

Krosnick’s Questions to Panel
1) Do you think an experiment has any value without mediational analysis?
2) Is a separate study directly manipulating the mediator useful? How is the second study any different from the first one?
3) Imai’s correlated residuals test seems valuable for distinguishing fake from genuine mediation. Is that so? And how is it related to traditional mediational tests?
4) Why isn’t it easy to test whether participants who show the largest increases in the posited mediator show the largest changes in the outcome?
5) Why is mediational analysis any “worse” than any other method of investigation?
My answers focused on questions 2, 4 and 5, which I summarize below:

Q. Is a separate study directly manipulating the mediator useful?
Answer: Yes, it is useful if physically feasible but, still, it cannot give us an answer to the basic mediation question: “What percentage of the observed response is due to mediation?” The concept of mediation is necessarily counterfactual, i.e. sitting on the top layer of the causal hierarchy (see “Causality” chapter 1). It cannot be defined therefore in terms of population experiments, however clever. Mediation can be evaluated with the help of counterfactual assumptions such as “conditional ignorability” or “no interaction,” but these assumptions cannot be verified in population experiments.

Q. Why isn’t it easy to test whether participants who show the largest increases in the posited mediator show the largest changes in the outcome?
Answer: Translated into counterfactual notation, the suggested test requires the existence of a monotonic function f_m such that, for every individual, we have Y_1 - Y_0 = f_m(M_1 - M_0).

This condition expresses a feature we expect to find in mediation, but it cannot be taken as a DEFINITION of mediation. This condition is essentially the way indirect effects are defined in the Principal Strata framework (Frangakis and Rubin, 2002), the deficiencies of which are well known. See

In particular, imagine a switch S controlling two light bulbs L1 and L2. Positive correlation between L1 and L2 does not mean that L1 mediates between the switch and L2. Many examples of incompatibility are demonstrated in the paper above.
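
The switch example can be spelled out in a few lines. The following is only a sketch under assumed structural equations (L1 = S and L2 = S, with a hypothetical `simulate` helper). It treats L1 as the posited mediator M and L2 as the outcome Y, and shows that the unit-level condition Y_1 - Y_0 = f_m(M_1 - M_0) holds even though L1 transmits nothing to L2.

```python
def simulate(s, do_l1=None):
    """Return (L1, L2) for switch position s.

    Assumed equations: L1 = S and L2 = S. Passing do_l1 overrides the
    equation for L1, mimicking a physical intervention on the first bulb.
    """
    l1 = s if do_l1 is None else do_l1
    l2 = s  # the second bulb listens to the switch only
    return l1, l2

m0, y0 = simulate(0)
m1, y1 = simulate(1)

# The "largest change in M goes with largest change in Y" test passes,
# with f_m taken to be the identity function.
passes_monotone_test = (y1 - y0) == (m1 - m0)

# Total effect of the switch on L2.
total_effect = y1 - y0

# Controlled direct effect: flip the switch while holding L1 fixed at 0.
controlled_direct = simulate(1, do_l1=0)[1] - simulate(0, do_l1=0)[1]

# Effect of L1 on L2: wiggle L1 while the switch stays at 0.
effect_of_l1_on_l2 = simulate(0, do_l1=1)[1] - simulate(0, do_l1=0)[1]

print(passes_monotone_test, total_effect, controlled_direct, effect_of_l1_on_l2)
```

Here the test passes, yet the controlled direct effect equals the total effect and manipulating L1 leaves L2 unchanged, so none of the switch's effect is mediated by L1.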

The conventional mediation tests (in the Baron and Kenny tradition) suffer from the same problem; they test features of mediation that are common in linear systems, but not the essence of mediation which is universal to all systems, linear and nonlinear, continuous as well as categorical variables.

Q. Why is mediational analysis any “worse” than any other method of investigation?
Answer: The answer is closely related to the one given to question 3). Mediation is not a “method” but a property of the population which is defined counterfactually, and therefore requires counterfactual assumptions for evaluation. Experiments are not sufficient, and in this sense mediation is “worse” than other properties under investigation, e.g., causal effects, which can be estimated entirely from experiments.

About the only thing we can ascertain experimentally is whether the (controlled) direct effect differs from the total effect, but we cannot evaluate the extent of mediation.

Another way to appreciate why stronger assumptions are needed for mediation is to note that non-confoundedness is not the same as ignorability. For non-binary variables one can construct examples where X and Y are not confounded (i.e., P(y|do(x)) = P(y|x)) and yet are not ignorable (i.e., Y_x is not independent of X). Mediation requires ignorability in addition to non-confoundedness.
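
One such construction can be checked numerically. The sketch below uses made-up numbers (the table `Q` and the uniform distribution over a three-valued X are assumptions chosen for illustration, not taken from any paper). Each entry Q[a][x] stands for P(Y_a = 1 | X = x); by consistency the diagonal gives the observational quantity P(Y = 1 | X = a), and averaging a row over X gives P(Y = 1 | do(X = a)).

```python
from fractions import Fraction as F

X_VALUES = (0, 1, 2)
P_X = {x: F(1, 3) for x in X_VALUES}  # assumed: X uniform on three values

# Hypothetical table: Q[a][x] = P(Y_a = 1 | X = x), with Y binary.
Q = {
    0: {0: F(5, 10), 1: F(3, 10), 2: F(7, 10)},
    1: {0: F(2, 10), 1: F(4, 10), 2: F(6, 10)},
    2: {0: F(9, 10), 1: F(3, 10), 2: F(6, 10)},
}

def p_do(a):
    """P(Y = 1 | do(X = a)) = P(Y_a = 1), by the law of total probability."""
    return sum(Q[a][x] * P_X[x] for x in X_VALUES)

def p_obs(a):
    """P(Y = 1 | X = a) = P(Y_a = 1 | X = a), by consistency."""
    return Q[a][a]

# Non-confoundedness: interventional and observational quantities agree.
unconfounded = all(p_do(a) == p_obs(a) for a in X_VALUES)

# Ignorability: Y_a independent of X, i.e. every row of Q is constant.
ignorable = all(Q[a][x] == p_do(a) for a in X_VALUES for x in X_VALUES)

print(unconfounded, ignorable)
```

With these numbers the first check succeeds and the second fails. The trick needs at least three values of X: when X is binary (and both values have positive probability), matching the diagonal entry to the row average forces the off-diagonal entry to match as well.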

Overall, the panel was illuminating, primarily due to the active participation of curious students. It gave me good reasons to believe that Political Science is destined to become a bastion of modern causal analysis. I wish economists would follow suit, despite the hurdles they face in getting causal analysis to economics education.


August 11, 2015

Mid-Summer Greeting from the UCLA Causality Blog

Filed under: Announcement,Causal Effect,Counterfactual,General — moderator @ 6:09 pm

Friends in causality research,

This mid-summer greeting of UCLA Causality blog contains:
A. News items concerning causality research
B. Discussions and scientific results

1. The next issue of the Journal of Causal Inference is scheduled to appear this month, and the table of contents can be viewed here.

2. A new digital journal “Observational Studies” is out this month (link) and its first issue is dedicated to the legacy of William Cochran (1909-1980).

My contribution to this issue can be viewed here:

See also comment 1 below.

3. A video recording of my Cassel Lecture at the SER conference, June 2015, Denver, CO, can be viewed here:

4. A video of a conversation with Robert Gould concerning the teaching of causality can be viewed on Wiley’s Statistics Views, link (2 parts, scroll down).

5. We are informed of the upcoming publication of a new book, Rex Kline’s “Principles and Practice of Structural Equation Modeling,” Fourth Edition (link). Judging by the chapters I read, this book promises to be unique; it treats structural equation models for what they are: carriers of causal assumptions and tools for causal inference. Kudos, Rex.

6. We are informed of another book on causal inference: Imbens, Guido W.; Rubin, Donald B. “Causal Inference in Statistics, Social, and Biomedical Sciences: An Introduction” Cambridge University Press (2015). Readers will quickly realize that the ideas, methods, and tools discussed on this blog were kept out of this book. Omissions include: Control of confounding, testable implications of causal assumptions, visualization of causal assumptions, generalized instrumental variables, mediation analysis, moderation, interaction, attribution, external validity, explanation, representation of scientific knowledge and, most importantly, the unification of potential outcomes and structural models.

Given that the book is advertised as describing “the leading analysis methods” of causal inference, unsuspecting readers will get the impression that the field as a whole is facing fundamental obstacles, and that we are still lacking the tools to cope with basic causal tasks such as confounding control and model testing. I do not believe mainstream methods of causal inference are in such a state of helplessness.

The authors’ motivation and rationale for this exclusion were discussed at length on this blog. See
“Are economists smarter than epidemiologists”

and “On the First Law of Causal Inference”

As most of you know, I have spent many hours trying to explain to leaders of the potential outcome school what insights and tools their students would be missing if not given exposure to a broader intellectual environment, one that embraces model-based inferences side by side with potential outcomes.

This book confirms my concerns, and its insularity-based impediments are likely to evoke interesting public discussions on the subject. For example, educators will undoubtedly wish to ask:

(1) Is there any guidance we can give students on how to select covariates for matching or adjustment?

(2) Are there any tools available to help students judge the plausibility of ignorability-type assumptions?

(3) Aren’t there any methods for deciding whether identifying assumptions have testable implications?

I believe that if such questions are asked often enough, they will eventually evoke non-ignorable answers.

7. The ASA issued a press release yesterday, recognizing Tyler VanderWeele’s new book “Explanation in Causal Inference,” winner of the 2015 Causality in Statistics Education Award.

Congratulations, Tyler.

Information on nominations for the 2016 Award will soon be announced.

8. Since our last Greetings (Spring, 2015) we have had a few lively discussions posted on this blog. I summarize them below:

8.1. Indirect Confounding and Causal Calculus
(How getting too anxious to criticize do-calculus may cause you to miss an easy solution to a problem you thought was hard).
July 23, 2015

8.2. Does Obesity Shorten Life? Or is it the Soda?
(Discusses whether it was the earth that caused the apple to fall, or the gravitational field created by the earth.)
May 27, 2015

8.3. Causation without Manipulation
(Asks whether anyone takes this mantra seriously nowadays, and whether we need manipulations to store scientific knowledge)
May 14, 2015

8.4. David Freedman, Statistics, and Structural Equation Models
(On why Freedman invented “response schedule”?)
May 6, 2015

8.5. We also had a few breakthroughs posted on our technical report page

My favorites this summer are these two:
because they deal with the tough and long-standing problem:
“How generalizable are empirical studies?”

Enjoy the rest of the summer

May 6, 2015

David Freedman, Statistics, and Structural Equation Models

Filed under: Causal Effect,Counterfactual,Definition,structural equations — moderator @ 12:40 am

(Re-edited: 5/6/15, 4 pm)

Michael A Lewis (Hunter College) sent us the following query:

Dear Judea,
I was reading a book by the late statistician David Freedman and in it he uses the term “response schedule” to refer to an equation which represents a causal relationship between variables. It appears that he’s using that term as a synonym for “structural equation,” the one you use. In your view, am I correct in regarding these as synonyms? Also, Freedman seemed to be of the belief that response schedules only make sense if the causal variable can be regarded as amenable to manipulation. So variables like race, gender, maybe even socioeconomic status, etc. cannot sensibly be regarded as causes since they can’t be manipulated. I’m wondering what your view is of this manipulation perspective.

My answer is: Yes. Freedman’s “response schedule” is a synonym for “structural equation.” The reason why Freedman did not say so explicitly has to do with his long and rather bumpy journey from statistical to causal thinking. Freedman, like most statisticians in the 1980’s, could not make sense of the Structural Equation Models (SEM) that social scientists (e.g., Duncan) and econometricians (e.g., Goldberger) had adopted for representing causal relations. As a result, he criticized and ridiculed this enterprise relentlessly. In his (1987) paper “As others see us,” for example, he went as far as “proving” that the entire enterprise is grounded in logical contradictions. The fact that SEM researchers at that time could not defend their enterprise effectively (they were as confused about SEM as statisticians, judging by the way they responded to his paper) only intensified Freedman’s criticism. It continued well into the 1990’s, with renewed attacks on anything connected with causality, including the causal search program of Spirtes, Glymour and Scheines.

I have had a long and friendly correspondence with Freedman since 1993 and, going over a file of over 200 emails, it appears that it was around 1994 when he began to convert to causal thinking. First through the do-operator (by his own admission) and, later, by realizing that structural equations offer a neat way of encoding counterfactuals.

I speculate that the reason Freedman could not say plainly that causality is based on structural equations was that it would have been too hard for him to admit that he was in error criticizing a model that he misunderstood, and that is so simple to understand. This oversight was not entirely his fault; for someone trying to understand the world from a statistical viewpoint, structural equations do not make any sense; the asymmetric nature of the equations and those slippery “error terms” stand outside the prism of the statistical paradigm. Indeed, even today, very few statisticians feel comfortable in the company of structural equations. (How many statistics textbooks do we know that discuss structural equations?)

So, what do you do when you come to realize that a concept you ridiculed for 20 years is the key to understanding causation? Freedman decided not to say “I erred”, but to argue that the concept was not rigorous enough for statisticians to understand. He thus formalized “response schedule” and treated it as a novel mathematical object. The fact is, however, that if we strip “response schedule” of its superlatives, we find that it is just what you and I call a “function,” i.e., a mapping from the states of one variable onto the states of another. Some of Freedman’s disciples admire this invention (see R. Berk’s 2004 book on regression) but most people that I know just look at it and say: This is what a structural equation is.

The story of David Freedman is the story of statistical science itself and the painful journey the field has taken through the causal reformation. Starting with the structural equations of Sewall Wright (1921), and going through Freedman’s “response schedule”, the field still can’t swallow the fundamental building block of scientific thinking, in which Nature is encoded as a society of sensing and responding variables. Funny, econometrics has yet to start its reformation, though it has been housing SEM since Haavelmo (1943). (How many econometrics textbooks do we know which teach students how to read counterfactuals from structural equations?)

I now go to your second question, concerning the mantra “no causation without manipulation.” I do not believe anyone takes this slogan as a restriction nowadays, including its authors, Holland and Rubin. It will remain a relic of an era when statisticians tried to define causation with the only mental tool available to them: the randomized controlled trial (RCT).

I summed it up in Causality, 2009, p. 361: “To suppress talk about how gender causes the many biological, social, and psychological distinctions between males and females is to suppress 90% of our knowledge about gender differences.”

I further elaborated on this issue in (Bollen and Pearl 2014 p. 313) saying:

“Pearl (2011) further shows that this restriction has led to harmful consequence by forcing investigators to compromise their research questions only to avoid the manipulability restriction. The essential ingredient of causation, as argued in Pearl (2009: 361), is responsiveness, namely, the capacity of some variables to respond to variations in other variables, regardless of how those variations came about.”

In (Causality 2009 p. 361) I also find this paragraph: “It is for that reason, perhaps, that scientists invented counterfactuals; it permits them to state and conceive the realization of antecedent conditions without specifying the physical means by which these conditions are established.”

All in all, you have touched on one of the most fascinating chapters in the history of science, featuring a respectable scientific community that clings desperately to an outdated dogma, while resisting, adamantly, the light that shines around it. This chapter deserves a major headline in Kuhn’s book on scientific revolutions. As I once wrote: “It is easier to teach Copernicus in the Vatican than discuss causation with a statistician.” But this was in the 1990’s, before causal inference became fashionable. Today, after a vicious 100-year war of reformation, things are beginning to change. I hope your upcoming book further accelerates the transition.

April 24, 2015

Flowers of the First Law of Causal Inference (3)

Flower 3 — Generalizing experimental findings

Continuing our examination of “the flowers of the First Law” (see previous flowers here and here) this posting looks at one of the most crucial questions in causal inference: “How generalizable are our randomized clinical trials?” Readers of this blog would be delighted to learn that one of our flowers provides an elegant and rather general answer to this question. I will describe this answer in the context of transportability theory, and compare it to the way researchers have attempted to tackle the problem using the language of ignorability. We will see that ignorability-type assumptions are fairly limited, both in their ability to define conditions that permit generalizations, and in our ability to justify them in specific applications.

1. Transportability and Selection Bias
The problem of generalizing experimental findings from the trial sample to the population as a whole, also known as the problem of “sample selection-bias” (Heckman, 1979; Bareinboim et al., 2014), has received wide attention lately, as more researchers come to recognize this bias as a major threat to the validity of experimental findings in both the health sciences (Stuart et al., 2015) and social policy making (Manski, 2013).

Since participation in a randomized trial cannot be mandated, we cannot guarantee that the study population would be the same as the population of interest. For example, the study population may consist of volunteers, who respond to financial and medical incentives offered by pharmaceutical firms or experimental teams, so the distribution of outcomes in the study may differ substantially from the distribution of outcomes under the policy of interest.

Another impediment to the validity of experimental findings is that the types of individuals in the target population may change over time. For example, as more individuals become eligible for health insurance, the types of individuals seeking services would no longer match the types of individuals that were sampled for the study. A similar change would occur as more individuals become aware of the efficacy of the treatment. The result is an inherent disparity between the target population and the population under study.

The problem of generalizing across disparate populations has received a formal treatment in (Pearl and Bareinboim, 2014) where it was labeled “transportability,” and where necessary and sufficient conditions for valid generalization were established (see also Bareinboim and Pearl, 2013). The problem of selection bias, though it has some unique features, can also be viewed as a nuance of the transportability problem, thus inheriting all the theoretical results established in (Pearl and Bareinboim, 2014) that guarantee valid generalizations. We will describe the two problems side by side and then return to the distinction between the type of assumptions that are needed for enabling generalizations.

The transportability problem concerns two dissimilar populations, Π and Π*, and requires us to estimate the average causal effect P*(y_x) (explicitly: P*(y_x) ≡ P*(Y = y | do(X = x))) in the target population Π*, based on experimental studies conducted on the source population Π. Formally, we assume that all differences between Π and Π* can be attributed to a set of factors S that produce disparities between the two, so that P*(y_x) = P(y_x | S = 1). The information available to us consists of two parts: first, treatment effects estimated from experimental studies in Π and, second, observational information extracted from both Π and Π*. The former can be written P(y | do(x), z), where Z is a set of covariates measured in the experimental study, and the latter are written P*(x, y, z) = P(x, y, z | S = 1) and P(x, y, z), respectively. In addition to this information, we are also equipped with a qualitative causal model M that encodes causal relationships in Π and Π*, with the help of which we need to identify the query P*(y_x). Mathematically, identification amounts to transforming the query expression

P*(y_x) = P(y | do(x), S = 1)

into a form derivable from the available information I_TR, where

I_TR = { P(y | do(x), z), P(x, y, z | S = 1), P(x, y, z) }.

The selection bias problem is slightly different. Here the aim is to estimate the average causal effect P(y_x) in the Π population, while the experimental information available to us, I_SB, comes from a preferentially selected sample, S = 1, and is given by P(y | do(x), z, S = 1). Thus, the selection bias problem calls for transforming the query P(y_x) to a form derivable from the information set:

I_SB = { P(y | do(x), z, S = 1), P(x, y, z | S = 1), P(x, y, z) }.

In the Appendix section, we demonstrate how transportability problems and selection bias problems are solved using the transformations described above.

The analysis reported in (Pearl and Bareinboim, 2014) has resulted in an algorithmic criterion (Bareinboim and Pearl, 2013) for deciding whether transportability is feasible and, when confirmed, the algorithm produces an estimand for the desired effects. The algorithm is complete, in the sense that, when it fails, a consistent estimate of the target effect does not exist (unless one strengthens the assumptions encoded in M).

There are several lessons to be learned from this analysis when considering selection bias problems.

1. The graphical criteria that authorize transportability are applicable to selection bias problems as well, provided that the graph structures for the two problems are identical. This means that whenever a selection bias problem is characterized by a graph for which transportability is feasible, recovery from selection bias is feasible by the same algorithm. (The Appendix demonstrates this correspondence).

2. The graphical criteria for transportability are more involved than the ones usually invoked in testing treatment assignment ignorability (e.g., through the back-door test). They may require several d-separation tests on several sub-graphs. It is utterly unimaginable therefore that such criteria could be managed by unaided human judgment, no matter how ingenious. (See discussions with Guido Imbens regarding computational barriers to graph-free causal inference.) Graph avoiders should reckon with this predicament.

3. In general, problems associated with external validity cannot be handled by balancing disparities between distributions. The same disparity between P(x, y, z) and P*(x, y, z) may demand different adjustments, depending on the location of S in the causal structure. A simple example of this phenomenon is demonstrated in Fig. 3(b) of (Pearl and Bareinboim, 2014) where a disparity in the average reading ability of two cities requires two different treatments, depending on what causes the disparity. If the disparity emanates from age differences, adjustment is necessary, because age is likely to affect the potential outcomes. If, on the other hand, the disparity emanates from differences in educational programs, no adjustment is needed, since education, in itself, does not modify response to treatment. The distinction is made formal and vivid in causal graphs.

4. In many instances, generalizations can be achieved by conditioning on post-treatment variables, an operation that is frowned upon in the potential-outcome framework (Rosenbaum, 2002, pp. 73–74; Rubin, 2004; Sekhon, 2009) but has become extremely useful in graphical analysis. The difference between the conditioning operators used in these two frameworks is echoed in the difference between Q_c and Q_do, the two z-specific effects discussed in a previous posting on this blog (link). The latter defines information that is estimable from experimental studies, whereas the former invokes a retrospective counterfactual that may or may not be estimable empirically.

In the next Section we will discuss the benefit of leveraging the do-operator in problems concerning generalization.

2. Ignorability versus Admissibility in the Pursuit of Generalization

A key assumption in almost all conventional analyses of generalization (from sample to population) is S-ignorability, written Y_x ⊥ S | Z, where Y_x is the potential outcome predicated on the intervention X = x, S is a selection indicator (with S = 1 standing for selection into the sample) and Z is a set of observed covariates. This condition, sometimes written as a difference Y_1 − Y_0 ⊥ S | Z, and sometimes as a conjunction {Y_1, Y_0} ⊥ S | Z, appears in Hotz et al. (2005); Cole and Stuart (2010); Tipton et al. (2014); Hartman et al. (2015), and possibly in the work of other researchers committed to potential-outcome analysis. This assumption says: If we succeed in finding a set Z of pre-treatment covariates such that cross-population differences disappear in every stratum Z = z, then the problem can be solved by averaging over those strata. (Lacking a procedure for finding Z, this solution avoids the harder part of the problem and, in this sense, it somewhat borders on the circular. It amounts to saying: If we can solve the problem in every stratum Z = z then the problem is solved; hardly an informative statement.)

In graphical analysis, on the other hand, the problem of generalization has been studied using another condition, labeled S-admissibility (Pearl and Bareinboim, 2014), which is defined by:

P(y | do(x), z) = P(y | do(x), z, s)

or, using counterfactual notation,

P(y_x | z_x) = P(y_x | z_x, s_x)

It states that in every treatment regime X = x, the observed outcome Y is conditionally independent of the selection mechanism S, given Z, all evaluated at that same treatment regime.

Clearly, S-admissibility coincides with S-ignorability for pre-treatment S and Z; the two notions differ, however, for treatment-dependent covariates. The Appendix presents scenarios (Fig. 1(a) and (b)) in which post-treatment covariates Z do not satisfy S-ignorability, but satisfy S-admissibility and, thus, enable generalization to take place. We also present scenarios where both S-ignorability and S-admissibility hold and, yet, experimental findings are not generalizable by standard procedures of post-stratification. Rather, the correct procedure is uncovered naturally from the graph structure.

One of the reasons that S-admissibility has received greater attention in the graph-based literature is that it has a very simple graphical representation: Z and X should separate Y from S in a mutilated graph, from which all arrows entering X have been removed. Such a graph depicts conditional independencies among observed variables in the population under experimental conditions, i.e., where X is randomized.
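The graphical test just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph (X → Z → Y with S → Z, so that Z is a post-treatment covariate) is a hypothetical example rather than one of the figures in the Appendix, and the helper names `remove_incoming`, `ancestors` and `d_separated` are invented here. Separation is checked with the standard moralized ancestral graph construction.

```python
from itertools import combinations

def remove_incoming(edges, node):
    """Mutilated graph: delete every arrow entering `node`."""
    return {(a, b) for (a, b) in edges if b != node}

def ancestors(edges, nodes):
    """The given nodes together with all of their ancestors."""
    result, frontier = set(nodes), list(nodes)
    while frontier:
        child = frontier.pop()
        for a, b in edges:
            if b == child and a not in result:
                result.add(a)
                frontier.append(a)
    return result

def d_separated(edges, a, b, given):
    """True if `given` d-separates node a from node b in the DAG `edges`."""
    keep = ancestors(edges, {a, b} | set(given))
    sub = {(p, c) for (p, c) in edges if p in keep and c in keep}
    # Moralize: drop directions and link parents that share a child.
    undirected = {frozenset(e) for e in sub}
    for child in keep:
        parents = [p for (p, c) in sub if c == child]
        undirected |= {frozenset(pair) for pair in combinations(parents, 2)}
    # Search for a path from a to b that avoids the conditioning set.
    seen, frontier = {a}, [a]
    while frontier:
        node = frontier.pop()
        for e in undirected:
            if node in e:
                (other,) = e - {node}
                if other not in given and other not in seen:
                    seen.add(other)
                    frontier.append(other)
    return b not in seen

# Hypothetical graph: X -> Z -> Y, with the selection node S pointing into Z.
graph = {("X", "Z"), ("Z", "Y"), ("S", "Z")}
mutilated = remove_incoming(graph, "X")  # X has no parents here, so nothing is cut

s_admissible = d_separated(mutilated, "Y", "S", {"Z", "X"})
without_z = d_separated(mutilated, "Y", "S", {"X"})
print(s_admissible, without_z)
```

In this toy graph, Z together with X separates Y from S, while X alone does not, which is the pattern one would look for when judging whether a candidate Z is S-admissible.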

In contrast, S-ignorability has not been given a simple graphical interpretation, but it can be verified from either twin networks (Causality, pp. 213-4) or from counterfactually augmented graphs (Causality, p. 341), as we have demonstrated in an earlier posting on this blog (link). Using either representation, it is easy to see that S-ignorability is rarely satisfied in transportability problems in which Z is a post-treatment variable. This is because, whenever S is a proxy to an ancestor of Z, Z cannot separate Y_x from S.

The simplest result of both PO and graph-based approaches is the re-calibration or post-stratification formula. It states that, if Z is a set of pre-treatment covariates satisfying S-ignorability (or S-admissibility), then the causal effect in the population at large can be recovered from a selection-biased sample by a simple re-calibration process. Specifically, if P(y_x|S = 1, Z = z) is the z-specific probability distribution of Y_x in the sample, then the distribution of Y_x in the population at large is given by

P(y_x) = ∑_z P(y_x|S = 1, z) P(z)  (*)

where P(z) is the probability of Z = z in the target population (where S = 0). Equation (*) follows from S-ignorability by conditioning on z and adding S = 1 to the conditioning set – a one-line proof. The proof fails however when Z is treatment dependent, because the counterfactual factor P(y_x|S = 1, z) is not normally estimable in the experimental study. (See Qc vs. Qdo discussion here).
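As a minimal numerical sketch of Eq. (*), the snippet below re-weights stratum-specific experimental results by the target population's distribution of Z. All numbers, the stratum labels and the function name `recalibrate` are hypothetical, and the calculation is meaningful only if Z is a pre-treatment set satisfying S-ignorability (or S-admissibility), which the code cannot check.

```python
def recalibrate(p_yx_given_z, p_z):
    """Eq. (*): sum over z of P(y_x | S=1, z) * P(z)."""
    return sum(p_yx_given_z[z] * p_z[z] for z in p_z)

# Hypothetical z-specific results from the experimental sample (S = 1).
p_yx_given_z_sample = {"young": 0.30, "old": 0.60}

# Hypothetical distributions of Z in the sample and in the target population.
p_z_sample = {"young": 0.80, "old": 0.20}
p_z_target = {"young": 0.40, "old": 0.60}

in_sample = recalibrate(p_yx_given_z_sample, p_z_sample)   # weights of the sample itself
in_target = recalibrate(p_yx_given_z_sample, p_z_target)   # weights of the target population
print(round(in_sample, 2), round(in_target, 2))
```

The gap between the two numbers is produced entirely by the different age mix of the two populations; whether Z is the right set to re-weight by is a causal question that the arithmetic does not answer.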

As noted in (Keiding, 1987) this re-calibration formula goes back to 18th century demographers (Dale, 1777; Tetens, 1786) facing the task of predicting overall mortality (across populations) from age-specific data. Their reasoning was probably as follows: If the source and target populations differ in distribution by a set of attributes Z, then to correct for these differences we need to weight samples by a factor that would restore similarity to the two distributions. Some researchers view Eq. (*) as a version of Horvitz and Thompson (1952) post-stratification method of estimating the mean of a super-population from un-representative stratified samples. The essential difference between survey sampling calibration and the calibration required in Eq. (*) is that the calibrating covariates Z are not just any set by which the distributions differ; they must satisfy the S-ignorability (or admissibility) condition, which is a causal, not a statistical condition. It is not discernible therefore from distributions over observed variables. In other words, the re-calibration formula should depend on disparities between the causal models of the two populations, not merely on distributional disparities. This is demonstrated explicitly in Fig. 4(c) of (Pearl and Bareinboim, 2014), which is also treated in the Appendix (Fig. 1(a)).

While S-ignorability and S-admissibility are both sufficient for re-calibrating pre-treatment covariates Z, S-admissibility goes further and permits generalizations in cases where Z consists of post-treatment covariates. A simple example is the bio-marker model shown in Fig. 4(c) (Example 3) of (Pearl and Bareinboim, 2014), which is also discussed in the Appendix.


1. Many opportunities for generalization are opened up through the use of post-treatment variables. These opportunities remain inaccessible to ignorability-based analysis, partly because S-ignorability does not always hold for such variables but, mainly, because ignorability analysis requires information in the form of z-specific counterfactuals, which is often not estimable from experimental studies.

2. Most of these opportunities have been charted through the completeness results for transportability (Bareinboim et al., 2014); others can be revealed by simple derivations in do-calculus, as shown in the Appendix.

3. There is still the issue of assisting researchers in judging whether S-ignorability (or S-admissibility) is plausible in any given application. Graphs excel in this dimension because graphs match the format in which people store scientific knowledge. Some researchers prefer to do it by direct appeal to intuition; they do so at their own peril.

For references and appendix, click here.

January 22, 2015

Flowers of the First Law of Causal Inference (2)

Flower 2 — Conditioning on post-treatment variables

In this 2nd flower of the First Law, I share with readers interesting relationships among various ways of extracting information from post-treatment variables. These relationships came up in conversations with readers, students and curious colleagues, so I will present them in a question-and-answer format.

Rule 2 of do-calculus does not distinguish post-treatment from pre-treatment variables. Thus, regardless of the nature of Z, it permits us to replace P (y|do(x), z) with P (y|x, z) whenever Z separates X from Y in a mutilated graph GX (i.e., the causal graph, from which arrows emanating from X are removed). How can this rule be correct, when we know that one should be careful about conditioning on a post-treatment variable Z?

Example 1 Consider the simple causal chain X → Y → Z. We know that if we condition on Z (as in case-control studies) selected units cease to be representative of the population, and we cannot identify the causal effect of X on Y even when X is randomized. Applying Rule 2, however, we get P (y|do(x), z) = P (y|x, z), since X and Y are separated in the mutilated graph X   Y → Z (the arrow from X to Y has been removed). This tells us that the causal effect of X on Y IS identifiable conditioned on Z. Something must be wrong here.

To read more, click here.

December 22, 2014

Flowers of the First Law of Causal Inference

Filed under: Counterfactual,Definition,General,structural equations — judea @ 5:22 am

Flower 1 — Seeing counterfactuals in graphs

Some critics of structural equations models and their associated graphs have complained that those graphs depict only observable variables but: “You can’t see the counterfactuals in the graph.” I will soon show that this is not the case; counterfactuals can in fact be seen in the graph, and I regard it as one of many flowers blooming out of the First Law of Causal Inference (see here). But, first, let us ask why anyone would be interested in locating counterfactuals in the graph.

This is not a rhetorical question. Those who deny the usefulness of graphs will surely not yearn to find counterfactuals there. For example, researchers in the Imbens-Rubin camp who, ostensibly, encode all scientific knowledge in the “Science” = Pr(W,X,Y(0),Y(1)), can, theoretically, answer all questions about counterfactuals straight from the “science”; they do not need graphs.

On the other extreme we have students of SEM, for whom counterfactuals are but byproducts of the structural model (as the First Law dictates); so, they too do not need to see counterfactuals explicitly in their graphs. For these researchers, policy intervention questions do not require counterfactuals, because those can be answered directly from the SEM-graph, in which the nodes are observed variables. The same applies to most counterfactual questions, for example, the effect of treatment on the treated (ETT) and mediation problems; graphical criteria have been developed to determine their identification conditions, as well as their resulting estimands (see here and here).

So, who needs to see counterfactual variables explicitly in the graph?

There are two camps of researchers who may benefit from such representation. First, researchers in the Morgan-Winship camp (link here) who are using, interchangeably, both graphs and potential outcomes. These researchers prefer to do the analysis using probability calculus, treating counterfactuals as ordinary random variables, and use graphs only when the algebra becomes helpless. Helplessness arises, for example, when one needs to verify whether causal assumptions that are required in the algebraic derivations (e.g., ignorability conditions) hold true in one’s model of reality. These researchers understand that “one’s model of reality” means one’s graph, not the “Science” = Pr(W,X,Y(0),Y(1)), which is cognitively inaccessible. So, although most of the needed assumptions can be verified without counterfactuals from the SEM-graph itself (e.g., through the back door condition), the fact that their algebraic expressions already carry counterfactual variables makes it more convenient to see those variables represented explicitly in the graph.

The second camp of researchers are those who do not believe that scientific knowledge is necessarily encoded in an SEM-graph. For them, the “Science” = Pr(W,X,Y(0),Y(1)), is the source of all knowledge and assumptions, and a graph may be constructed, if needed, as an auxiliary tool to represent sets of conditional independencies that hold in Pr(*). [I was surprised to discover sizable camps of such researchers in political science and biostatistics; possibly because they were exposed to potential outcomes prior to studying structural equation models.] These researchers may resort to other graphical representations of independencies, not necessarily SEM-graphs, but occasionally seek the comfort of the meaningful SEM-graph to facilitate counterfactual manipulations. Naturally, they would prefer to see counterfactual variables represented as nodes on the SEM-graph, and use d-separation to verify conditional independencies, when needed.

After this long introduction, let us see where the counterfactuals are in an SEM-graph. They can be located in two ways: first, by augmenting the graph with new nodes that represent the counterfactuals and, second, by mutilating the graph slightly and using existing nodes to represent the counterfactuals.

The first method is illustrated in chapter 11 of Causality (2nd Ed.) and can be accessed directly here. The idea is simple: According to the structural definition of counterfactuals, Y(0) (similarly Y(1)) represents the value of Y under a condition where X is held constant at X=0. Statistical variations of Y(0) would therefore be governed by all exogenous variables capable of influencing Y when X is held constant, i.e. when the arrows entering X are removed. We are done, because connecting these variables to a new node labeled Y(0), Y(1) creates the desired representation of the counterfactual. The book-section linked above illustrates this construction in visual detail.

The second method mutilates the graph and uses the outcome node, Y, as a temporary surrogate for Y(x), with the understanding that the substitution is valid only under the mutilation. The mutilation required for this substitution is dictated by the First Law, and calls for removing all arrows entering the treatment variable X, as illustrated in the following graph (taken from here).

This method has some disadvantages compared with the first; the removal of X’s parents prevents us from seeing connections that might exist between Y_x and the pre-intervention treatment node X (as well as its descendants). To remedy this weakness, Shpitser and Pearl (2009) (link here) retained a copy of the pre-intervention X node, and kept it distinct from the manipulated X node.

Equivalently, Richardson and Robins (2013) split the X node into two parts, one to represent the pre-intervention variable X and the other to represent the constant X=x.

All in all, regardless of which variant you choose, the counterfactuals of interest can be represented as nodes in the structural graph, and inter-connections among these nodes can be used either to verify identification conditions or to facilitate algebraic operations in counterfactual logic.

Note, however, that all these variants stem from the First Law, Y(x) = Y[M_x], which DEFINES counterfactuals in terms of an operation on a structural equation model M.

Finally, to celebrate this “Flower of the First Law” and, thereby, the unification of the structural and potential outcome frameworks, I am posting a flowery photo of Don Rubin and myself, taken during Don’s recent visit to UCLA.

November 29, 2014

On the First Law of Causal Inference

Filed under: Counterfactual,Definition,Discussion,General — judea @ 3:53 am

In several papers and lectures I have used the rhetorical title “The First Law of Causal Inference” when referring to the structural definition of counterfactuals:

Y_x(u) = Y_{M_x}(u)    (1)

The more I talk with colleagues and students, the more I am convinced that the equation deserves the title. In this post, I will explain why.

As many readers of Causality (Ch. 7) would recognize, Eq. (1) defines the potential outcome, or counterfactual, Y_x(u) in terms of a structural equation model M and a submodel, M_x, in which the equation determining X is replaced by a constant X=x. Computationally, the definition is straightforward. It says that, if you want to compute the counterfactual Y_x(u), namely, to predict the value that Y would take, had X been x (in unit U=u), all you need to do is, first, mutilate the model by replacing the equation for X with X=x and, second, solve for Y. What you get IS the counterfactual Y_x(u). Nothing could be simpler.
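The two-step recipe (mutilate, then solve) can be sketched on a toy model. Everything specific below is an assumption made for illustration: the two equations `f_x` and `f_y`, the linear form of `f_y`, and the particular unit u are hypothetical and do not come from Causality.

```python
# Hypothetical structural model M with background variables u = (u_x, u_y).
def f_x(u):
    return u["u_x"]            # equation for X

def f_y(x, u):
    return 2 * x + u["u_y"]    # equation for Y

def solve(u, do_x=None):
    """Solve the model for unit u.

    With do_x given, the equation for X is replaced by the constant X = do_x
    (the submodel M_x); the equation for Y is left untouched.
    """
    x = f_x(u) if do_x is None else do_x
    return {"X": x, "Y": f_y(x, u)}

u = {"u_x": 1, "u_y": 3}        # one particular unit
factual = solve(u)              # what actually happens to this unit
y_0 = solve(u, do_x=0)["Y"]     # counterfactual Y_0(u)
y_1 = solve(u, do_x=1)["Y"]     # counterfactual Y_1(u)
print(factual, y_0, y_1)
```

Because this unit actually has X = 1, its counterfactual Y_1(u) coincides with its observed Y, which is the consistency rule at work in a single unit.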

So, why is it so “fundamental”? Because from this definition we can also get probabilities on counterfactuals (once we assign probabilities, P(U=u), to the units), joint probabilities of counterfactuals and observables, conditional independencies over counterfactuals, graphical visualization of potential outcomes, and many more. [Including, of course, Rubin’s “science”, Pr(X,Y(0),Y(1))]. In short, we get everything that an astute causal analyst would ever wish to define or estimate, given that he/she is into solving serious problems in causal analysis, say policy analysis, or attribution, or mediation. Eq. (1) is “fundamental” because everything that can be said about counterfactuals can also be derived from this definition.
[See the following papers for illustration and operationalization of this definition:
also, Causality chapter 7.]
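To illustrate how probabilities of counterfactuals follow once P(U=u) is assigned, here is a small sketch that enumerates the units of a toy binary model and tabulates the joint distribution Pr(X, Y(0), Y(1)). The model, the three response types of `u_y` and all probabilities are hypothetical choices made only for this example.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical, independent background variables and their probabilities.
P_UX = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_UY = {"never": Fraction(1, 5), "helped": Fraction(1, 2), "always": Fraction(3, 10)}

def f_x(u_x):
    return u_x                                   # equation for X

def f_y(x, u_y):
    return {"never": 0, "helped": x, "always": 1}[u_y]   # equation for Y

# Joint distribution over (X, Y(0), Y(1)), obtained by solving the
# submodels M_0 and M_1 for every unit and accumulating P(u).
science = defaultdict(Fraction)
for (u_x, p_x), (u_y, p_y) in product(P_UX.items(), P_UY.items()):
    x, y0, y1 = f_x(u_x), f_y(0, u_y), f_y(1, u_y)
    science[(x, y0, y1)] += p_x * p_y

p_y1 = sum(p for (x, y0, y1), p in science.items() if y1 == 1)
p_benefit = sum(p for (x, y0, y1), p in science.items() if y0 == 0 and y1 == 1)
p_y1_given_x1 = (sum(p for (x, y0, y1), p in science.items() if x == 1 and y1 == 1)
                 / sum(p for (x, y0, y1), p in science.items() if x == 1))
print(p_y1, p_benefit, p_y1_given_x1)
```

In this toy model P(Y(1)=1 | X=1) equals P(Y(1)=1), so ignorability holds; it was not postulated but follows from the assumed independence of the two background variables.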

However, it recently occurred to me that the conceptual significance of this definition is not fully understood among causal analysts, not only among “potential outcome” enthusiasts, but also among structural equations researchers who practice causal analysis in the tradition of Sewall Wright, O.D. Duncan, and Trygve Haavelmo. Commenting on the flood of methods and results that emerge from this simple definition, some writers view it as a mathematical gimmick that, while worthy of attention, needs to be regarded with suspicion. Others labeled it “an approach” that needs to be considered together with “other approaches” to causal reasoning, but not as a definition that justifies and unifies those other approaches.

Even authors who advocate a symbiotic approach to causal inference — graphical and counterfactuals — occasionally fail to realize that the definition above provides the logic for any such symbiosis, and that it constitutes in fact the semantical basis for the potential-outcome framework.

I will start by addressing the non-statisticians among us; i.e., economists, social scientists, psychometricians, epidemiologists, geneticists, meteorologists, environmental scientists and more, namely, empirical scientists who have been trained to build models of reality to assist in analyzing data that reality generates. To these readers I want to assure that, in talking about model M, I am not talking about a newly invented mathematical object, but about your favorite and familiar model that has served as your faithful oracle and guiding light since college days, the one that has kept you cozy and comfortable whenever data misbehaved. Yes, I am talking about the equation

that you put down when your professor asked: How would household spending vary with income, or, how would earning increase with education, or how would cholesterol level change with diet, or how would the length of the spring vary with the weight that loads it. In short, I am talking about innocent equations that describe what we assume about the world. They now call them “structural equations” or SEM in order not to confuse them with regression equations, but that does not make them more of a mystery than apple pie or pickled herring. Admittedly, they are a bit mysterious to statisticians, because statistics textbooks rarely acknowledge their existence [Historians of statistics, take notes!] but, otherwise, they are the most common way of expressing our perception of how nature operates: A society of equations, each describing what nature listens to before determining the value it assigns to each variable in the domain.

Why am I elaborating on this perception of nature? To allay any fears that what is put into M is some magical super-smart algorithm that computes counterfactuals to impress the novice, or to spitefully prove that potential outcomes need no SUTVA, nor manipulation, nor missing data imputation; M is none other than your favorite model of nature and, yet, please bear with me, this tiny model is capable of generating, on demand, all conceivable counterfactuals: Y(0), Y(1), Y_x, Y_{127}, X_z, Z(X(y)) etc. on and on. Moreover, every time you compute these potential outcomes using Eq. (1) they will obey the consistency rule, and their probabilities will obey the laws of probability calculus and the graphoid axioms. And, if your model justifies “ignorability” or “conditional ignorability,” these too will be respected in the generated counterfactuals. In other words, ignorability conditions need not be postulated as auxiliary constraints to justify the use of available statistical methods; no, they are derivable from your own understanding of how nature operates.

In short, it is a miracle.

Not really! It should be self-evident. Counterfactuals must be built on the familiar if we wish to explain why people communicate with counterfactuals starting at age 4 (“Why is it broken?” “Let’s pretend we can fly”). The same applies to science; scientists have communicated with counterfactuals for hundreds of years, even though the notation and mathematical machinery needed for handling counterfactuals were made available to them only in the 20th century. This means that the conceptual basis for a logic of counterfactuals resides already within the scientific view of the world, and need not be crafted from scratch; it need not divorce itself from the scientific view of the world. It surely should not divorce itself from scientific knowledge, which is the source of all valid assumptions, or from the format in which scientific knowledge is stored, namely, SEM.

Here I am referring to people who claim that potential outcomes are not explicitly represented in SEM, and explicitness is important. First, this is not entirely true. I can see (Y(0), Y(1)) in the SEM graph as explicitly as I see whether ignorability holds there or not. [See, for example, Fig. 11.7, page 343 in Causality]. Second, once we accept SEM as the origin of potential outcomes, as defined by Eq. (1), counterfactual expressions can enter our mathematics proudly and explicitly, with all the inferential machinery that the First Law dictates. Third, consider by analogy the teaching of calculus. It is feasible to teach calculus as a stand-alone symbolic discipline without ever mentioning the fact that y'(x) is the slope of the function y=f(x) at point x. It is feasible, but not desirable, because it is helpful to remember that f(x) comes first, and all other symbols of calculus, e.g., f'(x), f''(x), [f(x)/x]', etc. are derivable from one object, f(x). Likewise, all the rules of differentiation are derived from interpreting y'(x) as the slope of y=f(x).

Where am I heading?
First, I would have liked to convince potential outcome enthusiasts that they are doing harm to their students by banning structural equations from their discourse, thus denying them awareness of the scientific basis of potential outcomes. But this attempted persuasion has been going on for the past two decades and, judging by the recent exchange with Guido Imbens (link), we are not closer to an understanding than we were in 1995. Even an explicit demonstration of how a toy problem would be solved in the two languages (link) did not yield any result.

Second, I would like to call the attention of SEM practitioners, including of course econometricians, quantitative psychologists and political scientists, and explain the significance of Eq. (1) in their fields. To them, I wish to say: If you are familiar with SEM, then you have all the mathematical machinery necessary to join the ranks of modern causal analysis; your SEM equations (hopefully in nonparametric form) are the engine for generating and understanding counterfactuals. True, your teachers did not alert you to this capability; it is not their fault, they did not know of it either. But you can now take advantage of what the First Law of causal inference tells you. You are sitting on a gold mine; use it.

Finally, I would like to reach out to authors of traditional textbooks who wish to introduce a chapter or two on modern methods of causal analysis. I have seen several books that devote 10 chapters to the SEM framework: identification, structural parameters, confounding, instrumental variables, selection models, exogeneity, model misspecification, etc., and then add a chapter to introduce potential outcomes and cause-effect analyses as useful newcomers, yet alien to the rest of the book. This leaves students to wonder whether the first 10 chapters were worth the labor. Eq. (1) tells us that modern tools of causal analysis are not newcomers, but follow organically from the SEM framework. Consequently, one can leverage the study of SEM to make causal analysis more palatable and meaningful.

Please note that I have not mentioned graphs in this discussion; the reason is simple, graphical modeling constitutes The Second Law of Causal Inference.

Enjoy both,

November 9, 2014

Causal inference without graphs

Filed under: Counterfactual,Discussion,Economics,General — moderator @ 3:45 am

In a recent posting on this blog, Elias and Bryant described how graphical methods can help decide if a pseudo-randomized variable, Z, qualifies as an instrumental variable, namely, if it satisfies the exogeneity and exclusion requirements associated with the definition of an instrument. In this note, I aim to describe how inferences of this type can be performed without graphs, using the language of potential outcomes. This description should give students of causality an objective comparison of graph-less vs. graph-based inferences. See my exchange with Guido Imbens [here].

Every problem of causal inference must commence with a set of untestable, theoretical assumptions that the modeler is prepared to defend on scientific grounds. In structural modeling, these assumptions are encoded in a causal graph through missing arrows and missing latent variables. Graphless methods encode these same assumptions symbolically, using two types of statements:

1. Exclusion restrictions, and
2. Conditional independencies among observable and potential outcomes.

For example, consider the causal Markov chain X → Y → Z, which represents the structural equations:

with the two omitted factors (one in each equation) being such that X and these factors are mutually independent.

These same assumptions can also be encoded in the language of counterfactuals, as follows:

(3) represents the missing arrow from X to Z, and (4)-(6) convey the mutual independence of X and the two omitted factors.
[Remark: General rules for translating graphical models to counterfactual notation are given in Pearl (2009, pp. 232-234).]

Assume now that we are given the four counterfactual statements (3)-(6) as a specification of a model. What machinery can we use to answer questions that typically come up in causal inference tasks? One such question is, for example: is the model testable? In other words, is there an empirical test conducted on the observed variables X, Y, and Z that could prove (3)-(6) wrong? We note that none of the four defining conditions (3)-(6) is testable in isolation, because each invokes an unmeasured counterfactual entity. On the other hand, the fact that the equivalent graphical model advertises the conditional independence of X and Z given Y, X _||_ Z | Y, implies that the combination of all four counterfactual statements should yield this testable implication.
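The testable implication X _||_ Z | Y can be checked numerically on any distribution generated by a chain X → Y → Z. The sketch below builds one such joint distribution exactly; the binary variables and every probability in it are hypothetical, chosen only to make the arithmetic visible. With real data, the same comparison would be made on estimated frequencies, with sampling error taken into account.

```python
from fractions import Fraction
from itertools import product

# Hypothetical parameters of a binary chain X -> Y -> Z.
p_x1 = Fraction(2, 5)                                   # P(X=1)
p_y1_given_x = {0: Fraction(1, 4), 1: Fraction(3, 4)}   # P(Y=1 | X=x)
p_z1_given_y = {0: Fraction(1, 10), 1: Fraction(7, 10)} # P(Z=1 | Y=y)

def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one

# Joint distribution implied by the chain.
joint = {
    (x, y, z): bern(p_x1, x) * bern(p_y1_given_x[x], y) * bern(p_z1_given_y[y], z)
    for x, y, z in product((0, 1), repeat=3)
}

def p_z1_given(y, x):
    """P(Z=1 | Y=y, X=x), computed from the joint."""
    return joint[(x, y, 1)] / (joint[(x, y, 0)] + joint[(x, y, 1)])

def p_z1_given_x(x):
    """P(Z=1 | X=x), with Y summed out."""
    num = sum(joint[(x, y, 1)] for y in (0, 1))
    den = sum(joint[(x, y, z)] for y in (0, 1) for z in (0, 1))
    return num / den

# Within each stratum of Y, X carries no further information about Z ...
conditionally_independent = all(p_z1_given(y, 0) == p_z1_given(y, 1) for y in (0, 1))
# ... although X and Z are dependent marginally.
marginally_dependent = p_z1_given_x(0) != p_z1_given_x(1)
print(conditionally_independent, marginally_dependent)
```

A data set in which P(z | y, x) varied with x beyond sampling error would refute the chain, and with it the conjunction of the four counterfactual statements.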

Another question often posed to causal inference is that of identifiability, for example, whether the causal effect of X on Z is estimable from observational studies.

Whereas graphical models enjoy inferential tools such as d-separation and do-calculus, potential-outcome specifications can use the axioms of counterfactual logic (Galles and Pearl, 1998; Halpern, 1998) to determine identification and testable implications. In a recent paper, I have combined the graphoid and counterfactual axioms to provide such symbolic machinery (link).

However, the aim of this note is not to teach potential outcome researchers how to derive the logical consequences of their assumptions but, rather, to give researchers the flavor of what these derivations entail, and the kind of problems the potential outcome specification presents vis-à-vis the graphical representation.

As most of us would agree, the chain appears more friendly than the 4 equations in (3)-(6), and the reasons are both representational and inferential. On the representational side we note that it would take a person (even an expert in potential outcomes) a pause or two to affirm that (3)-(6) indeed represent the chain process he/she has in mind. More specifically, it would take a pause or two to check if some condition is missing from the list, or whether one of the conditions listed is redundant (i.e., follows logically from the other three) or whether the set is consistent (i.e., no statement has its negation follow from the other three). These mental checks are immediate in the graphical representation; the first, because each link in the graph corresponds to a physical process in nature, and the last two because the graph is inherently consistent and non-redundant. As to the inferential part, using the graphoid+counterfactual axioms as inference rules is computationally intractable. These axioms are good for confirming a derivation if one is proposed, but not for finding a derivation when one is needed.

I believe that even a cursory attempt to answer research questions using (3)-(6) would convince the reader of the merits of the graphical representation. However, the reader of this blog is already biased, having been told that (3)-(6) is the potential-outcome equivalent of the chain X—>Y—>Z. A deeper appreciation can be reached by examining a new problem, specified in potential-outcome vocabulary, but without its graphical mirror.

Assume you are given the following statements as a specification.

It represents a familiar model in causal analysis that has been thoroughly analyzed. To appreciate the power of graphs, the reader is invited to examine the representation above and to answer a few questions:

a) Is the process described familiar to you?
b) Which assumptions are you willing to defend in your interpretation of the story?
c) Is the causal effect of X on Y identifiable?
d) Is the model testable?

I would be eager to hear from readers:
1. if my comparison is fair.
2. which argument they find most convincing.
