Causal Analysis in Theory and Practice

January 29, 2020

On Imbens’s Comparison of Two Approaches to Empirical Economics

Filed under: Counterfactual,d-separation,DAGs,do-calculus,Imbens — judea @ 11:00 pm

Many readers have asked for my reaction to Guido Imbens’s recent paper, titled, “Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics,” arXiv.19071v1 [stat.ME] 16 Jul 2019.

The note below offers brief comments on Imbens’s five major claims regarding the superiority of potential outcomes [PO] vis a vis directed acyclic graphs [DAGs].

These five claims are articulated in Imbens’s introduction (pages 1-3). [Quoting]:

” … there are five features of the PO framework that may be behind its current popularity in economics.”

I will address them sequentially, first quoting Imbens’s claims, then offering my counterclaims.

I will end with a comment on Imbens’s final observation, concerning the absence of empirical evidence in a “realistic setting” to demonstrate the merits of the DAG approach.

Before we start, however, let me clarify that there is no such thing as a “DAG approach.” Researchers using DAGs follow an approach called Structural Causal Model (SCM), which consists of functional relationships among variables of interest, and of which DAGs are merely a qualitative abstraction, spelling out the arguments in each function. The resulting graph can then be used to support inference tools such as d-separation and do-calculus. Potential outcomes are relationships derived from the structural model and several of their properties can be elucidated using DAGs. These interesting relationships are summarized in chapter 7 of (Pearl, 2009a) and in a Statistical Survey overview (Pearl, 2009c)

Imbens’s Claim # 1
“First, there are some assumptions that are easily captured in the PO framework relative to the DAG approach, and these assumptions are critical in many identification strategies in economics. Such assumptions include
monotonicity ([Imbens and Angrist, 1994]) and other shape restrictions such as convexity or concavity ([Matzkin et al.,1991, Chetverikov, Santos, and Shaikh, 2018, Chen, Chernozhukov, Fernández-Val, Kostyshak, and Luo, 2018]). The instrumental variables setting is a prominent example, and I will discuss it in detail in Section 4.2.”

Pearl’s Counterclaim # 1
It is logically impossible for an assumption to be “easily captured in the PO framework” and not simultaneously be “easily captured” in the “DAG approach.” The reason is simply that the latter embraces the former and merely enriches it with graph-based tools. Specifically, SCM embraces the counterfactual notation Y_x that PO deploys, and does not exclude any concept or relationship definable in the PO approach.

Take monotonicity, for example. In PO, monotonicity is expressed as

Y_x (u) ≥ Y_x’ (u) for all u and all x > x’

In the DAG approach it is expressed as:

Y_x (u) ≥ Y_x’ (u) for all u and all x > x’

(Taken from Causality pages 291, 294, 398.)

The two are identical, of course, which may seem surprising to PO folks, but not to DAG folks who know how to derive the counterfactuals Y_xfrom structural models. In fact, the derivation of counterfactuals in
terms of structural equations (Balke and Pearl, 1994) is considered one of the fundamental laws of causation in the SCM framework see (Bareinboim and Pearl, 2016) and (Pearl, 2015).

Imbens’s Claim # 2
“Second, the potential outcomes in the PO framework connect easily to traditional approaches to economic models such as supply and demand settings where potential outcome functions are the natural primitives. Related to this, the insistence of the PO approach on manipulability of the causes, and its attendant distinction between non-causal attributes and causal variables has resonated well with the focus in empirical work on policy relevance ([Angrist and Pischke, 2008, Manski, 2013]).”

Pearl’s Counterclaim #2
Not so. The term “potential outcome” is a late comer to the economics literature of the 20th century, whose native vocabulary and natural primitives were functional relationships among variables, not potential outcomes. The latters are defined in terms of a “treatment assignment” and hypothetical outcome, while the formers invoke only observable variables like “supply” and “demand”. Don Rubin cited this fundamental difference as sufficient reason for shunning structural equation models, which he labeled “bad science.”

While it is possible to give PO interpretation to structural equations, the interpretation is both artificial and convoluted, especially in view of PO insistence on manipulability of causes. Haavelmo, Koopman and Marschak would not hesitate for a moment to write the structural equation:

Damage = f (earthquake intensity, other factors).

PO researchers, on the other hand, would spend weeks debating whether earthquakes have “treatment assignments” and whether we can legitimately estimate the “causal effects” of earthquakes. Thus, what Imbens perceives as a helpful distinction is, in fact, an unnecessary restriction that suppresses natural scientific discourse. See also (Pearl, 2018; 2019).

Imbens’s Claim #3
“Third, many of the currently popular identification strategies focus on models with relatively few (sets of) variables, where identification questions have been worked out once and for all.”

Pearl’s Counterclaim #3

First, I would argue that this claim is actually false. Most IV strategies that economists use are valid “conditional on controls” (see examples listed in Imbens (2014)) and the criterion that distinguishes “good controls” from “bad controls” is not trivial to articulate without the help of graphs. (See, A Crash Course in Good and Bad Control). It can certainly not be discerned “once and for all”.

Second, even if economists are lucky to guess “good controls,” it is still unclear whether they focus on relatively few variables because, lacking graphs, they cannot handle more variables, or do they refrain from using graphs to hide the opportunities missed by focusing on few pre-fabricated, “once and for all” identification strategies.

I believe both apprehensions play a role in perpetuating the graph-avoiding subculture among economists. I have elaborated on this question here: (Pearl, 2014).

Imbens’s Claim # 4
“Fourth, the PO framework lends itself well to accounting for treatment effect heterogeneity in estimands ([Imbens and Angrist, 1994, Sekhon and Shem-Tov, 2017]) and incorporating such heterogeneity in estimation and the design of optimal policy functions ([Athey and Wager, 2017, Athey, Tibshirani, Wager, et al., 2019, Kitagawa and Tetenov, 2015]).”

Pearl’s Counterclaim #4
Indeed, in the early 1990s, economists felt ecstatic liberating themselves from the linear tradition of structural equation models and finding a framework (PO) that allowed them to model treatment effect heterogeneity.

However, whatever role treatment heterogeneity played in this excitement should have been amplified ten-fold in 1995, when completely non parametric structural equation models came into being, in which non-linear interactions and heterogeneity were assumed a priori. Indeed, the tools developed in the econometric literature cover only a fraction of the treatment-heterogeneity tasks that are currently managed by SCM. In particular, the latter includes such problems as “necessary and sufficient” causation, mediation, external validity, selection bias and more.

Speaking more generally, I find it odd for a discipline to prefer an “approach” that rejects tools over one that invites and embraces tools.

Imbens’s claim #5
“Fifth, the PO approach has traditionally connected well with design, estimation, and inference questions. From the outset Rubin and his coauthors provided much guidance to researchers and policy makers for practical implementation including inference, with the work on the propensity score ([Rosenbaum and Rubin, 1983b]) an influential example.”

Pearl’s Counterclaim #5
The initial work of Rubin and his co-authors has indeed provided much needed guidance to researchers and policy makers who were in a state of desperation, having no other mathematical notation to express causal questions of interest. That happened because economists were not aware of the counterfactual content of structural equation models, and of the non-parametric extension of those models.

Unfortunately, the clumsy and opaque notation introduced in this initial work has become a ritual in the PO framework that has prevailed, and the refusal to commence the analysis with meaningful assumptions has led to several blunders and misconceptions. One such misconception has been propensity score analysis which researchers have taken as a tool for reducing confounding bias. I have elaborated on this misguidance in Causality, Section 11.3.5, “Understanding Propensity Scores” (Pearl, 2009a).

Imbens’s final observation: Empirical Evidence
“Separate from the theoretical merits of the two approaches, another reason for the lack of adoption in economics is that the DAG literature has not shown much evidence of the benefits for empirical practice in settings that are important in economics. The potential outcome studies in MACE, and the chapters in [Rosenbaum, 2017], CISSB and MHE have detailed empirical examples of the various identification strategies proposed. In realistic settings they demonstrate the merits of the proposed methods and describe in detail the corresponding estimation and inference methods. In contrast in the DAG literature, TBOW, [Pearl, 2000], and [Peters, Janzing, and Schölkopf, 2017] have no substantive empirical examples, focusing largely on identification questions in what TBOW refers to as “toy” models. Compare the lack of impact of the DAG literature in economics with the recent embrace of regression discontinuity designs imported from the psychology literature, or with the current rapid spread of the machine learning methods from computer science, or the recent quick adoption of synthetic control methods [Abadie, Diamond, and Hainmueller, 2010]. All came with multiple concrete examples that highlighted their benefits over traditional methods. In the absence of such concrete examples the toy models in the DAG literature sometimes appear to be a set of solutions in search of problems, rather than a set of solutions for substantive problems previously posed in social sciences.”

Pearl’s comments on: Empirical Evidence
There is much truth to Imbens’s observation. The PO excitement that swept natural experimentalists in the 1990s came with outright rejection of graphical models. The hundreds, if not thousands, of empirical economists who plunged into empirical work, were warned repeatedly that graphical models may be “ill-defined,” “deceptive,” and “confusing,” and structural models have no scientific underpinning (see (Pearl, 1995; 2009b)). Not a single paper in the econometric literature has acknowledged the existence of SCM as an alternative or complementary approach to PO.

The result has been the exact opposite of what has taken place in epidemiology where DAGs became a second language to both scholars and field workers, [Due in part to the influential 1999 paper by Greenland, Pearl and Robins.] In contrast, PO-led economists have launched a massive array of experimental programs lacking graphical tools for guidance. I would liken it to a Phoenician armada exploring the Atlantic coast in leaky boats and no compass to guide its way.

This depiction might seem pretentious and overly critical, considering the pride with which natural experimentalists take in the results of their studies (though no objective verification of validity can be undertaken.) Yet looking back at the substantive empirical examples listed by Imbens, one cannot but wonder how much more credible those studies could have been with graphical tools to guide the way. These include a friendly language to communicate assumptions, powerful means to test their implications, and ample opportunities to uncover new natural experiments (Brito and Pearl, 2002).

Summary and Recommendation

The thrust of my reaction to Imbens’s article is simple:

It is unreasonable to prefer an “approach” that rejects tools over one that invites and embraces tools.

Technical comparisons of the PO and SCM approaches, using concrete examples, have been published since 1993 in dozens of articles and books in computer science, statistics, epidemiology, and social science, yet none in the econometric literature. Economics students are systematically deprived of even the most elementary graphical tools available to other researchers, for example, to determine if one variable is independent of another given a third, or if a variable is a valid IV given a set S of observed variables.

This avoidance can no longer be justified by appealing to “We have not found this [graphical] approach to aid the drawing of causal inferences” (Imbens and Rubin, 2015, page 25).

To open an effective dialogue and a genuine comparison between the two approaches, I call on Professor Imbens to assume leadership in his capacity as Editor in Chief of Econometrica and invite a comprehensive survey paper on graphical methods for the front page of his Journal. This is how creative editors move their fields forward.

References
Balke, A. and Pearl, J. “Probabilistic Evaluation of Counterfactual Queries,” In Proceedings of the Twelfth National Conference on Artificial Intelligence, Seattle, WA, Volume I, 230-237, July 31 – August 4, 1994.

Brito, C. and Pearl, J. “General instrumental variables,” In A. Darwiche and N. Friedman (Eds.), Uncertainty in Artificial Intelligence, Proceedings of the Eighteenth Conference, Morgan Kaufmann: San Francisco, CA, 85-93, August 2002.

Bareinboim, E. and Pearl, J. “Causal inference and the data-fusion problem,” Proceedings of the National Academy of Sciences, 113(27): 7345-7352, 2016.

Greenland, S., Pearl, J., and Robins, J. “Causal diagrams for epidemiologic research,” Epidemiology, Vol. 1, No. 10, pp. 37-48, January 1999.

Imbens, G. “Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics,” arXiv.19071v1 [stat.ME] 16 Jul 2019.

Imbens, G. and Rubin, D. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge, MA: Cambridge University Press; 2015.

Imbens, Guido W. Instrumental Variables: An Econometrician’s Perspective. Statist. Sci. 29 (2014), no. 3, 323–358. doi:10.1214/14-STS480. https://projecteuclid.org/euclid.ss/1411437513

Pearl, J. “Causal diagrams for empirical research,” (With Discussions), Biometrika, 82(4): 669-710, 1995.

Pearl, J. “Understanding Propensity Scores” in J. Pearl’s Causality: Models, Reasoning, and Inference, Section 11.3.5, Second edition, NY: Cambridge University Press, pp. 348-352, 2009a.

Pearl, J. “Myth, confusion, and science in causal analysis,” University of California, Los Angeles, Computer Science Department, Technical Report R-348, May 2009b.

Pearl, J. “Causal inference in statistics: An overview” Statistics Surveys, Vol. 3, 96–146, 2009c.

Pearl, J. “Are economists smarter than epidemiologists? (Comments on Imbens’s recent paper),” Causal Analysis in Theory and Practice Blog, October 27, 2014.

Pearl, J. “Trygve Haavelmo and the Emergence of Causal Calculus,” Econometric Theory, 31: 152-179, 2015.

Pearl, J. “Does obesity shorten life? Or is it the Soda? On non-manipulable causes,” Journal of Causal Inference, Causal, Casual, and Curious Section, 6(2), online, September 2018.

Pearl, J. “On the interpretation of do(x),” Journal of Causal Inference, Causal, Casual, and Curious Section, 7(1), online, March 2019.

Comments (3)

January 1, 2000

d-Separation Without Tears

Filed under: d-separation — moderator @ 12:10 am

Introduction

d-separation is a criterion for deciding, from a given a causal graph, whether a set X of variables is independent of another set Y, given a third set Z. The idea is to associate "dependence" with "connectedness" (i.e., the existence of a connecting path) and "independence" with "unconnected-ness" or "separation". The only twist on this simple idea is to define what we mean by "connecting path", given that we are dealing with a system of directed arrows in which some vertices (those residing in Z) correspond to measured variables, whose values are known precisely. To account for the orientations of the arrows we use the terms "d-separated" and "d-connected" (d connotes "directional").

We start by considering separation between two singleton variables, x and y; the extension to sets of variables is straightforward (i.e., two sets are separated if and only if each element in one set is separated from every element in the other).

1. Unconditional separation

Rule 1: x and y are d-connected if there is an unblocked path between them.

By a "path" we mean any consecutive sequence of edges, disregarding their directionalities. By "unblocked path" we mean a path that can be traced without traversing a pair of arrows that collide "head-to-head". In other words, arrows that meet head-to-head do not constitute a connection for the purpose of passing information, such a meeting will be called a "collider".

Example 1

This graph contains one collider, at t. The path x-r-s-t is unblocked, hence x and t are d-connected. So is also the path t-u-v-y, hence t and y are d-connected, as well as the pairs u and y, t and v, t and u, x and s etc…. However, x and y are not d-connected; there is no way of tracing a path from x to y without traversing the collider at t. Therefore, we conclude that x and y are d-separated, as well as x and v, s and u, r and u, etc. (The ramification is that the covariance terms corresponding to these pairs of variables will be zero, for every choice of model parameters).

1.2 blocking by conditioning

Motivation: When we measure a set Z of variables, and take their values as given, the conditional distribution of the remaining variables changes character; some dependent variables become independent, and some independent variables become dependent. To represent this dynamics in the graph, we need the notion of "conditional d-connectedness" or, more concretely, "d-connectedness, conditioned on a set Z of measurements".

Rule 2: x and y are d-connected, conditioned on a set Z of nodes, if there is a collider-tree path between x and y that traverses no member of Z. If no such path exists, we say that x and y are d-separated by Z, We also say then that every path between x and y is "blocked" by Z.

Example 2

Let Z be the set {r, v} (marked by circles in the figure). Rule 2 tells us that x and y are d-separated by Z, and so are also x and s, u and y, s and u etc. The path x-r-s is blocked by Z, and so are also the paths u-v-y and s-t-u. The only pairs of unmeasured nodes that remain d-connected in this example, conditioned on Z, are s and t and u and t. Note that, although t is not in Z, the path s-t-u is nevertheless blocked by Z, since t is a collider, and is blocked by Rule 1.

1.3. Conditioning on colliders

Motivation: When we measure a common effect of two independent causes, the causes becomes dependent, because finding the truth of one makes the other less likely (or "explained away"), and refuting one implies the truth of the other. This phenomenon (known as Berkson paradox, or "explaining away") requires a slightly special treatment when we condition on colliders (representing common effects) or their descendants (representing effects of common effects).

Rule 3: If a collider is a member of the conditioning set Z, or has a descendant in Z, then it no longer blocks any path that traces this collider.

Example 3

Let Z be the set {r, p} (again, marked with circles). Rule 3 tells us that s and y are d-connected by Z, because the collider at t has a descendant (p) in Z, which unblocks the path s-t-u-v-y. However, x and u are still d-separated by Z, because although the linkage at t is unblocked, the one at r is blocked by Rule 2 (since r is in Z).

This completes the definition of d-separation, and the reader is invited to try it on some more intricate graphs, such as those shown in Figure 1.3

Typical application:
Suppose we consider the regression of y on p, r and x,

y = c₁ p + c₂ r + c₃x and suppose we wish to predict which coefficient in this regression is zero. From the discussion above we can conclude immediately that c₃ is zero, because y and x are d-separated given p and r, hence the partial correlation between y and x, conditioned on p and r, must vanish. c₁ and c₂, on the other hand, will in general not be zero, as can be seen from the graph: Z={r, x} does not d-separate y from p, and Z={p, x} does not d-separate y from r.

Remark on correlated errors:
Correlated exogenous variables (or error terms) need no special treatment. These are represented by bi-directed arcs (double-arrowed) and their arrowheads are treated as any other arrowhead for the purpose of path tracing. For example, if we add to the graph above a bi-directed arc between x and t, then y and x will no longer be d-separated (by Z={r, p}), because the path x-t-u-v-y is d-connected — the collider at t is unblocked by virtue of having a descendant, p, in Z.

Comments (2)