### On the First Law of Causal Inference

In several papers and lectures I have used the rhetorical title “The First Law of Causal Inference” when referring to the structural definition of counterfactuals:

Y_x(u) = Y_{M_x}(u)    (1)

The more I talk with colleagues and students, the more I am convinced that the equation deserves the title. In this post, I will explain why.

As many readers of Causality (Ch. 7) would recognize, Eq. (1) defines the potential outcome, or counterfactual, Y_x(u) in terms of a structural equation model M and a submodel, M_x, in which the equation determining X is replaced by a constant X=x. Computationally, the definition is straightforward. It says that, if you want to compute the counterfactual Y_x(u), namely, to predict the value that Y would take had X been x (in unit U=u), all you need to do is, first, mutilate the model by replacing the equation for X with X=x and, second, solve for Y. What you get IS the counterfactual Y_x(u). Nothing could be simpler.
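The two-step recipe (mutilate, then solve) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-variable model, its functional forms, and the helper names `solve`, `mutilate` and `counterfactual` are all hypothetical and do not come from Causality.

```python
# Hypothetical toy model M (illustrative only):
#   Z = u_z
#   X = 1 if Z + u_x > 1 else 0
#   Y = 2*X + Z + u_y

def solve(model, u):
    """Solve a recursive model by evaluating its equations in causal order."""
    values = {}
    for name, equation in model.items():
        values[name] = equation(values, u)
    return values

M = {
    "Z": lambda v, u: u["z"],
    "X": lambda v, u: 1 if v["Z"] + u["x"] > 1 else 0,
    "Y": lambda v, u: 2 * v["X"] + v["Z"] + u["y"],
}

def mutilate(model, var, value):
    """Build the submodel M_x: the equation for `var` becomes the constant `value`."""
    submodel = dict(model)  # keeps the original evaluation order
    submodel[var] = lambda v, u: value
    return submodel

def counterfactual(model, outcome, var, value, u):
    """Y_x(u): solve the mutilated model M_x for `outcome` in unit u."""
    return solve(mutilate(model, var, value), u)[outcome]

u = {"z": 1, "x": 0.5, "y": 0}                # one unit
actual = solve(M, u)                          # what actually happens in unit u
y_x0 = counterfactual(M, "Y", "X", 0, u)      # Y had X been 0
y_x1 = counterfactual(M, "Y", "X", 1, u)      # Y had X been 1
print(actual, y_x0, y_x1)
```

For this unit X is actually 1, so `y_x1` coincides with the actual Y, while `y_x0` is the value Y would have taken under X=0.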

So, why is it so “fundamental”? Because from this definition we can also get probabilities on counterfactuals (once we assign probabilities, P(U=u), to the units), joint probabilities of counterfactuals and observables, conditional independencies over counterfactuals, graphical visualization of potential outcomes, and many more. [Including, of course, Rubin’s “science”, Pr(X,Y(0),Y(1))]. In short, we get everything that an astute causal analyst would ever wish to define or estimate, given that he/she is into solving serious problems in causal analysis, say policy analysis, or attribution, or mediation. Eq. (1) is “fundamental” because everything that can be said about counterfactuals can also be derived from this definition.
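To make the step from units to probabilities concrete, here is a minimal sketch that assigns P(U=u) to the units of a two-equation model and tabulates the joint distribution Pr(X, Y(0), Y(1)). The model (X = U1, Y = X XOR U2) and the probabilities are assumptions chosen for illustration, not taken from any of the cited papers.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical independent binary disturbances and their probabilities.
P_U1 = {0: Fraction(1, 2), 1: Fraction(1, 2)}
P_U2 = {0: Fraction(3, 4), 1: Fraction(1, 4)}

def f_X(u1):
    return u1          # X = U1

def f_Y(x, u2):
    return x ^ u2      # Y = X XOR U2

def potential_outcome(x, u2):
    """Y_x(u): X's equation is replaced by the constant x, then Y is solved."""
    return f_Y(x, u2)

# Push P(U=u) through the model to obtain Pr(X, Y(0), Y(1)).
joint = defaultdict(Fraction)
for (u1, p1), (u2, p2) in product(P_U1.items(), P_U2.items()):
    key = (f_X(u1), potential_outcome(0, u2), potential_outcome(1, u2))
    joint[key] += p1 * p2

for key in sorted(joint):
    print(key, joint[key])
```

Each row of the printed table is a triple (x, y0, y1) with its probability; nothing beyond the two equations and P(U) was needed to produce it.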

[See the following papers for illustration and operationalization of this definition:

http://ftp.cs.ucla.edu/pub/stat_ser/r431.pdf

http://ftp.cs.ucla.edu/pub/stat_ser/r391.pdf

http://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf

also, Causality chapter 7.]

However, it recently occurred to me that the conceptual significance of this definition is not fully understood among causal analysts, not only among “potential outcome” enthusiasts, but also among structural equations researchers who practice causal analysis in the tradition of Sewall Wright, O.D. Duncan, and Trygve Haavelmo. Commenting on the flood of methods and results that emerge from this simple definition, some writers view it as a mathematical gimmick that, while worthy of attention, needs to be treated with suspicion. Others have labeled it “an approach” that needs to be considered together with “other approaches” to causal reasoning, but not as a definition that justifies and unifies those other approaches.

Even authors who advocate a symbiotic approach to causal inference — graphical and counterfactuals — occasionally fail to realize that the definition above provides the logic for any such symbiosis, and that it constitutes in fact the semantical basis for the potential-outcome framework.

I will start by addressing the non-statisticians among us; i.e., economists, social scientists, psychometricians, epidemiologists, geneticists, meteorologists, environmental scientists and more, namely, empirical scientists who have been trained to build models of reality to assist in analyzing data that reality generates. I want to assure these readers that, in talking about model M, I am not talking about a newly invented mathematical object, but about your favorite and familiar model that has served as your faithful oracle and guiding light since college days, the one that has kept you cozy and comfortable whenever data misbehaved. Yes, I am talking about the equation

that you put down when your professor asked: How would household spending vary with income, or how would earnings increase with education, or how would cholesterol level change with diet, or how would the length of the spring vary with the weight that loads it. In short, I am talking about innocent equations that describe what we assume about the world. They now call them “structural equations” or SEM in order not to confuse them with regression equations, but that does not make them more of a mystery than apple pie or pickled herring. Admittedly, they are a bit mysterious to statisticians, because statistics textbooks rarely acknowledge their existence [Historians of statistics, take notes!] but, otherwise, they are the most common way of expressing our perception of how nature operates: A society of equations, each describing what nature listens to before determining the value it assigns to each variable in the domain.

Why am I elaborating on this perception of nature? To allay any fears that what is put into M is some magical super-smart algorithm that computes counterfactuals to impress the novice, or to spitefully prove that potential outcomes need no SUTVA, nor manipulation, nor missing data imputation; M is none other than your favorite model of nature and, yet, please bear with me, this tiny model is capable of generating, on demand, all conceivable counterfactuals: Y(0), Y(1), Y_x, Y_{127}, X_z, Z(X(y)) etc. on and on. Moreover, every time you compute these potential outcomes using Eq. (1) they will obey the consistency rule, and their probabilities will obey the laws of probability calculus and the graphoid axioms. And, if your model justifies “ignorability” or “conditional ignorability,” these too will be respected in the generated counterfactuals. In other words, ignorability conditions need not be postulated as auxiliary constraints to justify the use of available statistical methods; no, they are derivable from your own understanding of how nature operates.
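The claim about the consistency rule can be checked by brute force on a small example. The sketch below assumes a hypothetical binary model with a mediator M and a disturbance U_x that affects both X and Y; the equations are arbitrary and serve only to show that consistency (if X(u)=x then Y_x(u)=Y(u)) follows from the construction rather than being imposed.

```python
from itertools import product

def solve(u, do_x=None):
    """Solve the toy model for unit u; if do_x is given, X's equation is replaced by X = do_x."""
    u_x, u_m, u_y = u
    x = u_x if do_x is None else do_x   # X listens to U_x only
    m = x ^ u_m                         # M listens to X and U_m
    y = (m & u_x) | u_y                 # Y listens to M, U_x and U_y
    return {"X": x, "M": m, "Y": y}

units = list(product([0, 1], repeat=3))  # all eight units u = (u_x, u_m, u_y)

# Consistency rule: setting X to the value it takes anyway leaves Y unchanged.
consistency_holds = all(
    solve(u, do_x=solve(u)["X"])["Y"] == solve(u)["Y"]
    for u in units
)
print(consistency_holds)
```

No auxiliary axiom was added to obtain this; the check passes for any choice of equations, because the mutilated model and the original model agree whenever X already equals x.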

In short, it is a miracle.

Not really! It should be self-evident. Counterfactuals must be built on the familiar if we wish to explain why people communicate with counterfactuals starting at age 4 (“Why is it broken?” “Let’s pretend we can fly”). The same applies to science; scientists have communicated with counterfactuals for hundreds of years, even though the notation and mathematical machinery needed for handling counterfactuals were made available to them only in the 20th century. This means that the conceptual basis for a logic of counterfactuals resides already within the scientific view of the world, and need not be crafted from scratch; it need not divorce itself from the scientific view of the world. It surely should not divorce itself from scientific knowledge, which is the source of all valid assumptions, or from the format in which scientific knowledge is stored, namely, SEM.

Here I am referring to people who claim that potential outcomes are not explicitly represented in SEM, and explicitness is important. First, this is not entirely true. I can see (Y(0), Y(1)) in the SEM graph as explicitly as I see whether ignorability holds there or not. [See, for example, Fig. 11.7, page 343 in Causality]. Second, once we accept SEM as the origin of potential outcomes, as defined by Eq. (1), counterfactual expressions can enter our mathematics proudly and explicitly, with all the inferential machinery that the First Law dictates. Third, consider by analogy the teaching of calculus. It is feasible to teach calculus as a stand-alone symbolic discipline without ever mentioning the fact that y'(x) is the slope of the function y=f(x) at point x. It is feasible, but not desirable, because it is helpful to remember that f(x) comes first, and all other symbols of calculus, e.g., f'(x), f''(x), [f(x)/x]', etc. are derivable from one object, f(x). Likewise, all the rules of differentiation are derived from interpreting y'(x) as the slope of y=f(x).

Where am I heading?

First, I would have liked to convince potential outcome enthusiasts that they are doing harm to their students by banning structural equations from their discourse, thus denying them awareness of the scientific basis of potential outcomes. But this attempted persuasion has been going on for the past two decades and, judging by the recent exchange with Guido Imbens (link), we are not closer to an understanding than we were in 1995. Even an explicit demonstration of how a toy problem would be solved in the two languages (link) did not yield any result.

Second, I would like to call the attention of SEM practitioners, including of course econometricians, quantitative psychologists and political scientists, and explain the significance of Eq. (1) in their fields. To them, I wish to say: If you are familiar with SEM, then you have all the mathematical machinery necessary to join the ranks of modern causal analysis; your SEM equations (hopefully in nonparametric form) are the engine for generating and understanding counterfactuals. True, your teachers did not alert you to this capability; it is not their fault, they did not know of it either. But you can now take advantage of what the First Law of causal inference tells you. You are sitting on a gold mine; use it.

Finally, I would like to reach out to authors of traditional textbooks who wish to introduce a chapter or two on modern methods of causal analysis. I have seen several books that devote 10 chapters to the SEM framework: identification, structural parameters, confounding, instrumental variables, selection models, exogeneity, model misspecification, etc., and then add a chapter to introduce potential outcomes and cause-effect analyses as useful newcomers, yet alien to the rest of the book. This leaves students to wonder whether the first 10 chapters were worth the labor. Eq. (1) tells us that modern tools of causal analysis are not newcomers, but follow organically from the SEM framework. Consequently, one can leverage the study of SEM to make causal analysis more palatable and meaningful.

Please note that I have not mentioned graphs in this discussion; the reason is simple: graphical modeling constitutes The Second Law of Causal Inference.

Enjoy both,

Judea

[Comment from Hernando Casas regarding Rubin’s definition of “the science,” original text removed due to abusive language]

Comment by Hernando Casas — December 2, 2014 @ 5:03 am

Dear Hernando,

While I share some of your frustrations with Judea’s writing style, especially on his blog, I disagree with your reaction. Having known (and disagreed with) Judea for almost twenty years I can attest to him being a charming person, although sometimes his attempts at wit get the better of him. However, this is not uncommon among the greats in this area, going back to R. A. Fisher himself, and I am willing to cut Judea some slack here, given his great contributions to the study of causality. I also appreciate the fact that Judea puts in the effort to run the blog and makes it open to discussion. Let us not abuse his gracious hospitality.

On the substance, I think the “First Law of Causal Inference” is another pearl of wisdom. I never thought it merited a special label, but if Judea wishes to adorn it as such, that is fine with me. Doing so does not change what I would do. When Judea writes: “Eq. (1) is “fundamental” because everything that can be said about counterfactuals can also be derived from this definition,” I don’t really know what he means. In the way I think about these problems (I hesitate to use the word “approach” because Judea has made that a loaded term), I start with the potential outcomes, not with equation (1). I guess that makes me a “potential outcome enthusiast” in Judea’s world, in other words, someone who does not see equation (1) as “a definition that justifies and unifies those other approaches.” So let it be.

I do agree with Judea’s point about the textbooks. There is no need to wait till chapter 11 to introduce potential outcomes and causality. If I were to write an econometric textbook, potential outcomes would be upfront in chapter 1.

Comment by guido imbens — December 3, 2014 @ 12:13 am

Dear Hernando,

I think we are victims of a terrible misunderstanding.

When I wrote: Rubin’s “science”, I was referring to a probability distribution that Don Rubin labeled “The science” in several of his papers; it is the joint probability of covariates X, the treatment variable W and the potential outcomes Y(0) and Y(1), written Pr(W,X,Y(0),Y(1)). My humble claim was that I find it to be a miracle that this horrible looking joint probability, on a mixture of observables and hypothetical variables, Pr(W,X,Y(0),Y(1)), is encoded safely and meaningfully in our tiny and familiar structural equation model M.

I think your frustration with my writing style, and correct me if I am wrong, stems from assuming that the phrase “Rubin’s “science”” was meant to mock Rubin’s work. No way! All those who know me and my writings also know how much I admire Don’s contributions to statistics. The phrase “Rubin’s “science”” refers simply to the joint distribution Pr(W,X,Y(0),Y(1)) which Don Rubin (and Guido too, if I am not mistaken) labeled “The Science”.

Now to substance.

If scientists like Rubin and Imbens call Pr(W,X,Y(0),Y(1)) “The Science”, and if I can derive this distribution simply, and upon demand, from a 3-variable structural equation model, am I not justified in calling it “a miracle”? And ain’t I justified in calling it “The First Law of Causal Inference”?

So, why am I an arrogant jerk if I bring this breathtaking realization to the attention of people who are not aware of this way of defining potential outcomes? I am generally very excited about my work and I naturally assume that my readers are as excited whenever new tools are brought to their attention.

Aren’t you?

Just think about the last causal inference problem you had, where you had to decide if this terrible looking independence holds:

W _||_ {Y(0),Y(1)} | X

Now imagine someone telling you: Hey, Hernando, no need to sweat, just put down the equations that connect X, W and Y (all are observables) and you will be able to tell immediately, even without the form of the equations, whether this independence holds or not. Wouldn’t you be happy?

That is why I am happy,

Join me,

Judea

Comment by judea pearl — December 3, 2014 @ 1:45 am

Dear Guido,

Thanks for defending my character; I am sure one day you will appreciate the substance too.

When I read your advice: “If I were to write an econometric textbook, potential outcomes would be upfront in chapter 1.” I was about to say: Why not? There are many roads to Rome. But I paused.

I paused because I saw a glaring asymmetry in our communication.

I have been explaining at length the computational and cognitive advantages of starting with M and defining potential outcomes from M.

You expressed preference to doing it the other way around, with no explanation. Can you back up your preference with some tangible advantages?

I hope you do,

Judea

Comment by judea pearl — December 3, 2014 @ 2:13 am

Dear Judea and Hernando,

“I am sure one day you will appreciate the substance too.’’ I was always under the impression that even if I did not always agree with computer scientists on causality, that at least they were good at prediction. Now I am not even sure about that anymore!

I do think you need to be honest about your writing and that it is intended to subtly mock people you disagree with. Even now you write that Don and I label the joint distribution “The Science.’’ I don’t know where you get that from. In our book we write about “the science.’’ Capitalizing “science’’ changes the tone. It is not capitalized in the book, and that is done for a reason. It’s not my style. Similarly I don’t personally like statements like “The First Law of Causal Inference’’ and agree with Hernando that it rubs people the wrong way. Being enthusiastic about your own work is one thing (and a good thing!). Accusing others of “doing harm to their students’’ is not necessary, and if you do so you should not be surprised if people take offense.

Re the last part. You write to Hernando “ just put down the equations that connect X, W and Y (all are observables)’’ Therein lies the rub. I think it is hard to write down that set of equations (“no need to sweat’’???) and prefer starting with the potential outcomes and think about the joint distribution of the potential outcomes and X. I think that is easier and less likely to lead to mistakes. In economics we often think about agents making choices/decisions based on (perceptions of) potential outcomes, which leads naturally to those formulations. Again, in my world view many roads lead to Rome, and if you want to do things differently that is fine, but I do not find your road as sweat-free as you make it out to be.

Sincerely,

Guido Imbens

Comment by guido imbens — December 3, 2014 @ 3:29 pm

Dear Guido,

I think we are finally converging towards a substantive discussion of the real issue. We have two concepts of “science” which are now displayed before us explicitly. Let us call them Science-1 and Science-2.

In Science-1 we have 2 or 3 structural equations, like

Z=h(U1)

X=f(Z,U2)

Y=g(X,U3)

In Science-2 we have the joint distribution of X, Z and the potential outcomes Y(0) Y(1):

Pr(X,Y(1), Y(0), Z)

I have already posted two pages on why Science-1 is computationally and cognitively more suitable for causal inference and, by extension, more suitable to start econometric textbooks with.

You now have a golden opportunity to leverage the level of concreteness that we have achieved and show why you “prefer starting with the potential outcomes and think about the joint distribution of the potential outcomes and X.” (quoted from your last posting).

For example, you can tell us how you represent Pr() formally (if you do) or, if you do not represent it explicitly, how you use a mental representation of it to decide on its properties, for example, whether the following ignorability conditions hold in Pr():

X_||_{Y(0),Y(1)}

X_||_{Y(0),Y(1)} | Z

Z_||_{Y(0),Y(1)} | X

Z_||_{Y(0),Y(1)}

Again, I hope you would tell us WHY you prefer one science over the other, not merely that you happen to prefer it, or that “I think that is easier and less likely to lead to mistakes” [your quote]. Moreover, if you want to invoke “agents making choices/decision based on (perception of) potential outcomes”, go ahead, add those agents to Science-2 and proceed. But, eventually, we need to hear how you reason about Pr(), and how you go about confirming or dis-confirming ignorability conditions such as those above, because no inference can proceed without such conditions.

The floor is yours,

Judea

Comment by judea pearl — December 4, 2014 @ 12:57 am
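For readers who want to see how the four ignorability conditions listed in the comment above can be read off a Science-1 specification, here is a minimal sketch. It assumes binary variables, mutually independent disturbances with made-up probabilities, and XOR equations standing in for h, f and g; the second model, in which Y also listens to Z, is a hypothetical variant added for contrast and is not part of the comment.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Hypothetical, mutually independent binary disturbances U1, U2, U3.
P_U = [
    {0: Fraction(1, 2), 1: Fraction(1, 2)},
    {0: Fraction(2, 3), 1: Fraction(1, 3)},
    {0: Fraction(3, 4), 1: Fraction(1, 4)},
]

def joint(g):
    """Distribution of (X, Z, Y(0), Y(1)) induced by Z=h(U1), X=f(Z,U2), Y=g(X,Z,U3)."""
    dist = defaultdict(Fraction)
    for u1, u2, u3 in product([0, 1], repeat=3):
        p = P_U[0][u1] * P_U[1][u2] * P_U[2][u3]
        z = u1                               # Z = h(U1)
        x = z ^ u2                           # X = f(Z, U2)
        y0, y1 = g(0, z, u3), g(1, z, u3)    # X's equation replaced by X=0 and X=1
        dist[(x, z, y0, y1)] += p
    return dist

def marginal(dist, idx):
    out = defaultdict(Fraction)
    for key, p in dist.items():
        out[tuple(key[i] for i in idx)] += p
    return out

def indep(dist, a, b, c=()):
    """Test A _||_ B | C via P(a,b,c) P(c) == P(a,c) P(b,c) for all value combinations."""
    p_abc, p_c = marginal(dist, a + b + c), marginal(dist, c)
    p_ac, p_bc = marginal(dist, a + c), marginal(dist, b + c)
    return all(
        p_abc[va + vb + vc] * p_c[vc] == p_ac[va + vc] * p_bc[vb + vc]
        for va in product([0, 1], repeat=len(a))
        for vb in product([0, 1], repeat=len(b))
        for vc in product([0, 1], repeat=len(c))
    )

X, Z, Y0, Y1 = 0, 1, 2, 3   # positions inside each key of the joint

def four_conditions(dist):
    po = (Y0, Y1)
    return [
        indep(dist, (X,), po),          # X _||_ {Y(0),Y(1)}
        indep(dist, (X,), po, (Z,)),    # X _||_ {Y(0),Y(1)} | Z
        indep(dist, (Z,), po, (X,)),    # Z _||_ {Y(0),Y(1)} | X
        indep(dist, (Z,), po),          # Z _||_ {Y(0),Y(1)}
    ]

as_posted = four_conditions(joint(lambda x, z, u3: x ^ u3))      # Y = g(X, U3)
variant = four_conditions(joint(lambda x, z, u3: x ^ z ^ u3))    # hypothetical: Y = g(X, Z, U3)
print(as_posted, variant)
```

Under these assumptions all four conditions hold for the three equations as posted, whereas in the variant only the second (conditioning on Z) survives. The answers depend on which variables each equation listens to and on the independence of the disturbances, not on the particular XOR forms chosen here.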

Dear Judea,

Happy to oblige. Let me take a classic example from the comment by Toru Kitagawa on my Statistical Science paper that we discussed in the earlier thread (the comment, which I highly recommend, is also published in Statistical Science). Toru considers the “classical problem of estimation of a production function. Q denotes the quantity of a homogeneous good produced and L is a measure of an input used. For simplicity let us consider only a labor input (e.g., total hours worked by the employees).” In addition to the quantity produced Q_i and the labor input L_i we also observe the wage rate w_i that firm i faces for a number of firms, indexed by i running from 1 to N.

In line with my comments about my preferences for potential outcomes, Kitagawa does not start by specifying a model for the three variables (Q_i, L_i, w_i). He does not say why, but my guess would be that this would be very difficult, and unnatural for an economist.

Instead Kitagawa starts with the production function Q_i(L) which describes the potential outcomes for quantity produced as a function of the labor input. He writes down a model for these potential outcomes. Specifically, Kitagawa writes: “let us assume that the production technology of firm i is given by the following function: Q_i(L)=exp(b0+a_i)L^b1,” followed by: “This equation can be indeed interpreted as the causal relationship between output and input in the production process of firm i.” a_i here represents unobserved differences between the firms, e.g., the quality of the management or the fixed capital, for example soil quality if the firms were farms. This particular model is of course very simple, but we often have credible assumptions about the relation between the potential outcomes. For example, in the production function example it is generally reasonable to assume that the production function is monotone in its inputs. (Elias and Bryant in the earlier discussion argued there was no place for such assumptions in their early “nonparametric’’ analysis – in contrast such assumptions are viewed as very natural in many economics settings.) These assumptions are part of what Don and I meant when we referred to “the science’’ (lowercase please!).

So, the starting point is this set of potential outcomes, Q_i(L). Kitagawa then considers the decision by the firm regarding the quantity of the labor input, and proposes that this decision follows the rule

L_i=arg max_L {p Q_i(L)-w_iL}

where p is the price of the good. In words, the firms choose the labor input to maximize profits. The value of the labor input variable is determined by the full set of potential outcomes and other variables. One may want to consider alternative decision rules by the firm, but this is a common one in economics. Unconfoundedness here would correspond to L_i being independent of all the Q_i(L) conditional on other stuff, but that would correspond to a very unusual and inefficient firm that would be unlikely to survive for long in a competitive environment.

This is precisely what I mean when I say that it is easier or more natural for me, and I think for many economists, to think of a model for the potential outcomes than for the realized values. It would be difficult to specify directly the link between the observed labor input and the quantity produced, because the choice for the labor input depends on the entire set of potential outcomes. Although in many econometrics textbooks the potential outcomes are not explicitly introduced, they are explicit in the economic theory texts that all economists are exposed to and therefore resonate well with us.

Of course this way of thinking about these problems may be more natural in economics than in, say, epidemiology, and this is why I disagreed with Judea’s blithe dismissal of differences between the disciplines when he wrote that “ Or, are problems in economics different from those in epidemiology? I have examined the structure of typical problems in the two fields, the number of variables involved, the types of data available, and the nature of the research questions. The problems are strikingly similar.” No, they are not!

Sincerely,

guido

Comment by guido imbens — December 5, 2014 @ 12:57 pm
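The production-function example in the comment above can be worked numerically. The sketch below takes Q_i(L) = exp(b0+a_i) L^b1 and the profit-maximizing rule from the comment, and adds assumptions of its own: 0 < b1 < 1 (so the maximum is interior), illustrative parameter values, and the function names `Q`, `optimal_labor` and `grid_argmax`. Under those assumptions the first-order condition p b1 exp(b0+a_i) L^(b1-1) = w_i gives L_i = (p b1 exp(b0+a_i) / w_i)^(1/(1-b1)).

```python
import math

def Q(L, a, b0=0.0, b1=0.5):
    """Potential output at labor input L for a firm with unobserved productivity a."""
    return math.exp(b0 + a) * L ** b1

def optimal_labor(a, w, p=1.0, b0=0.0, b1=0.5):
    """Closed-form maximizer of p*Q(L) - w*L; valid only for 0 < b1 < 1."""
    return (p * b1 * math.exp(b0 + a) / w) ** (1.0 / (1.0 - b1))

def grid_argmax(a, w, p=1.0, b0=0.0, b1=0.5, step=0.001, L_max=5.0):
    """Brute-force check of the same maximization on a grid of labor inputs."""
    grid = [k * step for k in range(1, int(L_max / step) + 1)]
    return max(grid, key=lambda L: p * Q(L, a, b0, b1) - w * L)

low = optimal_labor(a=0.0, w=1.0)    # firm with lower unobserved productivity
high = optimal_labor(a=0.5, w=1.0)   # firm with higher unobserved productivity
print(low, high, grid_argmax(a=0.0, w=1.0))
```

The chosen input rises with a_i, the same unobserved term that shifts the whole curve Q_i(L); this is the dependence between L_i and the potential outcomes that the comment describes as a failure of unconfoundedness.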

Dear Guido,

The example you posted further supports my claim that it is cognitively impossible to work with Science-2 (namely Pr(X,W,Y(0), Y(1))) and that, to specify a problem, one needs to resort to structural equations, namely, to Science-1.

Kitagawa clearly recognized this fact, when he said: “In econometrics terminology, equation (1.2) [in the paper] is interpreted as a structural equation in the sense that it can generate any counterfactual outcomes of unit i with respect to any manipulations of x.”

In other words, we can take ANY textbook structural equation y = f(x,u), put an “(x)” after the y, then read it: Y(x) = f(x,u). Lo and behold, as if by miracle, we obtained the potential outcome Y(x). This is indeed part of the First Law, for the single equation case.

The First Law miracle goes a bit further, guaranteeing that we can generate the entirety of Science-2, Pr(X,W,Y(0), Y(1)), from Science-1. But, at this point, it suffices to note that your example does not start with Science-2, but in the language of Science-1. That a researcher may choose not to write W_i (production activity) explicitly, but absorb it into U, does not negate the fact that the equation is structural, namely, all its components are observables; the potential outcome Y(x) is not IN the equation but is derived FROM the equation, precisely as dictated by the First Law.

Where are we now?

To be convinced that the potential outcome science (namely, Pr(X,W,Y(0), Y(1))) really acts like a “Science”, namely, a mathematical object that represents a researcher’s perception of reality, we need to choose a problem that can be presented and solved in Science-2 without borrowing equations from Science-1.

May I suggest that we start with the IV setting, a setting that we all know fairly well, present it in Science-2 language and we can then compare Science-1 to Science-2 on various dimensions of comparison. Any other problem would do as well, but it needs to be presented in Science-2 language.

The distinction between Science-1 and Science-2 was made crisply clear by Don Rubin; Science-2 is Pr(X,W,Y(0), Y(1)) and Science-1 consists of equations expressed in terms of OBSERVABLES, X, Y, Z, W… as in classical econometric texts, where Y(0) and Y(1) do not appear explicitly, but are replaced with “error terms”, “shocks”, “disturbances”, “omitted factors”, “latent drivers”, “exogenous variables”, in terms of which economists encode what they know about the world.

We know a lot about Science-1, can you show us how to start a problem with Science-2?

Judea

Comment by judea pearl — December 7, 2014 @ 5:27 pm

Dear Judea,

This discussion is rapidly losing focus again, like the discussion in the previous thread. I will make one last attempt to clarify things.

In my previous comment I wrote that in Kitagawa’s example the observed labor input L^obs_i was determined by the key equation

L^obs_i=argmax_L {p Q_i(L)-w_i L}

Thus, in a characterization that makes perfect sense to economists, the level of the input is chosen to maximize profits p Q_i(L)-w_i L. The profits depend on the production function Q_i(L), which is the set of potential values for production as a function of the labor input. In other words, the realized value for the input depends on the full set of potential outcomes, that is, the Q_i(L), for all values of L, not just on the observed value Q_i(L^obs_i).

You write that in “Science-1 are equations expressed in terms of OBSERVABLES, X, Y, Z, W.” (as opposed to being expressed in terms of potential outcomes). Your Science-1 definition clearly does not fit the above equation characterizing the value of the labor input, so your claim that my example supports your claim that “to specify a problem one needs to resort to structural equations, namely, to Science-1” rather than potential outcomes, makes so little sense that it is unlikely to convince the informed readers of your blog.

Sincerely,

Guido Imbens

Comment by guido imbens — December 7, 2014 @ 11:59 pm

Dear Guido,

Three brief comments

1. If I understand you correctly, you seem to be saying that, in order to move from Science-1 (classical econometrics) to Science-2 (Potential Outcomes) all one needs to do is go over the classical expressions for expected-utility-maximization and change the action index of the utility term (U_x) to a potential-outcome parenthesis U(L). Easy! And, if this is so, I would gladly join Science-2 (if deemed eligible). But, then, I do not understand why you and Rubin write chapters on the advantages of Science-2 over Science-1; they seem identical save for a minor change in parentheses.

2. In non-parametric settings, it does not matter if an agent maximizes her expected utility or minimizes that utility or just obeys instructions. All that matters is that the agent responds to certain signals and not to others; the response function itself need not be specified. Therefore, your equation L^obs_i=argmax_L {p Q_i(L)-w_i L} might as well be written L^obs_i = f_i(p, w_i), since L is maximized over.

3. We still have not seen a single example of Science-2, Pr(X,W,Y(0), Y(1)).

Does this probability function exist?

Trying to keep this discussion focused (on Pr),

Judea

Comment by judea pearl — December 8, 2014 @ 4:47 am

Guido, Hernando, and other readers,

I am answering the question that I posed above.

Yes, the probability function Pr(W,X, Y(0), Y(1)) does exist.

The fact that we have not seen it represented explicitly does not mean that it does not exist as an abstract mathematical object, postulated for the purpose of maintaining coherence among properties, such as ignorability, that are needed to justify the use of available statistical methods.

A simple proof that Pr(*) exists is that we can derive it, or its needed properties, from structural equations (using the First Law, see Eq. (1)) and be assured that those derived properties cohere, as though they came directly from some Pr(*). This is nice.

What is still a puzzle to me is why authors who revere Pr(*) as “the science” (with or w/o quotes) do not rejoice in glee at this capability of structural equations to represent “the science” so compactly and meaningfully, and why they shun this capability with such zeal.

I have my own explanation for this puzzle, but I would rather leave it to future historians of statistics to analyze and be amused by.

My agenda for the next week or so is to return to the miracle of the First Law and share with readers the clarity and unification that shine from its wrinkles.

Judea

Comment by judea pearl — December 13, 2014 @ 2:21 am
