On Wednesday December 23 I had the honor of participating in “AI Debate 2”, a symposium organized by Montreal AI, which brought together an impressive group of scholars to discuss the future of AI. I spoke on

“The Domestication of Causal Reasoning: Cultural and Methodological Implications,”

and the reading list I proposed as background material was:

- “The Seven Tools of Causal Inference with Reflections on Machine Learning,” July 2018 https://ucla.in/2HI2yyx
- “Radical Empiricism and Machine Learning Research,” July 26, 2020 https://ucla.in/32YKcWy
- “Data versus Science: Contesting the Soul of Data-Science,” July 7, 2020 https://ucla.in/3iEDRVo

The debate was recorded here https://montrealartificialintelligence.com/aidebate2/ and my talk can be accessed here: https://youtu.be/gJW3nOQ4SEA

Below is an edited script of my talk.

This is the first time I am using the word “domestication” to describe what happened in causality-land in the past 3 decades. I’ve used other terms before: “democratization,” “mathematization,” or “algorithmization,” but Domestication sounds less provocative when I come to talk about the causal revolution.

What makes it a “revolution” is seeing dozens of practical and conceptual problems that only a few decades ago were thought to be metaphysical or unsolvable give way to simple mathematical solutions.

“DEEP UNDERSTANDING” is another term used here for the first time. It so happened that, while laboring to squeeze out results from causal inference engines, I came to realize that we are sitting on a gold mine, and what we are dealing with is none other than:

A computational model of a mental state that deserves the title “Deep Understanding”

“Deep Understanding” is not the nebulous concept that you probably think it is, but something that is defined formally as any system capable of covering all 3 levels of the causal hierarchy: What is – What if – Only if. More specifically: What if I see (prediction) – What if I do (intervention) – and what if I had acted differently (retrospection, in light of the outcomes observed).
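The three levels can be made concrete with a minimal sketch. The toy structural model below is hypothetical (its variables, equations, and noise probabilities are assumptions chosen for illustration, not taken from the talk): a binary confounder Z influences both X and Y, and each query is answered by enumerating the exogenous noise.

```python
from itertools import product

# Hypothetical toy model: P(U = 1) for three independent exogenous bits.
P_U = (0.5, 0.2, 0.1)  # (u_z, u_x, u_y)

def worlds():
    """Yield (probability, (u_z, u_x, u_y)) for every setting of the noise."""
    for u in product((0, 1), repeat=3):
        p = 1.0
        for bit, p_one in zip(u, P_U):
            p *= p_one if bit else 1.0 - p_one
        yield p, u

def solve(u, do_x=None):
    """Evaluate the structural equations; do_x overrides X's own equation."""
    u_z, u_x, u_y = u
    z = u_z
    x = (z ^ u_x) if do_x is None else do_x
    y = (x | z) ^ u_y
    return x, y

def see(x):
    """Level 1, P(Y=1 | X=x): condition on observing X = x."""
    num = den = 0.0
    for p, u in worlds():
        x_obs, y = solve(u)
        if x_obs == x:
            den += p
            num += p * y
    return num / den

def do(x):
    """Level 2, P(Y=1 | do(X=x)): set X = x in every world."""
    return sum(p * solve(u, do_x=x)[1] for p, u in worlds())

def imagine(x_cf, x_obs, y_obs):
    """Level 3, P(Y_{X=x_cf}=1 | X=x_obs, Y=y_obs): keep the worlds that
    match the evidence, then replay them with X forced to x_cf."""
    num = den = 0.0
    for p, u in worlds():
        if solve(u) == (x_obs, y_obs):
            den += p
            num += p * solve(u, do_x=x_cf)[1]
    return num / den

print(see(0), do(0), imagine(0, 1, 1))
```

In this toy model the same setting X = 0 yields three different answers (about 0.26 when seen, 0.5 when done, and 0.8 when imagined after observing X = 1 and Y = 1), which is the sense in which the three levels are distinct.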

This may sound like cheating – I take the capabilities of one system (i.e., a causal model) and I posit them as a general criterion for defining a general concept such as: “Deep Understanding.”

It isn’t cheating. Given that causal reasoning is so deeply woven into our day-to-day language, our thinking, our sense of justice, our humor and of course our scientific understanding, I think that it won’t be too presumptuous of me to propose that we take Causal Modeling as a testing ground of ideas on other modes of reasoning associated with “understanding.”

Specifically, causal models should provide an arena for various theories of explanation, fairness, adaptation, imagination, humor, consciousness, free will, attention, and curiosity.

I also dare speculate that learning from the way causal reasoning was domesticated would benefit researchers in other areas of AI, including vision and NLP, and enable them to examine whether similar paths could be pursued to overcome obstacles that data-centric paradigms have imposed.

I would like now to say a few words on the Anti-Cultural implications of the Causal revolution. Here I refer you to my blog post, https://ucla.in/32YKcWy where I argue that radical empiricism is a stifling culture. It lures researchers into a data-centric paradigm, according to which Data is the source of all knowledge rather than a window through which we learn about the world around us.

What I advocate is a hybrid system that supplements data with domain knowledge, commonsense constraints, culturally transmitted concepts, and most importantly, our innate causal templates that enable toddlers to quickly acquire an understanding of their toy-world environment.

It is hard to find a needle in a haystack; it is much harder if you haven’t seen a needle before. The module we are using for causal inference gives us a picture of what the needle looks like and what you can do once you find one.

When the paper first appeared, in 2001, I had the impression that, although the word “cause” did not appear explicitly, Breiman was trying to distinguish data-descriptive models from models of the data-generation process, also called “causal,” “substantive,” “subject-matter,” or “structural” models. Unhappy with his over-emphasis on prediction, I was glad nevertheless that a statistician of Breiman’s standing had recognized the on-going confusion in the field, and was calling for making the distinction crisp.

Upon re-reading the paper in 2020 I have realized that the two cultures contrasted by Breiman are not descriptive vs. causal but, rather, two styles of descriptive modeling, one interpretable, the other uninterpretable. The former is exemplified by predictive regression models, and the latter by modern big-data algorithms such as deep-learning, BART, trees and forests. The former carries the potential of being interpreted as causal, the latter leaves no room for such interpretation; it describes the prediction process chosen by the analyst, not the data-generation process chosen by nature. Breiman’s main point is: If you want prediction, do prediction for its own sake and forget about the illusion of representing nature.

Breiman’s paper deserves its reputation as a forerunner of modern machine learning techniques, but falls short of telling us what we should do if we want the model to do more than just prediction, say, to extract some information about how nature works, or to guide policies and interventions. For him, accurate prediction is the ultimate measure of merit for statistical models, an objective shared by the present-day machine learning enterprise, which accounts for many of its limitations (https://ucla.in/2HI2yyx).

In their comments on Breiman’s paper, David Cox and Bradley Efron noticed this deficiency and wrote:

“… fit, which is broadly related to predictive success, is not the primary basis for model choice and formal methods of model choice that take no account of the broader objectives are suspect. [The broader objectives are:] to establish data descriptions that are potentially causal.” (Cox, 2001)

And Efron concurs:

“Prediction by itself is only occasionally sufficient. … Most statistical surveys have the identification of causal factors as their ultimate goal.” (Efron, 2001)

As we read Breiman’s paper today, armed with what we know about the proper symbiosis of machine learning and causal modeling, we may say that his advocacy of algorithmic prediction was justified. Once guided by a causal model for identification and bias reduction, the predictive component of our model can safely be trusted to non-interpretable algorithms. The interpretation can be accomplished separately by the causal component of our model, as demonstrated, for example, in https://ucla.in/2HI2yyx.
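As a hedged illustration of this division of labor, suppose (an assumption made here for the sake of the example, not a claim taken from the cited paper) that the causal model identifies a set of observed covariates *z* satisfying the back-door criterion relative to (*x*, *y*). The causal component then supplies the estimand, and the predictive component only has to estimate a conditional distribution:

$$
p(y \mid do(x)) \;=\; \sum_{z} p(y \mid x, z)\, p(z)
$$

The factor *p*(*y*|*x*, *z*) is a purely predictive quantity, so under this assumption it can be fitted by any algorithm, interpretable or not. The interpretation comes from the adjustment formula itself, which only the causal model can license.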

Separating data-fitting from interpretation, an idea that was rather innovative in 2001, has withstood the test of time.

Judea

**ADDENDUM-1**

**ADDENDUM-2**

The following is an email exchange between Ying Nian Wu (UCLA, Statistics) and Judea Pearl (UCLA, Computer Science/Statistics).

Dear Judea,

I feel all models are about making predictions for future observations. The only difference is that causal model is to predict *p*(*y*|*do*(*x*)) in your notation, where the testing data (after cutting off the arrows into *x* by your diagram surgery) come from a different distribution than the training data, i.e., we want to extrapolate from training data to testing data (in fact, extrapolation and interpolation are relative — a simple model that can interpolate a vast range is quite extrapolative). Ultimately a machine learning model also wants to achieve extrapolative prediction, such as the so-called transfer learning and meta learning, where testing data are different from training data, or the current short-term experience (small new training data) is different from the past long-term experience (big past training data).

About learning the model from data, we can learn *p*(*y*|*x*), but we can also learn *p*(*y*, *x*) = *p*(*y*) *p*(*x*|*y*). We may call *p*(*y*|*x*) predictive, and *p*(*x*|*y*) (or *p*(*y*, *x*)) generative, and both may involve hidden variables *z*. The generative model can learn from data where *y* is often unavailable (the so-called semi-supervised learning). In fact, learning a generative model *p*(*y*, *z*, *x*) = *p*(*z*) *p*(*y*, *x*|*z*) is necessary for predicting *p*(*y*|*do*(*x*)). I am not sure if this is also related to the two cultures mentioned by Breiman. I once asked him (at a workshop at Banff, while enjoying some second-hand smoking) about the two models, and he actually preferred generative model, although in his talk, he also emphasized that a non-parametric predictive model such as forest is still interpretable in terms of assessing the influences of variables.

To digress a bit further, there is no such a thing called how nature works according to the Copenhagen interpretation of quantum physics: there must be an observer, the observer makes a measurement, and the wave function predicts the probability distribution of the measurement. As to the question of what happens when there is no observer or the observer is not observing, the answer is that such a question is irrelevant.

Even back to the classical regime where we can ask such a question, Ptolemy’s epicycle model on planet motion, Newton’s model of gravitation, and Einstein’s model of general relativity are not that different. Ptolemy’s model is actually more general and flexible (being a Fourier expansion, where the cycle on top of cycles is similar in style to the perceptron on top of perceptrons of neural network). Newton’s model is simpler, while Einstein’s model fits the data better (being equally simple but more involved in calculation). They are all illusions about how nature works, learned from the data, and intended to predict future data. Newton’s illusion is action at a distance (which he himself did not believe), while Einstein’s illusion is about bending of spacetime, which is more believable, but still an illusion nonetheless (to be superseded by a deeper illusion such as a string).

So Box is still right: all models are wrong, but some are useful. Useful in terms of making predictions, especially making extrapolative predictions.

Ying Nian

Dear Ying Nian,

Thanks for commenting on my “Causally Colored Reflections.”

I will start from the end of your comment, where you concur with George Box that “All models are wrong, but some are useful.” I have always felt that this aphorism is painfully true but hardly useful. As one of the most quoted aphorisms in statistics, it ought to have given us some clue as to what makes one model more useful than another – it doesn’t.

A taxonomy that helps decide model usefulness should tell us (at the very least) whether a given model can answer the research question we have in mind, and where the information encoded in the model comes from. Lumping all models in one category, as in “all models are about making prediction for future observations” does not provide this information. It reminds me of Don Rubin’s statement that causal inference is just a “missing data problem” which, naturally, raises the question of what problems are NOT missing data problems, say, mathematics, chess or astrology.

In contrast, the taxonomy defined by the Ladder of Causation (see https://ucla.in/2HI2yyx): 1. Association, 2. Intervention, 3. Counterfactuals, does provide such information. Merely looking at the syntax of a model one can tell whether it can answer the target research question, and where the information supporting the model should come from, be it observational studies, experimental data, or theoretical assumptions. The main claim of the Ladder (now a theorem) is that one cannot answer questions at level i unless one has information of type i or higher. For example, there is no way to answer policy related questions unless one has experimental data or assumptions about such data. As another example, I look at what you call a generative model *p*(*y*,*z*,*x*) = *p*(*z*)*p*(*y, x*|*z*) and I can tell right away that, no matter how smart we are, it is not sufficient for predicting *p*(*y*|*do*(*x*)).
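A minimal sketch of this insufficiency, using a hypothetical two-variable example (the hidden *z* is omitted for brevity, and the model names and the 0.1 noise level are assumptions made here): two structural models induce exactly the same joint distribution over (*x*, *y*) yet disagree on *p*(*y*|*do*(*x*)), so no amount of cleverness applied to the joint alone can recover the interventional quantity.

```python
from itertools import product

NOISE = 0.1  # hypothetical probability that the child flips its parent's value

def joint_and_do(model):
    """Return (joint distribution over (x, y), P(Y=1 | do(X=1))) for a model.

    "x_causes_y": X is a fair coin, Y copies X with occasional flips.
    "y_causes_x": Y is a fair coin, X copies Y with occasional flips.
    """
    joint = {}
    p_y1_do_x1 = 0.0
    for root, flip in product((0, 1), repeat=2):
        p = 0.5 * (NOISE if flip else 1.0 - NOISE)
        if model == "x_causes_y":
            x, y = root, root ^ flip
            y_after_do = 1 ^ flip   # Y responds to the forced X = 1
        else:
            y, x = root, root ^ flip
            y_after_do = y          # forcing X leaves Y untouched
        joint[(x, y)] = joint.get((x, y), 0.0) + p
        p_y1_do_x1 += p * y_after_do
    return joint, p_y1_do_x1

joint_a, do_a = joint_and_do("x_causes_y")
joint_b, do_b = joint_and_do("y_causes_x")
same_joint = all(abs(joint_a[k] - joint_b[k]) < 1e-12 for k in joint_a)
print(same_joint, do_a, do_b)
```

Both models fit any observational data set equally well; the difference between them lives entirely in the direction of the arrow, which is information of a higher level than the data.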

If you doubt the usefulness of this taxonomy, just examine the amount of effort spent (and still being spent) by the machine learning community on the so-called “transfer learning” problem. This effort has been futile because elementary inspection of the extrapolation task tells us that it cannot be accomplished using non-experimental data, shifting or not. See https://ucla.in/2N7S0K9.

In summary, unification of research problems is helpful when it facilitates the transfer of tools across problem types. Taxonomy of research problems is helpful too; for it spares us the efforts of trying the impossible, and it tells us where we should seek the information to support our models.

Thanks again for engaging in this conversation,

Judea

Dear Judea,

Thanks for the inspiring discussion. Please allow me to formulate our consensus, and I will stop here.

Unification 1: All models are for prediction.

Unification 2: All models are for the agent to plan the action. Unification 2 is deeper than Unification 1. But Unification 1 is a good precursor.

Taxonomy 1: (a) models that predict *p*(*y*|*x*). (b) models that predict *p*(*y*|*do*(*x*)) or (c) models that can fill in Rubin’s table.

Taxonomy 2: (a) models that fit data, not necessarily make sense, only for prediction. (b) models that understand how nature works and are interpretable.

Taxonomy 1 is deeper and more precise than Taxonomy 2, thanks to the foundational work of you and Rubin. It is based on precise, well-defined, operational mathematical language and formulation.

Taxonomy 2 is useful and is often aligned with Taxonomy 1, but we need to be aware of the limitation of Taxonomy 2, which is all I want to say in my comments. Much ink has been spilled on Taxonomy 2 because of its imprecise and non-operational nature.

Ying Nian

The statement was taken as self-evident by the audience, and set the stage for a lecture on how the nature of “knowledge” can be analyzed by examining patterns of conditional probabilities in the data. Naturally, it invoked no notions such as “external world,” “theory,” “data generating process,” “cause and effect,” “agency,” or “mental constructs” because, ostensibly, these notions, too, should emerge from the data if needed. In other words, whatever concepts humans invoke in interpreting data, be their origin cultural, scientific or genetic, can be traced to, and re-derived from the original sensory experience that has endowed those concepts with survival value.

Viewed from an artificial intelligence perspective, this data-centric philosophy offers an attractive, if not seductive, agenda for machine learning research: In order to develop human level intelligence, we should merely trace the way our ancestors did it, and simulate both genetic and cultural evolutions on a digital machine, taking as input all the data that we can possibly collect. Taken to extremes, such an agenda may inspire fairly futuristic and highly ambitious scenarios: start with a simple neural network, resembling a primitive organism (say an Amoeba), let it interact with the environment, mutate and generate offspring; given enough time, it will eventually emerge with an Einstein’s level of intellect. Indeed, ruling out sacred scriptures and divine revelation, where else could Einstein acquire his knowledge, talents and intellect if not from the stream of raw data that has impinged upon the human race since antiquity, including of course all the sensory inputs received by more primitive organisms preceding humans.

Before asking how realistic this agenda is, let us preempt the discussion with two observations:

(1) Simulated evolution, in some form or another, is indeed the leading paradigm inspiring most machine learning researchers today, especially those engaged in connectionism, deep learning and neural networks technologies which deploy model-free, statistics-based learning strategies. The impressive success of these strategies in applications such as computer vision, voice recognition and self-driving cars has stirred up hopes in the sufficiency and unlimited potentials of these strategies, eroding, at the same time, interest in model-based approaches.

(2) The intellectual roots of the data-centric agenda are deeply grounded in the empiricist branch of Western philosophy, according to which sense-experience is the ultimate source of all our concepts and knowledge, with little or no role given to “innate ideas” and “reason” as sources of knowledge (Markie, 2017). Empiricist ideas can be traced to the ancient writings of Aristotle, but have been given prominence by the British empiricists Francis Bacon, John Locke, George Berkeley and David Hume and, more recently, by philosophers such as Charles Sanders Peirce and William James. Modern connectionism has in fact been viewed as a Triumph of Radical Empiricism over its rationalistic rivals (Buckner 2018; Lipton, 2015). It can definitely be viewed as a testing ground in which philosophical theories about the balance between empiricism and innateness can be submitted to experimental evaluation on digital machines.

The merits of testing philosophical theories notwithstanding, I have three major reservations about the wisdom of pursuing a radical empiricist agenda for machine learning research. I will present three arguments why empiricism should be balanced with the principles of model-based science (Pearl, 2019), in which learning is guided by two sources of information: (a) data and (b) man-made models of how data are generated.

I label the three arguments: (1) Expediency, (2) Transparency and (3) Explainability, and will discuss them in turn below:

Evolution is too slow a process (Turing, 1950), since most mutations are useless if not harmful, and waiting for natural selection to distinguish and filter the useful from the useless is often unaffordable. The bulk of machine learning tasks requires speedy interpretation of, and quick reaction to, new and sparse data, too sparse to allow filtering by random mutations. The outbreak of the COVID-19 pandemic is a perfect example of a situation where sparse data, arriving from unreliable and heterogeneous sources, required quick interpretation and quick action, based primarily on prior models of epidemic transmission and data production (https://ucla.in/3iEDRVo). In general, machine learning technology is expected to harness a huge amount of scientific knowledge already available, combine it with whatever data can be gathered, and solve crucial societal problems in areas such as health, education, ecology and economics.

Even more importantly, scientific knowledge can speed up evolution by actively guiding the selection or filtering of data and data sources. Choosing what data to consider or what experiments to run requires hypothetical theories of what outcomes are expected from each option, and how likely they are to improve future performance. Such expectations are provided, for example, by causal models that predict both the outcomes of hypothetical manipulations as well as the consequences of counterfactual undoing of past events (Pearl, 2019).

World knowledge, even if evolved spontaneously from raw data, must eventually be compiled and represented in some machine form to be of any use. The purpose of compiled knowledge is to amortize the discovery process over many inference tasks without repeating the former. The compiled representation should then facilitate an efficient production of answers to a select set of decision problems, including questions on ways of gathering additional data. Some representations allow for such inferences and others do not. For example, knowledge compiled as patterns of conditional probability estimates does not allow for predicting the effect of actions or policies (Pearl, 2019).

Knowledge compilation involves both abstraction and re-formatting. The former allows for information loss (as in the case of probability models) while the latter retains the information content and merely transforms some of the information from implicit to explicit representations.

These considerations demand that we study the mathematical properties of compiled representations, their inherent limitations, the kind of inferences they support, and how effective they are in producing the answers they are expected to produce. In more concrete terms, machine learning researchers should engage in what is currently called “causal modelling” and use the tools and principles of causal science to guide data exploration and data interpretation processes.

Regardless of how causal knowledge is accumulated, discovered or stored, the inferences enabled by that knowledge are destined to be delivered to, and benefit, a human user. Today, these usages include policy evaluation, personal decisions, generating explanations, assigning credit and blame or making general sense of the world around us. All inferences must therefore be cast in a language that matches the way people organize their world knowledge, namely, the language of cause and effect. It is imperative therefore that machine learning researchers, regardless of the methods they deploy for data fitting, be versed in this user-friendly language, its grammar, its universal laws and the way humans interpret or misinterpret the functions that machine learning algorithms discover.

It is a mistake to equate the content of human knowledge with its sense-data origin. The format in which knowledge is stored in the mind (or on a computer) and, in particular, the balance between its implicit vs. explicit components are as important for its characterization as its content or origin.

While radical empiricism may be a valid model of the evolutionary process, it is a bad strategy for machine learning research. It gives a license to the data-centric thinking, currently dominating both statistics and machine learning cultures, according to which the secret to rational decisions lies in the data alone.

A hybrid strategy balancing “data-fitting” with “data-interpretation” better captures the stages of knowledge compilation that the evolutionary process entails.

Buckner, C. (2018) “Deep learning: A philosophical introduction,”

Lipton, Z. (2015) “Deep Learning and the Triumph of Empiricism,” *KDnuggets News*, July. Retrieved from: https://www.kdnuggets.com/2015/07/deep-learning-triumph-empiricism-over-theoretical-mathematical-guarantees.html.

Markie, P. (2017) “Rationalism vs. Empiricism,” *Stanford Encyclopedia of Philosophy*, https://plato.stanford.edu/entries/rationalism-empiricism/.

Pearl, J. (2019) “The Seven Tools of Causal Inference with Reflections on Machine Learning,” *Communications of ACM*, 62(3): 54-60, March, https://cacm.acm.org/magazines/2019/3/234929-the-seven-tools-of-causal-inference-with-reflections-on-machine-learning/fulltext.

Turing, A.M. (1950) “Computing Machinery and Intelligence,” *Mind*, LIX (236): 433-460, October, https://doi.org/10.1093/mind/LIX.236.433.

The following email exchange with Yoshua Bengio clarifies the claims and aims of the post above.

**Yoshua Bengio commented Aug 3 2020 2:21 pm**

Hi Judea,

Thanks for your blog post! I have a high-level comment. I will start from your statement that “learning is guided by two sources of information: (a) data and (b) man-made models of how data are generated. ” This makes sense in the kind of setting you have often discussed in your writings, where a scientist has strong structural knowledge and wants to combine it with data in order to arrive at some structural (e.g. causal) conclusions. But there are other settings where this view leaves me wanting more. For example, think about a baby before about 3 years old, before she can gather much formal knowledge of the world (simply because her linguistic abilities are not yet developed or not enough developed, not to mention her ability to consciously reason). Or think about how a chimp develops an intuitive understanding of his environment which includes cause and effect. Or about an objective to build a robot which could learn about the world without relying on human-specified theories. Or about an AI which would have as a mission to discover new concepts and theories which go well beyond those which humans provide. In all of these cases we want to study how both statistical and causal knowledge can be (jointly) discovered. Presumably this may be from observations which include changes in distribution due to interventions (our learning agent’s or those of other agents). These observations are still data, just of a richer kind than what current purely statistical models (I mean trying to capture only joint distributions or conditional distribution) are built on. Of course, we *also* need to build learning machines which can interact with humans, understand natural language, explain their decisions (and our decisions), and take advantage of what human culture has to offer. 
Not taking advantage of knowledge when we have it may seem silly, but (a) our presumed knowledge is sometimes wrong or incomplete, (b) we still want to understand how pre-linguistic intelligence manages to make sense of the world (including of its causal structure), and (c) forcing us into this more difficult setting could also hasten the discovery of the learning principles required to achieve (a) and (b).

Cheers and thanks again for your participation in our recent CIFAR workshop on causality!

— Yoshua

**Judea Pearl reply, August 4 5:53 am**

Hi Yoshua,

The situation you are describing: “where a scientist has strong structural knowledge and wants to combine it with data in order to arrive at some structural (e.g. causal) conclusions” motivates only the first part of my post (labeled “expediency”). But the enterprise of causal modeling brings another resource to the table. In addition to domain specific knowledge, it brings a domain-independent “template” that houses that knowledge and which is useful for precisely the “other settings” you are aiming to handle:

“a baby before about 3 years old, before she can gather much formal knowledge of the world … Or think about how a chimp develops an intuitive understanding of his environment which includes cause and effect. Or about an objective to build a robot which could learn about the world without relying on human-specified theories.”

A baby and a chimp exposed to the same stimuli will not develop the same understanding of the world, because the former starts with a richer inborn template that permits it to organize, interpret and encode the stimuli into a more effective representation. This is the role of “compiled representations” mentioned in the second part of my post. (And by “stimuli”, I include “playful manipulations”).

In other words, the baby’s template has a richer set of blanks to be filled than the chimp’s template, which accounts for Alison Gopnik’s finding of a greater reward-neutral curiosity in the former.

The science of Causal Modeling proposes a concrete embodiment of that universal “template”. The mathematical properties of the template, its inherent limitations and inferential and algorithmic capabilities should therefore be studied by every machine learning researcher, regardless of whether she obtains it from a domain expert or discovers it on her own from invariant features of the data.

Finding a needle in a haystack is difficult, and it’s close to impossible if you haven’t seen a needle before. Most ML researchers today have not seen a needle — an educational gap that needs to be corrected in order to hasten the discovery of those learning principles you aspire to uncover.

Cheers and thanks for inviting me to participate in your CIFAR workshop on causality.

— Judea

**Yoshua Bengio comment Aug. 4, 7:00 am**

Agreed. What you call the ‘template’ is something I sort in the machine learning category of ‘inductive biases’ which can be fairly general and allow us to efficiently learn (and here discover representations which build a causal understanding of the world).

— Yoshua

The post below is written for the upcoming Spanish translation of The Book of Why, which was announced today. It expresses my firm belief that the current data-fitting direction taken by “Data Science” is temporary (read my lips!), that the future of “Data Science” lies in causal data interpretation and that we should prepare ourselves for the backlash swing.

Much has been said about how ill-prepared our health-care system was in coping with catastrophic outbreaks like COVID-19. Yet viewed from the corner of my expertise, the ill-preparedness can also be seen as a failure of information technology to keep track of and interpret the outpouring of data that have arrived from multiple and conflicting sources, corrupted by noise and omission, some by sloppy collection and some by deliberate misreporting. AI could and should have equipped society with intelligent data-fusion technology, to interpret such conflicting pieces of information and reason its way out of the confusion.

Speaking from the perspective of causal inference research, I have been part of a team that has developed a complete theoretical underpinning for such “data-fusion” problems; a development that is briefly described in Chapter 10 of

“Data fitting” is the name I frequently use to characterize the data-centric thinking that dominates both statistics and machine learning cultures, in contrast to the “data-interpretation” thinking that guides causal inference. The data-fitting school is driven by the faith that the secret to rational decisions lies in the data itself, if only we are sufficiently clever at data mining. In contrast, the data-interpreting school views data, not as a sole object of inquiry but as an auxiliary means for interpreting reality, and “reality” stands for the processes that generate the data.

I am not alone in this assessment. Leading researchers in the “Data Science” enterprise have come to realize that machine learning as it is currently practiced cannot yield the kind of understanding that intelligent decision making requires. However, what many fail to realize is that the transition from data-fitting to data-understanding involves more than a technology transfer; it entails a profound paradigm shift that is traumatic if not impossible. Researchers whose entire productive careers have committed them to the supposition that all knowledge comes from the data cannot easily transfer allegiance to a totally alien paradigm, according to which extra-data information is needed, in the form of man-made, causal models of reality. Current machine learning thinking, which some describe as “statistics on steroids,” is deeply entrenched in this self-propelled ideology.

Ten years from now, historians will be asking: How could scientific leaders of the time allow society to invest almost all its educational and financial resources in data-fitting technologies and so little on data-interpretation science?

A related, yet perhaps more critical topic that came up in handling the COVID-19 pandemic, is the issue of personalized care. Much of current health-care methods and procedures are guided by population data, obtained from controlled experiments or observational studies. However, the task of going from these data to the level of individual behavior requires counterfactual logic, which has been formalized and algorithmatized in the past 2 decades (as narrated in Chapter 8 of

The immediate area where this development could have assisted the COVID-19 pandemic predicament concerns the question of prioritizing patients who are in “greatest need” for treatment, testing, or other scarce resources. “Need” is a counterfactual notion (i.e., patients who would have gotten worse had they not been treated) and cannot be captured by statistical methods alone. A recently posted blog page https://ucla.in/39Ey8sU demonstrates in vivid colors how counterfactual analysis handles this prioritization problem.

The entire enterprise known as “personalized medicine” and, more generally, any enterprise requiring inference from populations to individuals, rests on counterfactual analysis, and AI now holds the key theoretical tools for operationalizing this analysis.

People ask me why these capabilities are not part of the standard tool sets available for handling health-care management. The answer lies again in training and education. We have been rushing too eagerly to reap the low-hanging fruit of big data and data-fitting technologies, at the cost of neglecting data-interpretation technologies. Data-fitting is addictive, and building more “data-science centers” only intensifies the addiction. Society is waiting for visionary leadership to balance this over-indulgence by establishing research, educational and training centers dedicated to “causal science.”

I hope it happens soon, for we must be prepared for the next pandemic outbreak and the information confusion that will probably come in its wake.

This post reports on the presence of Simpson’s paradox in the latest CDC data on coronavirus. At first glance, the data may seem to support the notion that coronavirus is especially dangerous to white, non-Hispanic people. However, when we take into account the causal structure of the data, and most importantly we think about what causal question we want to answer, the conclusion is quite different. This gives us an opportunity to emphasize a point that was perhaps not stressed enough in

**Race, COVID Mortality, and Simpson’s Paradox**

Recently I was perusing the latest data on coronavirus on the Centers for Disease Control (CDC) website. When I got to the two graphs shown below, I did a double-take.

COVID-19 Cases and Deaths by Race and Ethnicity (CDC, 6/30/2020).

This is a lot to take in, so let me point out what shocked me. The first figure shows that 35.3 percent of diagnosed COVID cases were in “white, non-Hispanic” people. But 49.5 percent of COVID deaths occurred to people in this category. In other words, whites who have been diagnosed as COVID-positive have a 40 percent greater risk of death than non-whites or Hispanics who have been diagnosed as COVID-positive.

This, of course, is the exact opposite of what we have been hearing in the news media. (For example, Graeme Wood in *The Atlantic*: “Black people die of COVID-19 at a higher rate than white people do.”) Have we been victimized by a media deception? The answer is NO, but the explanation underscores the importance of understanding the causal structure of data and interrogating that data using properly phrased causal queries.

Let me explain, first, why the data above cannot be taken at face value. The elephant in the room is age, which is the single biggest risk factor for death due to COVID-19. Let’s look at the CDC mortality data again, but this time stratifying by age group.

| Age | White, non-Hispanic: Cases | White, non-Hispanic: Deaths | Others: Cases | Others: Deaths |
|---|---|---|---|---|
| 0-4 | 23.9% | 53.3% | 76.1% | 46.7% |
| 5-17 | 19% | 9.1% | 81% | 90.9% |
| 18-29 | 29.8% | 18.9% | 70.2% | 81.1% |
| 30-39 | 26.5% | 16.4% | 73.5% | 83.6% |
| 40-49 | 26.5% | 16.4% | 73.5% | 83.6% |
| 50-64 | 36.4% | 16.4% | 63.6% | 83.6% |
| 65-74 | 45.9% | 40.8% | 54.1% | 59.2% |
| 75-84 | 55.4% | 52.1% | 44.6% | 47.9% |
| 85+ | 69.6% | 67.6% | 30.4% | 32.4% |
| ALL AGES | 35.4% | 49.5% | 64.6% | 50.5% |

This table shows us that in every age category (except ages 0-4), whites have a lower case fatality rate than non-whites. That is, whites make up a lower percentage of deaths than cases. But when we aggregate all of the ages, whites have a higher fatality rate. The reason is simple: whites are older.
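For readers who want to check the reversal themselves, here is a minimal Python sketch. It relies on one observation: the ratio of the white to the non-white case fatality rate can be computed from the percentage shares alone, because the total numbers of cases and deaths cancel out of the ratio. The shares are copied from the table above; the names `WHITE_SHARE` and `relative_cfr` are my own and purely illustrative.

```python
# Percentage shares of cases and deaths among white, non-Hispanic people,
# copied from the table above. The "Others" share is the remainder to 100.
WHITE_SHARE = {
    "0-4": (23.9, 53.3),
    "5-17": (19.0, 9.1),
    "18-29": (29.8, 18.9),
    "30-39": (26.5, 16.4),
    "40-49": (26.5, 16.4),
    "50-64": (36.4, 16.4),
    "65-74": (45.9, 40.8),
    "75-84": (55.4, 52.1),
    "85+": (69.6, 67.6),
    "ALL AGES": (35.4, 49.5),
}


def relative_cfr(case_share, death_share):
    """White / non-white case fatality ratio, computed from percentage shares."""
    white = death_share / case_share
    other = (100.0 - death_share) / (100.0 - case_share)
    return white / other


ratios = {age: relative_cfr(c, d) for age, (c, d) in WHITE_SHARE.items()}
for age, ratio in ratios.items():
    # A ratio below 1 means whites fare better than non-whites in that row.
    print(f"{age:>8}: {ratio:.2f}")
```

With these shares, the ratio is below 1 in every age group except 0-4, yet it is well above 1 (roughly 1.8) in the ALL AGES row. That is the reversal in a single column of numbers.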

According to U.S. census data (not shown here), 9 percent of the white population in the United States is over age 75. By comparison, only 4 percent of Black people and 3 percent of Hispanic people have reached the three-quarter-century mark. People over age 75 are exactly the ones who are at greatest risk of dying from COVID (and by a wide margin). Thus the white population contains more than twice as many high-risk people as the Black population, and three times as many high-risk people as the Hispanic population.

People who have taken a course in statistics may recognize the phenomenon we have uncovered here as Simpson’s paradox. To put it most succinctly, and most paradoxically, if you tell me that you are white and COVID-positive, but do not tell me your age, I have to assume you have a higher risk of dying than your neighbor who is Black and COVID-positive. But if you do tell me your age, your risk of dying becomes less than your neighbor who is Black and COVID-positive and the same age. How can that be? Surely the act of telling me your age should not make any difference to your medical condition.

In introductory statistics courses, Simpson’s paradox is usually presented as a curiosity, but the COVID data shows that it raises a fundamental question. Which is a more accurate picture of reality? The one where I look only at the aggregate data and conclude that whites are at greater risk of dying, or the one where I break the data down by age and conclude that non-whites are at greater risk?

The general answer espoused by introductory statistics textbooks is: control for everything. If you have age data, stratify by age. If you have data on underlying medical conditions, or socioeconomic status, or anything else, stratify by those variables too.

This “one-size-fits-all” approach is misguided because it ignores the causal story behind the data. In *The Book of Why*, we look at a fictional example of a drug that is intended to prevent heart attacks by lowering blood pressure. We can summarize the causal story in a diagram:

Here blood pressure is what we call a mediator, an intervening variable through which the intervention produces its effect. We also allow for the possibility that the drug may directly influence the chances of a heart attack in other, unknown ways, by drawing an arrow directly from “Drug” to “Heart Attack.”

The diagram tells us how to interrogate the data. Because we want to know the drug’s *total effect* on the patient, through the intended route as well as other, unintended routes, we should *not* stratify the data. That is, we should not separate the experimental data into “high-blood-pressure” and “low-blood-pressure” groups. In our book, we give (fictitious) experimental data in which the drug increases the risk of heart attack among people in the low-blood-pressure group and among people in the high-blood-pressure group (presumably because of side effects). But at the same time, and most importantly, it *shifts* patients from the high-risk high-blood-pressure group into the low-risk low-blood-pressure group. Thus its *total effect* is beneficial, even though its effect on each stratum appears to be harmful.
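The arithmetic behind this pattern can be sketched in a few lines of Python. The numbers below are hypothetical, chosen only to reproduce the qualitative pattern; they are not the figures given in the book, and the names `P_LOW_BP`, `RISK` and `total_risk` are my own.

```python
# Hypothetical numbers for illustration only (NOT the book's data).
# P(low blood pressure | treatment): the drug shifts patients toward low BP.
P_LOW_BP = {"drug": 0.8, "no_drug": 0.2}

# P(heart attack | treatment, blood pressure): within each stratum the drug
# raises the risk slightly, standing in for side effects.
RISK = {
    ("drug", "low"): 0.08,
    ("drug", "high"): 0.25,
    ("no_drug", "low"): 0.05,
    ("no_drug", "high"): 0.20,
}


def total_risk(treatment):
    """Overall heart-attack risk, averaging over the blood pressure the treatment induces."""
    p_low = P_LOW_BP[treatment]
    return p_low * RISK[(treatment, "low")] + (1 - p_low) * RISK[(treatment, "high")]


for treatment in ("drug", "no_drug"):
    print(f"{treatment}: total risk = {total_risk(treatment):.3f}")
```

In this toy setting the drug looks harmful in both blood-pressure strata, yet its total risk is lower than without it, because most treated patients end up in the low-risk stratum. Stratifying on the mediator would hide exactly the pathway we care about.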

It’s interesting to compare this fictitious example to the all-too-real COVID example, which I would argue has a very similar causal structure:

The causal arrow from “race” to “age” means that your race influences your chances of living to age 75 or older. In this diagram, Age is a mediator between Race and Death from COVID; that is, it is a mechanism through which Race acts. As we saw in the data, it’s quite a potent mechanism; in fact, it accounts for why white people who are COVID-positive die more often.

Because the two causal diagrams are the same, you might think that in the second case, too, we should not stratify the data; instead we should use the aggregate data and conclude that COVID is a disease that “discriminates” against whites.

However, this argument ignores the second key ingredient I mentioned earlier: *interrogating the data using correctly phrased causal queries*.

What is our query in this case? It’s different from what it was in the drug example. In that case, we were looking at the drug as a preventative for a heart attack. If we were to look at the COVID data in the same way, we would ask, “What is the total lifetime effect of intervening (before birth) to change a person’s race?” And yes: if we could perform that intervention, and if our *sole objective* was to prevent death from COVID, we would choose to change our race from white to non-white. The “benefit” of that intervention would be that we would *never live* to an age where we were at high risk of dying from COVID.

I’m sure you can see, without my even explaining it, that this is not the query any reasonable person would pose. “Saving” lives from COVID by making them end earlier for other reasons is not a justifiable health policy.

Thus, the query we want to interrogate the data with is not “What is the total effect?” but “What is the direct effect?” As we explain on page 312 of *The Book of Why*, this is always the query we are interested in when we talk about discrimination. If we want to know whether our health-care system discriminates against a certain ethnic group, then we want to *hold all other variables constant* that might account for the outcome, and see what is the effect of changing Race alone. In this case, that means stratifying the data by Age, and the result is that we do see evidence of discrimination. Non-whites do worse at (almost) every age. As Wood writes, “The virus knows no race or nationality; it can’t peek at your driver’s license or census form to check whether you are black. *Society* checks for it, and provides the discrimination on the virus’s behalf.”

To reiterate: The causal story here is identical to the Drug-Blood Pressure-Heart Attack example. What has changed is our query. Precision is required both in formulating the causal model, and in deciding what is the question we want to ask of it.

I wanted to place special emphasis on the query because I recently was asked to referee an article about Simpson’s paradox that missed this exact point. Of course I cannot tell you more about the author or the journal. (I don’t even know who the author is.) It was a good article overall, and I hope that it will be published with a suitable revision.

In the meantime, there is plenty of room for further exploration of the coronavirus epidemic with causal models. Undoubtedly the diagram above is too simple; unfortunately, if we make it more realistic by including more variables, we may not have any data available to interrogate. In fact, even in this case there is a huge amount of missing data: 51 percent of the COVID cases have unknown race/ethnicity, and 19 percent of the deaths. Thus, while we can learn an excellent lesson about Simpson’s paradox and some probable lessons about racial inequities, we have to present the results with some caution. Finally, I would like to draw attention to something curious in the CDC data: The case fatality rate for whites in the youngest age group, ages 0-4, is much higher than for non-whites. I don’t know how to explain this, and I would think that someone with an interest in pediatric COVID cases should investigate.

For me, David represents mainstream statistics, and the reason I find his perspective so valuable is that he does not have a stake in causality and its various formulations. Like most mainstream statisticians, he is simply curious to understand what the big fuss is all about and how to communicate differences among various approaches without taking sides.

So, I’ll let David start, and I hope you find it useful.

**Judea Pearl Interview by David Hand**

There are some areas of statistics which seem to attract controversy and disagreement, and causal modelling is certainly one of them. In an attempt to understand what all the fuss is about, I asked Judea Pearl about these differences in perspective. Pearl is a world leader in the scientific understanding of causality. He is a recipient of the ACM Turing Award (computing’s “Nobel Prize”), for “fundamental contributions to artificial intelligence through the development of a calculus for probabilistic and causal reasoning”, the David E. Rumelhart Prize for Contributions to the Theoretical Foundations of Human Cognition, and is a Fellow of the American Statistical Association.

**QUESTION 1:**

I am aware that causal modelling is a hotly contested topic, and that there are alternatives to your perspective – the work of statisticians Don Rubin and Phil Dawid springs to mind, for example. Words like counterfactual, Popperian falsifiability, potential outcomes, appear. I’d like to understand the key differences between the various perspectives, so can you tell me what are the main grounds on which they disagree?

**ANSWER 1:**

You might be surprised to hear that, despite what seems to be hotly contested debates, there are very few philosophical differences among the various “approaches.” And I put “approaches” in quotes because the differences are more among historical traditions, or “frameworks” than among scientific principles. If we compare, for example, Rubin’s potential outcome with my framework, named “Structural Causal Models” (SCM), we find that the two are logically equivalent; a theorem in one is a theorem in the other and an assumption in one can be written as an assumption in the other. This means that, starting with the same set of assumptions, every solution obtained in one can also be obtained in the other.

But logical equivalence does not mean “modeling equivalence” when we consider issues such as transparency, credibility or tractability. The equations for straight lines in polar coordinates are equivalent to those in Cartesian coordinates yet are hardly manageable when it comes to calculating areas of squares or triangles.

In SCM, assumptions are articulated in the form of equations among measured variables, each asserting how one variable responds to changes in another. Graphical models are simple abstractions of those equations and, remarkably, are sufficient for answering many causal questions when applied to non-experimental data. An arrow X—>Y in a graphical model represents the capacity to respond to such changes. All causal relationships are derived mechanically from those qualitative primitives, demanding no further judgment of the modeller.
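To make this concrete, here is a toy sketch of what “equations among variables” and an arrow X—>Y amount to computationally. The model (a confounder Z with arrows Z—>X, Z—>Y and X—>Y), its coefficients and the function names are all assumptions of mine, invented for illustration. An intervention do(X = x) is simulated by replacing the equation for X and leaving the others intact.

```python
import random


def sample(rng, do_x=None):
    """Draw one unit from a toy structural causal model: Z -> X, Z -> Y, X -> Y."""
    z = rng.random() < 0.5                        # Z := U_z
    if do_x is None:
        x = rng.random() < (0.8 if z else 0.2)    # X := f_x(Z, U_x)
    else:
        x = do_x                                  # do(X = x): X's equation is replaced
    y = rng.random() < 0.1 + 0.2 * x + 0.5 * z    # Y := f_y(X, Z, U_y)
    return z, x, y


def mean_y(units):
    """Fraction of units with Y = 1."""
    units = list(units)
    return sum(y for _, _, y in units) / len(units)


rng = random.Random(0)
N = 100_000

# Level 1 (seeing): compare units that happen to have X = 1 with those that have X = 0.
observed = [sample(rng) for _ in range(N)]
obs_diff = (mean_y(u for u in observed if u[1])
            - mean_y(u for u in observed if not u[1]))

# Level 2 (doing): set X by intervention, which cuts the arrow Z -> X.
do_diff = (mean_y(sample(rng, do_x=True) for _ in range(N))
           - mean_y(sample(rng, do_x=False) for _ in range(N)))

print(f"observed difference:       {obs_diff:.2f}")
print(f"interventional difference: {do_diff:.2f}")
```

In this toy model the observed difference is about 0.5 while the interventional difference is about 0.2, the coefficient of X in the equation for Y. The gap is the confounding introduced by Z, and it is read off the equations rather than judged case by case.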

In Rubin’s framework, assumptions are expressed as conditional independencies among counterfactual variables, also known as “ignorability conditions.” The mental task of ascertaining the plausibility of such assumptions is beyond anyone’s capacity, which makes them extremely hard for researchers to articulate or to verify. For example, the task of deciding which measurements to include in the analysis (or in the propensity score) is intractable in the language of conditional ignorability. Judging whether the assumptions are compatible with the available data is another task that is trivial in graphical models and insurmountable in the potential outcome framework.

Conceptually, the differences can be summarized thus: The graphical approach goes where scientific knowledge resides, while Rubin’s approach goes where statistical routines need to be justified. The difference shines through when simple problems are solved side by side in both approaches, as in my book Causality (2009). The main reason differences between approaches are still debated in the literature is that most statisticians are watching these debates as outsiders, instead of trying out simple examples from beginning to end. Take for example Simpson’s paradox, a puzzle that has intrigued a century of statisticians and philosophers. It is still as vexing to most statisticians today as it was to Pearson in 1889, and the task of deciding which data to consult, the aggregated or the disaggregated is still avoided by all statistics textbooks.

To summarize, causal modeling, a topic that should be of prime interest to all statisticians, is still perceived to be a “hotly contested topic”, rather than the main frontier of statistical research. The emphasis on “differences between the various perspectives” prevents statisticians from seeing the exciting new capabilities that now avail themselves, and which “enable us to answer questions that we have always wanted but were afraid to ask.” It is hard to tell whether fears of those “differences” prevent statisticians from seeing the excitement, or the other way around, and cultural inhibitions prevent statisticians from appreciating the excitement, and drive them to discuss “differences” instead.

**QUESTION 2:**

There are different schools of statistics, but I think that most modern pragmatic applied statisticians are rather eclectic, and will choose a method which has the best capability to answer their particular questions. Does the same apply to approaches to causal modelling? That is, do the different perspectives have strengths and weaknesses, and should we be flexible in our choice of approach?

**ANSWER 2:**

These strengths and weaknesses are seen clearly in the SCM framework, which unifies several approaches and provides a flexible way of leveraging the merits of each. In particular, SCM combines graphical models and potential outcome logic. The graphs are used to encode what we know (i.e., the assumptions we are willing to defend) and the logic is used to encode what we wish to know, that is, the research question of interest. Simple mathematical tools can then combine these two with data and produce consistent estimates.

The availability of these unifying tools now calls on statisticians to become actively involved in causal analysis, rather than attempting to judge approaches from a distance. The choice of approach will become obvious once research questions are asked and the stage is set to articulate subject matter information that is necessary in answering those questions.

**QUESTION 3:**

To a very great extent the modern big data revolution has been driven by so-called “databased” models and algorithms, where understanding is not necessarily relevant or even helpful, and where there is often no underlying theory about how the variables are related. Rather, the aim is simply to use data to construct a model or algorithm which will predict an outcome from input variables (deep learning neural networks being an illustration). But this approach is intrinsically fragile, relying on an assumption that the data properly represent the population of interest. Causal modelling seems to me to be at the opposite end of the spectrum: it is intrinsically “theory-based”, because it has to begin with a causal model. In your approach, described in an accessible way in your recent book The Book of Why, such models are nicely summarised by your arrow charts. But don’t theory-based models have the complementary risk that they rely heavily on the accuracy of the model? As you say on page 160 of The Book of Why, “provided the model is correct”.

**ANSWER 3:**

When the tasks are purely predictive, model-based methods are indeed not immediately necessary and deep neural networks perform surprisingly well. This is level-1 (associational) in the Ladder of Causation described in The Book of Why. In tasks involving interventions, however (level-2 of the Ladder), model-based methods become a necessity. There is no way to predict the effect of policy interventions (or treatments) unless we are in possession of either causal assumptions or controlled randomized experiments employing identical interventions. In such tasks, and absent controlled experiments, reliance on the accuracy of the model is inevitable, and the best we can do is to make the model transparent, so that its accuracy can be (1) tested for compatibility with data and/or (2) judged by experts as well as policy makers and/or (3) subjected to sensitivity analysis.

A major reason why statisticians are reluctant to state and rely on untestable modeling assumptions stems from lack of training in managing such assumptions, however plausible. Even stating such unassailable assumptions as “symptoms do not cause diseases” or “drugs do not change a patient’s sex” requires a vocabulary that is not familiar to the great majority of living statisticians. Things become worse in the potential outcome framework, where such assumptions resist intuitive interpretation, let alone judgment of plausibility.

It is important at this point to go back and qualify my assertion that causal models are not necessary for purely predictive tasks. Many tasks that, at first glance, appear to be predictive turn out to require causal analysis. A simple example is the problem of external validity or inference across populations. Differences among populations are very similar to differences induced by interventions, hence methods of transporting information from one population to another can leverage all the tools developed for predicting effects of interventions. A similar transfer applies to missing data analysis, traditionally considered a statistical problem. Not so. It is inherently a causal problem since modeling the reason for missingness is crucial for deciding how we can recover from missing data. Indeed, modern methods of missing data analysis employing causal diagrams are able to recover statistical and causal relationships that purely statistical methods have failed to recover.

**QUESTION 4:**

In a related vein, the “backdoor” and “frontdoor” adjustments and criteria described in the book are very elegant ways of extracting causal information from arrow diagrams. They permit causal information to be obtained from observational data, provided, that is, that the arrow diagram accurately represents the relationships between all the relevant variables. So doesn’t valid application of this elegant calculus depend critically on the accuracy of the base diagram?

**ANSWER 4:**

Of course. But as we have agreed above, EVERY exercise in causal inference “depends critically on the accuracy” of the theoretical assumptions we make. Our choice is whether to make these assumptions transparent, namely, in a form that allows us to scrutinize their veracity, or bury those assumptions in cryptic notation that prevents scrutiny.

In a similar vein, I must modify your opening statement, which described the “backdoor” and “frontdoor” criteria as “elegant ways of extracting causal information from arrow diagrams.” A more accurate description would be “…extracting causal information from rudimentary scientific knowledge.” The diagrammatic description of these criteria enhances, rather than restricts their range of applicability. What these criteria in fact do is extract quantitative causal information from conceptual understanding of the world; arrow diagrams simply represent the extent to which one has or does not have such understanding. Avoiding graphs conceals what knowledge one has, as well as what doubts one entertains.
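For readers unfamiliar with the two adjustments, their standard forms are written out below. The symbols are my own choice for this note: \(X\) is the treatment, \(Y\) the outcome, \(Z\) a set of variables satisfying the backdoor criterion relative to \((X, Y)\), and \(M\) a mediator satisfying the frontdoor criterion.

\[
P(y \mid do(x)) = \sum_{z} P(y \mid x, z)\, P(z) \qquad \text{(backdoor adjustment)}
\]

\[
P(y \mid do(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x') \qquad \text{(frontdoor adjustment)}
\]

Both right-hand sides contain only quantities estimable from observational data. Whether \(Z\) or \(M\) qualifies is not something the data can tell us; it is read off the diagram, which is where the assumptions are laid open to scrutiny.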

**QUESTION 5:**

You say, in The Book of Why (p5-6) that the development of statistics led it to focus “exclusively on how to summarise data, not on how to interpret it.” It’s certainly true that when the Royal Statistical Society was established it focused on “procuring, arranging, and publishing ‘Facts calculated to illustrate the Condition and Prospects of Society’,” and said that “the first and most essential rule of its conduct [will be] to exclude carefully all Opinions from its transactions and publications.” But that was in the 1830s, and things have moved on since then. Indeed, to take one example, clinical trials were developed in the first half of the Twentieth Century and have a history stretching back even further. The discipline might have been slow to get off the ground in tackling causal matters, but surely things have changed and a very great deal of modern statistics is directly concerned with causal matters – think of risk factors in epidemiology or manipulation in experiments, for example. So aren’t you being a little unfair to the modern discipline?

**ANSWER 5:**

Ronald Fisher’s manifesto, in which he pronounced that “the object of statistical methods is the reduction of data” was published in 1922, not in the 19th century (Fisher 1922). Data produced in clinical trials have been the only data that statisticians recognize as legitimate carriers of causal information, and our book devotes a whole chapter to this development. With the exception of this singularity, however, the bulk of mainstream statistics has been glaringly disinterested in causal matters. And I base this observation on three faithful indicators: statistics textbooks, curricula at major statistics departments, and published texts of Presidential Addresses in the past two decades. None of these sources can convince us that causality is central to statistics.

Take any book on the history of statistics, and check if it considers causal analysis to be of primary concern to the leading players in 20th century statistics. For example, Stigler’s The Seven Pillars of Statistical Wisdom (2016) barely makes a passing remark to two (hardly known) publications in causal analysis.

I am glad you mentioned epidemiologists’ analysis of risk factors as an example of modern interest in causal questions. Unfortunately, epidemiology is not representative of modern statistics. In fact, epidemiology is the one field where causal diagrams have become a second language, contrary to mainstream statistics, where causal diagrams are still taboo (e.g., Efron and Hastie 2016; Gelman and Hill, 2007; Imbens and Rubin 2015; Witte and Witte, 2017).

When an academic colleague asks me “Aren’t you being a little unfair to our discipline, considering the work of so and so?”, my answer is “Must we speculate on what ‘so and so’ did? Can we discuss the causal question that YOU have addressed in class in the past year?” The conversation immediately turns realistic.

**QUESTION 6:**

Isn’t the notion of intervening through randomisation still the gold standard for establishing causality?

**ANSWER 6:**

It is. Although in practice, the hegemony of randomized trials is being contested by alternatives. Randomized trials suffer from incurable problems such as selection bias (recruited subjects are rarely representative of the target population) and lack of transportability (results are not applicable when populations change). The new calculus of causation helps us overcome these problems, thus achieving greater overall credibility; after all, observational studies are conducted in the natural habitat of the target population.

**QUESTION 7:**

What would you say are the three most important ideas in your approach? And what, in particular, would you like readers of The Book of Why to take away from the book?

**ANSWER 7:**

The three most important ideas in the book are: (1) Causal analysis is easy, but requires causal assumptions (or experiments) and those assumptions require a new mathematical notation, and a new calculus. (2) The Ladder of Causation, consisting of (i) association (ii) interventions and (iii) counterfactuals, is the Rosetta Stone of causal analysis. To answer a question at level (x) we must have assumptions at level (x) or higher. (3) Counterfactuals emerge organically from basic scientific knowledge and, when represented in graphs, yield transparency, testability and a powerful calculus of cause and effect. I must add a fourth takeaway: (4) To appreciate what modern causal analysis can do for you, solve one toy problem from beginning to end; it would tell you more about statistics and causality than dozens of scholarly articles laboring to overview statistics and causality.

**REFERENCES**

Efron, B. and Hastie, T., Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, New York, NY: Cambridge University Press, 2016.

Fisher, R., “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London, Series A 222, 311, 1922.

Gelman, A. and Hill, J., Data Analysis Using Regression and Multilevel/Hierarchical Models, New York: Cambridge University Press, 2007.

Imbens, G.W. and Rubin, D.B., Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction, Cambridge, MA: Cambridge University Press, 2015.

Witte, R.S. and Witte, J.S., Statistics, 11th edition, Hoboken, NJ: John Wiley & Sons, Inc. 2017.

*Dear HAI Fellows,*

*I was unable to attend our virtual conference on “COVID-19 and AI”, but I feel an obligation to share with you a couple of ideas on how AI can offer new insights and new technologies to help in pandemic situations like the one we are facing.*

*I will describe them briefly below, with the hope that you can discuss them further with colleagues, students, and health-care agencies, whenever opportunities avail themselves.*

*1. Data interpreting vs. Data Fitting*

*————–*

*Much has been said about how ill-prepared our health-care system was/is to cope with catastrophic outbreaks like COVID-19. The ill-preparedness, however, was also a failure of information technology to keep track of and interpret the vast amount of data that have arrived from multiple heterogeneous sources, corrupted by noise and omission, some by sloppy collection and some by deliberate misreporting. AI is in a unique position to equip society with intelligent data-interpreting technology to cope with such situations.*

*Speaking from my narrow corner of causal inference research, a solid theoretical underpinning of this data fusion problem has been developed in the past decade (summarized in this PNAS paper* https://ucla.in/2Jc1kdD*), and is waiting to be operationalized by practicing professionals and information management organizations.*

*A system based on data fusion principles should be able to attribute disparities between Italy and China to differences in political leadership, reliability of tests and honesty in reporting, adjust for such differences, and infer behavior in countries like Spain or the US. AI is in a position to develop a data-interpreting technology on top of the data-fitting technology currently in use.*

*2. Personalized care and counterfactual analysis*

*————–*

*Many current health-care methods and procedures are guided by population data, obtained from controlled or observational studies. However, the task of going from these data to the level of individual behavior requires counterfactual logic, such as the one formalized and “algorithmitized” by AI researchers in the past three decades.*

*One area where this development can assist the COVID-19 efforts concerns the question of prioritizing patients who are in “greatest need” for treatment, testing, or other scarce resources. “Need” is a counterfactual notion (i.e., invoking iff conditionals) that cannot be captured by statistical methods alone. A recently posted blog page https://ucla.in/39Ey8sU demonstrates in vivid colors how counterfactual analysis handles this prioritization problem.*

*Going beyond priority assignment, we should keep in mind that the entire enterprise known as “personalized medicine” and, more generally, any enterprise requiring inference from populations to individuals, rests on counterfactual analysis. AI now holds the most advanced tools for operationalizing this analysis.*

*Let us add these two methodological capabilities to the ones discussed in the virtual conference on “COVID-19 and AI.” AI should prepare society to cope with the next information tsunami.*

*Best wishes,*

*Judea*

**Scott Mueller and Judea Pearl**

With COVID-19 among us, our thoughts naturally lead to people in greatest need of treatment (or test) and the scarcity of hospital beds and equipment necessary to treat those people. What does “in greatest need” mean? This is a counterfactual notion. People who are most in need have the highest probability of *both* survival if treated and death if not treated. This is materially different from the probability of survival if treated. The people who will survive if treated include those who would survive even if untreated. We want to focus treatment on people who need treatment the most, not the people who will survive regardless of treatment.

Imagine that a treatment for COVID-19 affects men and women differently. Two patients arrive in your emergency room testing positive for COVID-19, a man and a woman. Which patient is most in need of this treatment? That depends, of course, on the data we have about men and women.

A Randomized Controlled Trial (RCT) is conducted for men, and another one for women. It turns out that men recover \(57\%\) of the time when treated and only \(37\%\) of the time when not treated. Women, on the other hand, recover \(55\%\) of the time when treated and \(45\%\) of the time when not treated. We might be tempted to conclude that, since the treatment is more effective among men than women, \(20\) compared to \(10\) percentage points, men benefit more from the treatment and, therefore, when resources are limited, men are in greater need of those resources than women. But things are not that simple, especially when the treatment is suspected of causing fatal complications in some patients.

Let us examine the data for men and ask what it tells us about the number that truly *benefit* from the treatment. It turns out that the data can be interpreted in a variety of ways. In one extreme interpretation, the \(20\%\) difference between the treated and untreated amounts to saving the lives of \(20\%\) of the patients who would have died otherwise. In the second extreme interpretation, the treatment saved the lives of *all* \(57\%\) of those who recovered, and actually killed \(37\%\) of other patients; they would have recovered otherwise, as did the \(37\%\) recoveries in the control group. Thus the percentage of men saved by the treatment could be anywhere between \(20\%\) and \(57\%\), quite a sizable range.

Applying the same reasoning to the women’s data, we find an even wider range. In the first extreme interpretation, \(10\%\) out of \(55\%\) recoveries were saved by the treatment and \(45\%\) would recover anyhow. In the second extreme interpretation, all \(55\%\) of the treated recoveries were saved by the treatment while \(45\%\) were killed by it.

Summarizing, the percentage of beneficiaries may be, for men, anywhere from \(20\%\) to \(57\%\), while for women, anywhere from \(10\%\) to \(55\%\). It should start to be clear now why it’s *not* so clear that the treatment cures more men than women. Looking at the two intervals in figure 1 below, it is quite possible that as much as \(55\%\) of the women and only \(20\%\) of the men would actually benefit from the treatment.
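The two intervals can be reproduced in a few lines. This is a minimal sketch, assuming the only inputs are the two recovery rates from each RCT: the lower bound is the difference in recovery rates (floored at zero) and the upper bound is the smaller of the treated recovery rate and the untreated death rate. The function name `rct_benefit_bounds` is a hypothetical helper introduced here for illustration.

```python
def rct_benefit_bounds(p_rec_treated, p_rec_untreated):
    """Bounds on the fraction who recover if treated AND die if untreated,
    using only the two recovery rates measured in an RCT."""
    lower = max(0.0, p_rec_treated - p_rec_untreated)
    upper = min(p_rec_treated, 1.0 - p_rec_untreated)
    return lower, upper


# Recovery rates quoted in the text: (treated, untreated).
for group, treated, untreated in [("men", 0.57, 0.37), ("women", 0.55, 0.45)]:
    lo, hi = rct_benefit_bounds(treated, untreated)
    print(f"{group}: between {lo:.0%} and {hi:.0%} benefit")
```

The intervals overlap almost entirely, which is why the RCT findings alone cannot rank the two groups.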

One might be tempted to argue that men are still in greater need because the *guarantee* for curing a man is higher than that of a woman (\(20\%\) vs \(10\%\)), but that argument would neglect the other possibilities in the spectrum. For example, the possibility that exactly \(20\%\) of men benefit from the treatment and exactly \(55\%\) of women benefit, which would reverse our naive conclusion that men should be preferred.

Such coincidences may appear unlikely at first glance, but we will show below that they can occur and, more remarkably, that we can determine when they occur given additional data. But first let us display the extent to which RCTs can lead us astray.

Below is an interactive plot that displays the range of possibilities for *every* RCT finding. It uses the following nomenclature. Let \(Y\) represent the outcome variable, with \(y = \text{recovery}\) and \(y’ = \text{death}\), and \(X\) represent the treatment variable, with \(x = \text{treated}\) and \(x’ = \text{not treated}\). We denote by \(y_x\) the event of recovery for a treated individual and by \(y_{x’}\) the event of recovery for an untreated individual. Similarly, \(y’_x\) and \(y’_{x’}\) represent the event of death for a treated and an untreated individual, respectively.

Going now to probabilities under experimental conditions, let us denote by \(P(y_x)\) the probability of recovery for an individual in the experimental treatment arm and by \(P(y’_{x’})\) the probability of death for an individual in the control (placebo) arm. “In need” or “cure” stands for the conjunction of the two events \(y_x\) and \(y’_{x’}\), namely, recovery upon treatment and death under no treatment. Accordingly, the probability of benefiting from treatment is equal to \(P(y_x, y’_{x’})\), i.e., the probability that an individual will recover if treated *and* die if not treated. This quantity is also known as the probability of necessity and sufficiency, denoted PNS in (Tian and Pearl, 2000) since the joint event \((y_x, y’_{x’})\) describes a treatment that is both necessary and sufficient for recovery. Another way of writing this quantity is \(P(y_x > y_{x’})\).

We are now ready to visualize these probabilities:

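[Interactive plot: bounds on \(P(y_x > y_{x’})\) for every pair of experimental findings \(P(y_x)\) and \(P(y_{x’})\). Hovering over a point displays the selected pair, the resulting bounds, and the width of their range.]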

Let’s first see what the RCT findings above tell us about PNS (or \(P(y_x > y_{x’})\)) — the probability that the treatment benefited men and women. Click the checkbox, “Display data when hovering”. For men, \(57\%\) recovered under treatment and \(37\%\) recovered under no treatment, so hover your mouse or touch the screen where \(P(y_x)\) is \(0.57\) and \(P(y_{x’})\) is \(0.37\). The popup bubble will display \(0.2 \leqslant P(y_x > y_{x’}) \leqslant 0.57\). This means the probability of the treatment curing or benefiting men is between \(20\%\) and \(57\%\), matching our discussion above. Tracing women’s probabilities similarly yields the probability of the treatment curing or benefiting women is between \(10\%\) and \(55\%\).

We still can’t determine who is in more need of treatment, the male patient or the female patient, and naturally, we may ask whether the uncertainty in the PNS of the two groups can somehow be reduced by additional data. Remarkably, the answer is positive, if we could also observe patients’ responses under non-experimental conditions, that is, when they are given free choice on whether to undergo treatment or not. The reason why data taken under uncontrolled conditions can provide counterfactual information about individual behavior is discussed in (Pearl, 2009, Section 9.3.4). At this point we will simply display the extent to which the added data narrows the uncertainties about PNS.

Let’s assume we observe that men choose treatment \(40\%\) of the time and men never recover when they choose treatment or when they choose no treatment (men make poor choices). Click the “Observational data” checkbox and move the sliders for \(P(x)\), \(P(y|x)\), and \(P(y|x’)\) to \(0.4\), \(0\), and \(0\), respectively. Now when hovering or touching the location where \(P(y_x)\) is \(0.57\) and \(P(y_{x’})\) is \(0.37\), the popup bubble reveals \(0.57 \leqslant P(y_x > y_{x’}) \leqslant 0.57\). This tells us that exactly \(57\%\) of men will benefit from treatment.

We can also get exact results about women. Let’s assume that women choose treatment \(45\%\) of the time, and that they recover \(100\%\) of the time when they choose treatment (women make excellent choices when choosing treatment), and never recover when they choose no treatment (women make poor choices when choosing no treatment). This time move the sliders for \(P(x)\), \(P(y|x)\), and \(P(y|x’)\) to \(0.45\), \(1\), and \(0\), respectively. Clicking on the “Benefit” radio button and tracing where \(P(y_x)\) is \(0.55\) and \(P(y_{x’})\) is \(0.45\) yields the probability that women benefit from treatment as exactly \(10\%\).

We now know for sure that a man has a \(57\%\) chance of benefiting compared to \(10\%\) for women.

The display permits us to visualize the resultant (ranges of) PNS for any combination of controlled and uncontrolled data. The former are characterized by the two parameters \(P(y_x)\) and \(P(y_{x’})\), and the latter by the three parameters \(P(x)\), \(P(y|x)\), and \(P(y|x’)\). Note that, in our example, different data from observational studies could have reversed our conclusion by proving that women are more likely to benefit from treatment than men. This would be the case, for example, if men made excellent choices when choosing treatment (\(P(y|x) = 1\)) and women made poor choices when choosing treatment (\(P(y|x) = 0\)); men would then have a \(20\%\) chance of benefiting compared to \(55\%\) for women.
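The numbers read off the plot can also be checked offline. The sketch below assumes the plot implements the bounds of Tian and Pearl (2000) for combined experimental and observational data, written out here from that paper rather than from the plot's own code. The function name `pns_bounds` and its argument names are hypothetical. In the two "reversed" scenarios we assume \(P(y|x’)\) stays at \(0\), which the text does not state explicitly.

```python
def pns_bounds(p_yx, p_yx0, p_x, p_y_given_x, p_y_given_x0):
    """Bounds on P(benefit) = P(y_x, y'_{x'}).

    p_yx, p_yx0: experimental recovery rates (treated, untreated).
    p_x, p_y_given_x, p_y_given_x0: observational P(x), P(y|x), P(y|x').
    """
    # Joint observational probabilities.
    p_xy = p_x * p_y_given_x                 # chose treatment, recovered
    p_xy0 = p_x * (1 - p_y_given_x)          # chose treatment, died
    p_x0y = (1 - p_x) * p_y_given_x0         # declined treatment, recovered
    p_x0y0 = (1 - p_x) * (1 - p_y_given_x0)  # declined treatment, died
    p_y = p_xy + p_x0y

    lower = max(0.0, p_yx - p_yx0, p_y - p_yx0, p_yx - p_y)
    upper = min(p_yx, 1 - p_yx0, p_xy + p_x0y0, p_yx - p_yx0 + p_xy0 + p_x0y)
    return lower, upper


scenarios = {
    "men (never recover by choice)": (0.57, 0.37, 0.40, 0.0, 0.0),
    "women (recover iff they choose treatment)": (0.55, 0.45, 0.45, 1.0, 0.0),
    "men, reversed (P(y|x) = 1)": (0.57, 0.37, 0.40, 1.0, 0.0),
    "women, reversed (P(y|x) = 0)": (0.55, 0.45, 0.45, 0.0, 0.0),
}
for name, args in scenarios.items():
    lo, hi = pns_bounds(*args)
    print(f"{name}: {lo:.2f} <= P(benefit) <= {hi:.2f}")
```

In all four scenarios the lower and upper bounds coincide, so the observational data pin down the probability of benefit exactly.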

[[[For the curious reader, the rectangle labeled “possible region” marks experimental findings \(\{P(y_x), P(y_{x’})\}\) that are compatible with the selected observational parameters \(\{P(x), P(y|x), P(y|x’)\}\). Observations lying outside this region correspond to ill-conducted RCTs, suffering from selection bias, placebo effects, or some other imperfections (see Pearl, 2009, page 294).]]]
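A compatibility check of this kind can be sketched as follows. It assumes the rectangle is given by the constraints \(P(x, y) \leqslant P(y_x) \leqslant 1 - P(x, y’)\) and \(P(x’, y) \leqslant P(y_{x’}) \leqslant 1 - P(x’, y’)\), which is our reading of the cited passage rather than a transcription of the plot's code. The function name `in_possible_region` and the tolerance parameter are hypothetical.

```python
def in_possible_region(p_yx, p_yx0, p_x, p_y_given_x, p_y_given_x0, tol=1e-9):
    """True if experimental findings (p_yx, p_yx0) are compatible with the
    observational parameters P(x), P(y|x), P(y|x')."""
    p_xy = p_x * p_y_given_x
    p_xy0 = p_x * (1 - p_y_given_x)
    p_x0y = (1 - p_x) * p_y_given_x0
    p_x0y0 = (1 - p_x) * (1 - p_y_given_x0)
    # A small tolerance guards against floating-point noise at the edges.
    ok_treated = p_xy - tol <= p_yx <= 1 - p_xy0 + tol
    ok_untreated = p_x0y - tol <= p_yx0 <= 1 - p_x0y0 + tol
    return ok_treated and ok_untreated


print(in_possible_region(0.57, 0.37, 0.40, 0.0, 0.0))  # men's example
print(in_possible_region(0.55, 0.45, 0.45, 1.0, 0.0))  # women's example
print(in_possible_region(0.70, 0.37, 0.40, 0.0, 0.0))  # outside the rectangle
```

The women's example sits exactly on the edge of the rectangle, which is why the tolerance matters in floating-point arithmetic.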

But even when PNS is known precisely, one may still argue that the chance of benefiting is not the only parameter we should consider in allocating hospital beds. The chance for *harming* a patient should be considered too. We can determine what percentage of people will be harmed by the treatment by clicking the “Harm” radio button at the top. This time the popup bubble will show bounds for \(P(y_x < y_{x’})\). This is the probability of harm. For our example data on men (\(P(x) = 0.4\), \(P(y|x) = 0\), and \(P(y|x’) = 0\)), trace the position where \(P(y_x)\) is \(0.57\) and \(P(y_{x’})\) is \(0.37\). You’ll see that exactly \(37\%\) of men will be harmed by the treatment. Next, we can use our example data on women, \(P(x) = 0.45\), \(P(y|x) = 1\), \(P(y|x’) = 0\), \(P(y_x) = 0.55\), and \(P(y_{x’}) = 0.45\). The probability that women are harmed by treatment is, thankfully, \(0\%\).
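Once the probability of benefit is known, the probability of harm follows from simple bookkeeping: every treated recovery is either a "benefit" case or an "always recovers" case, and every untreated recovery is either a "harm" case or an "always recovers" case. Subtracting the two gives \(P(\text{harm}) = P(\text{benefit}) - [P(y_x) - P(y_{x’})]\). The sketch below applies this identity to the numbers above; the function name `harm_from_benefit` is a hypothetical helper.

```python
def harm_from_benefit(p_benefit, p_yx, p_yx0):
    """P(harm) = P(benefit) - (P(y_x) - P(y_x'))."""
    # Clamp at zero so rounding noise cannot produce a tiny negative value.
    return max(0.0, p_benefit - (p_yx - p_yx0))


men_harm = harm_from_benefit(0.57, 0.57, 0.37)
women_harm = harm_from_benefit(0.10, 0.55, 0.45)
print(f"men harmed: {men_harm:.0%}, women harmed: {women_harm:.0%}")
```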

What do we do now? We have a conflict between benefit and harm considerations. One solution is to quantify the benefit to society for each person saved versus each person killed. Let’s say the benefit to society of treating someone who will be cured if and only if treated is \(1\) unit. However, the harm to society of treating someone who will die if and only if treated is \(2\) units. This is because we lost the opportunity to treat someone who would benefit from treatment, we killed someone, and we incurred a loss of trust from this poor decision. Now, the benefit of treatment for men is \(1 \times 0.57 - 2 \times 0.37 = -0.17\) and the benefit of treatment for women is \(1 \times 0.1 - 2 \times 0 = 0.1\). If you were a policy-maker, you would prioritize treating women. Treating men actually yields a negative benefit to society!
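The same arithmetic as a small sketch, so that other weights can be tried. The weights of \(1\) and \(2\) units are the illustrative values chosen above, and the function name `net_benefit` is a hypothetical helper.

```python
def net_benefit(p_benefit, p_harm, gain_per_cure=1.0, loss_per_harm=2.0):
    """Expected societal value of treating one member of a group."""
    return gain_per_cure * p_benefit - loss_per_harm * p_harm


men = net_benefit(p_benefit=0.57, p_harm=0.37)
women = net_benefit(p_benefit=0.10, p_harm=0.00)
print(f"men: {men:+.2f}, women: {women:+.2f}")
```

With these numbers the ranking flips back in favor of men only if a harm is weighted at less than about \(1.27\) units, since \(0.57 - w \times 0.37 > 0.1\) requires \(w < 0.47 / 0.37\).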

The above demonstrates how a decision about who is in greatest need, when based on correct counterfactual analysis, can reverse traditional decisions based solely on controlled experiments. The latter, dubbed A/B in the literature, estimates the efficacy of a treatment averaged over an entire population while the former unravels individual behavior as well. The problem of prioritizing patients for treatment demands knowledge of individual behavior under two parallel and incompatible worlds, treatment and non-treatment, and must therefore invoke counterfactual analysis. A complete analysis of counterfactual-based optimization of unit selection is presented in (Li and Pearl, 2019).

- Ang Li and Judea Pearl. Unit selection based on counterfactual logic. *Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence*, pages 1793–1799, 2019. [Online]. Available: https://ftp.cs.ucla.edu/pub/stat_ser/r488-reprint.pdf. [Accessed April 4, 2020].
- Judea Pearl. *Causality*. Cambridge University Press, 2009.
- Jin Tian and Judea Pearl. Probabilities of causation: Bounds and identification. *Annals of Mathematics and Artificial Intelligence*, 28:287–313, 2000. [Online]. Available: https://ftp.cs.ucla.edu/pub/stat_ser/r271-A.pdf. [Accessed April 4, 2020].

*Journal of Causal Inference*

by Judea Pearl

__Introduction__

This collection of 14 short articles represents adventurous ideas and semi-heretical thoughts that emerged when, in 2013, I was given the opportunity to edit a fun section of the *Journal of Causal Inference* called “Causal, Casual, and Curious.”

This direct contact with readers, unmediated by editors or reviewers, had a healthy liberating effect on me and has unleashed some of my best, perhaps most mischievous explorations. I thank the editors of the *Journal of Causal Inference* for giving me this opportunity to undertake this adventure and for trusting me to manage it as prudently as I could.

May 2013

“Linear Models: A Useful “Microscope” for Causal Analysis,” *Journal of Causal Inference*, 1(1): 155–170, May 2013.

Abstract: This note reviews basic techniques of linear path analysis and demonstrates, using simple examples, how causal phenomena of non-trivial character can be understood, exemplified and analyzed using diagrams and a few algebraic steps. The techniques allow for swift assessment of how various features of the model impact the phenomenon under investigation. This includes: Simpson’s paradox, case-control bias, selection bias, missing data, collider bias, reverse regression, bias amplification, near instruments, and measurement errors.

December 2013

“The Curse of Free-will and the Paradox of Inevitable Regret,” *Journal of Causal Inference*, 1(2): 255-257, December 2013.

Abstract: The paradox described below aims to clarify the principles by which population data can be harnessed to guide personal decision making. The logic that permits us to infer counterfactual quantities from a combination of experimental and observational studies gives rise to situations in which an agent knows he/she will regret whatever action is taken.

March 2014

“Is Scientific Knowledge Useful for Policy Analysis? A Peculiar Theorem says: No,” *Journal of Causal Inference*, 2(1): 109–112, March 2014.

Abstract: Conventional wisdom dictates that the more we know about a problem domain the easier it is to predict the effects of policies in that domain. Strangely, this wisdom is not sanctioned by formal analysis, when the notions of “knowledge” and “policy” are given concrete definitions in the context of nonparametric causal analysis. This note describes this peculiarity and speculates on its implications.

September 2014

“Graphoids over counterfactuals,” *Journal of Causal Inference*, 2(2): 243-248, September 2014.

Abstract: Augmenting the graphoid axioms with three additional rules enables us to handle independencies among observed as well as counterfactual variables. The augmented set of axioms facilitates the derivation of testable implications and ignorability conditions whenever modeling assumptions are articulated in the language of counterfactuals.

March 2015

“Conditioning on Post-Treatment Variables,” *Journal of Causal Inference*, 3(1): 131-137, March 2015. Includes Appendix (appended to published version).

Abstract: In this issue of the Causal, Casual, and Curious column, I compare several ways of extracting information from post-treatment variables and call attention to some peculiar relationships among them. In particular, I contrast do-calculus conditioning with counterfactual conditioning and discuss their interpretations and scopes of applications. These relationships have come up in conversations with readers, students and curious colleagues, so I will present them in a question–answers format.

September 2015

“Generalizing experimental findings,” *Journal of Causal Inference*, 3(2): 259-266, September 2015.

Abstract: This note examines one of the most crucial questions in causal inference: “How generalizable are randomized clinical trials?” The question has received a formal treatment recently, using a non-parametric setting, and has led to a simple and general solution. I will describe this solution and several of its ramifications, and compare it to the way researchers have attempted to tackle the problem using the language of ignorability. We will see that ignorability-type assumptions need to be enriched with structural assumptions in order to capture the full spectrum of conditions that permit generalizations, and in order to judge their plausibility in specific applications.

March 2016

“The Sure-Thing Principle,” *Journal of Causal Inference*, 4(1): 81-86, March 2016.

Abstract: In 1954, Jim Savage introduced the Sure Thing Principle to demonstrate that preferences among actions could constitute an axiomatic basis for a Bayesian foundation of statistical inference. Here, we trace the history of the principle, discuss some of its nuances, and evaluate its significance in the light of modern understanding of causal reasoning.

September 2016

“Lord’s Paradox Revisited — (Oh Lord! Kumbaya!),” *Journal of Causal Inference*, Published Online 4(2): September 2016.

Abstract: Among the many peculiarities that were dubbed “paradoxes” by well meaning statisticians, the one reported by Frederic M. Lord in 1967 has earned a special status. Although it can be viewed, formally, as a version of Simpson’s paradox, its reputation has gone much worse. Unlike Simpson’s reversal, Lord’s is easier to state, harder to disentangle and, for some reason, it has been lingering for almost four decades, under several interpretations and re-interpretations, and it keeps coming up in new situations and under new lights. Most peculiar yet, while some of its variants have received a satisfactory resolution, the original version presented by Lord, to the best of my knowledge, has not been given a proper treatment, not to mention a resolution.

The purpose of this paper is to trace back Lord’s paradox from its original formulation, resolve it using modern tools of causal analysis, explain why it resisted prior attempts at resolution and, finally, address the general methodological issue of whether adjustments for preexisting conditions is justified in group comparison applications.

March 2017

“A Linear ‘Microscope’ for Interventions and Counterfactuals,” *Journal of Causal Inference*, Published Online 5(1): 1-15, March 2017.

Abstract: This note illustrates, using simple examples, how causal questions of non-trivial character can be represented, analyzed and solved using linear analysis and path diagrams. By producing closed form solutions, linear analysis allows for swift assessment of how various features of the model impact the questions under investigation. We discuss conditions for identifying total and direct effects, representation and identification of counterfactual expressions, robustness to model misspecification, and generalization across populations.

September 2017

“Physical and Metaphysical Counterfactuals,” Revised version, *Journal of Causal Inference*, 5(2): September 2017.

Abstract: The structural interpretation of counterfactuals as formulated in Balke and Pearl (1994a,b) [1, 2] excludes disjunctive conditionals, such as “had \(X\) been \(x_1\) or \(x_2\),” as well as disjunctive actions such as \(do(X = x_1 \text{ or } X = x_2)\). In contrast, the closest-world interpretation of counterfactuals (e.g. Lewis (1973a) [3]) assigns truth values to all counterfactual sentences, regardless of the logical form of the antecedent. This paper leverages “imaging,” a process of “mass-shifting” among possible worlds, to define disjunction in structural counterfactuals. We show that every imaging operation can be given an interpretation in terms of a stochastic policy in which agents choose actions with certain probabilities. This mapping, from the metaphysical to the physical, allows us to assess whether metaphysically-inspired extensions of interventional theories are warranted in a given decision making situation.

March 2018

“What is Gained from Past Learning,” *Journal of Causal Inference*, 6(1), Article 20180005, https://doi.org/10.1515/jci-2018-0005, March 2018.

Abstract: We consider ways of enabling systems to apply previously learned information to novel situations so as to minimize the need for retraining. We show that theoretical limitations exist on the amount of information that can be transported from previous learning, and that robustness to changing environments depends on a delicate balance between the relations to be learned and the causal structure of the underlying model. We demonstrate by examples how this robustness can be quantified.

September 2018

“Does Obesity Shorten Life? Or is it the Soda? On Non-manipulable Causes,” *Journal of Causal Inference*, 6(2), online, September 2018.

Abstract: Non-manipulable factors, such as gender or race have posed conceptual and practical challenges to causal analysts. On the one hand these factors do have consequences, and on the other hand, they do not fit into the experimentalist conception of causation. This paper addresses this challenge in the context of public debates over the health cost of obesity, and offers a new perspective, based on the theory of Structural Causal Models (SCM).

March 2019

“On the interpretation of do(x),” *Journal of Causal Inference*, 7(1), online, March 2019.

Abstract: This paper provides empirical interpretation of the *do(x)* operator when applied to non-manipulable variables such as race, obesity, or cholesterol level. We view *do(x)* as an ideal intervention that provides valuable information on the effects of manipulable variables and is thus empirically testable. We draw parallels between this interpretation and ways of enabling machines to learn effects of untried actions from those tried. We end with the conclusion that researchers need not distinguish manipulable from non-manipulable variables; both types are equally eligible to receive the *do(x)* operator and to produce useful information for decision makers.

September 2019

“Sufficient Causes: On Oxygen, Matches, and Fires,” *Journal of Causal Inference*, AOP, https://doi.org/10.1515/jci-2019-0026, September 2019.

Abstract: We demonstrate how counterfactuals can be used to compute the probability that one event was/is a sufficient cause of another, and how counterfactuals emerge organically from basic scientific knowledge, rather than manipulative experiments. We contrast this demonstration with the potential outcome framework and address the distinction between causes and enablers.


The note below offers brief comments on Imbens’s five major claims regarding the superiority of potential outcomes [PO] vis a vis directed acyclic graphs [DAGs].

These five claims are articulated in Imbens’s introduction (pages 1-3). [Quoting]:

“… there are five features of the PO framework that may be behind its current popularity in economics.”

I will address them sequentially, first quoting Imbens’s claims, then offering my counterclaims.

I will end with a comment on Imbens’s final observation, concerning the absence of empirical evidence in a “realistic setting” to demonstrate the merits of the DAG approach.

Before we start, however, let me clarify that there is no such thing as a “DAG approach.” Researchers using DAGs follow an approach called Structural Causal Model (SCM), which consists of functional relationships among variables of interest, and of which DAGs are merely a qualitative abstraction, spelling out the arguments in each function. The resulting graph can then be used to support inference tools such as d-separation and do-calculus. Potential outcomes are relationships *derived* from the structural model and several of their properties can be elucidated using DAGs. These interesting relationships are summarized in chapter 7 of (Pearl, 2009a) and in a *Statistics Surveys* overview (Pearl, 2009c).

Imbens’s Claim # 1

*“First, there are some assumptions that are easily captured in the PO framework relative to the DAG approach, and these assumptions are critical in many identification strategies in economics. Such assumptions include monotonicity ([Imbens and Angrist, 1994]) and other shape restrictions such as convexity or concavity ([Matzkin et al.,1991, Chetverikov, Santos, and Shaikh, 2018, Chen, Chernozhukov, Fernández-Val, Kostyshak, and Luo, 2018]). The instrumental variables setting is a prominent example, and I will discuss it in detail in Section 4.2.”*

Pearl’s Counterclaim # 1

It is logically impossible for an assumption to be “easily captured in the PO framework” and not simultaneously be “easily captured” in the “DAG approach.” The reason is simply that the latter embraces the former and merely enriches it with graph-based tools. Specifically, SCM embraces the counterfactual notation \(Y_x\) that PO deploys, and does not exclude any concept or relationship definable in the PO approach.

Take monotonicity, for example. In PO, monotonicity is expressed as

\(Y_x(u) \geq Y_{x'}(u)\) for all \(u\) and all \(x > x'\)

In the DAG approach it is expressed as:

\(Y_x(u) \geq Y_{x'}(u)\) for all \(u\) and all \(x > x'\)

(Taken from Causality pages 291, 294, 398.)

The two are identical, of course, which may seem surprising to PO folks, but not to DAG folks who know how to derive the counterfactuals \(Y_x\) from structural models. In fact, the derivation of counterfactuals in terms of structural equations (Balke and Pearl, 1994) is considered one of the fundamental laws of causation in the SCM framework; see (Bareinboim and Pearl, 2016) and (Pearl, 2015).
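To make the derivation concrete, here is a minimal sketch in a toy setting with a single structural equation \(Y = f(X, U)\). In that special case \(Y_x(u)\) is obtained simply by evaluating \(f\) with \(X\) held fixed at \(x\); richer models require solving the whole modified system of equations. The two example mechanisms and every function name below are hypothetical and serve only as illustration.

```python
from itertools import product


def potential_outcome(f, x, u):
    """Y_x(u): evaluate the structural equation with X held fixed at x."""
    return f(x, u)


def is_monotonic(f, x_values, u_values):
    """Check that Y_x(u) >= Y_x'(u) for all u and all x > x'."""
    return all(
        potential_outcome(f, x_hi, u) >= potential_outcome(f, x_lo, u)
        for u in u_values
        for x_lo in x_values
        for x_hi in x_values
        if x_hi > x_lo
    )


def helps_or_spontaneous(x, u):
    """Recovery occurs if treatment works (x and u1) or spontaneously (u2)."""
    u1, u2 = u
    return int((x and u1) or u2)


def flips(x, u):
    """Treatment helps units with u1 = 0 and hurts units with u1 = 1."""
    u1, _ = u
    return int(x != u1)


u_values = list(product([0, 1], repeat=2))
print(is_monotonic(helps_or_spontaneous, [0, 1], u_values))
print(is_monotonic(flips, [0, 1], u_values))
```

The first mechanism satisfies monotonicity and the second violates it. The check runs on counterfactuals computed from the structural equation, with no separate primitive needed.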

Imbens’s Claim # 2

Pearl’s Counterclaim #2

Not so. The term “potential outcome” is a latecomer to the economics literature of the 20th century, whose native vocabulary and natural primitives were functional relationships among variables, not potential outcomes. The latter are defined in terms of a “treatment assignment” and a hypothetical outcome, while the former invoke only observable variables like “supply” and “demand”. Don Rubin cited this fundamental difference as sufficient reason for shunning structural equation models, which he labeled “bad science.”

While it is possible to give a PO interpretation to structural equations, the interpretation is both artificial and convoluted, especially in view of PO’s insistence on manipulability of causes. Haavelmo, Koopmans and Marschak would not hesitate for a moment to write the structural equation:

*Damage = f (earthquake intensity, other factors).*

PO researchers, on the other hand, would spend weeks debating whether earthquakes have “treatment assignments” and whether we can legitimately estimate the “causal effects” of earthquakes. Thus, what Imbens perceives as a helpful distinction is, in fact, an unnecessary restriction that suppresses natural scientific discourse. See also (Pearl, 2018; 2019).

Imbens’s Claim #3

*“Third, many of the currently popular identification strategies focus on models with relatively few (sets of) variables, where identification questions have been worked out once and for all.”*

Pearl’s Counterclaim #3

First, I would argue that this claim is actually false. Most IV strategies that economists use are valid “conditional on controls” (see examples listed in Imbens (2014)), and the criterion that distinguishes “good controls” from “bad controls” is not trivial to articulate without the help of graphs. (See A Crash Course in Good and Bad Control.) It certainly cannot be discerned “once and for all”.

Second, even if economists are lucky enough to guess “good controls,” it is still unclear whether they focus on relatively few variables because, lacking graphs, they cannot handle more variables, or whether they refrain from using graphs to hide the opportunities missed by focusing on a few pre-fabricated, “once and for all” identification strategies.

I believe both apprehensions play a role in perpetuating the graph-avoiding subculture among economists. I have elaborated on this question here: (Pearl, 2014).

Imbens’s Claim # 4

*“Fourth, the PO framework lends itself well to accounting for treatment effect heterogeneity in estimands ([Imbens and Angrist, 1994, Sekhon and Shem-Tov, 2017]) and incorporating such heterogeneity in estimation and the design of optimal policy functions ([Athey and Wager, 2017, Athey, Tibshirani, Wager, et al., 2019, Kitagawa and Tetenov, 2015]).”*

Pearl’s Counterclaim #4

Indeed, in the early 1990s, economists felt ecstatic liberating themselves from the linear tradition of structural equation models and finding a framework (PO) that allowed them to model treatment effect heterogeneity.

However, whatever role treatment heterogeneity played in this excitement should have been amplified ten-fold in 1995, when completely non-parametric structural equation models came into being, in which non-linear interactions and heterogeneity were assumed a priori. Indeed, the tools developed in the econometric literature cover only a fraction of the treatment-heterogeneity tasks that are currently managed by SCM. In particular, the latter includes such problems as “necessary and sufficient” causation, mediation, external validity, selection bias and more.

Speaking more generally, I find it odd for a discipline to prefer an “approach” that rejects tools over one that invites and embraces tools.

Imbens’s Claim #5

*“Fifth, the PO approach has traditionally connected well with design, estimation, and inference questions. From the outset Rubin and his coauthors provided much guidance to researchers and policy makers for practical implementation including inference, with the work on the propensity score ([Rosenbaum and Rubin, 1983b]) an influential example.”*

Pearl’s Counterclaim #5

The initial work of Rubin and his co-authors has indeed provided much needed guidance to researchers and policy makers who were in a state of desperation, having no other mathematical notation to express causal questions of interest. That happened because economists were not aware of the counterfactual content of structural equation models, and of the non-parametric extension of those models.

Unfortunately, the clumsy and opaque notation introduced in this initial work has become a ritual in the PO framework that has prevailed, and the refusal to commence the analysis with meaningful assumptions has led to several blunders and misconceptions. One such misconception has been propensity score analysis which researchers have taken as a tool for reducing confounding bias. I have elaborated on this misguidance in *Causality*, Section 11.3.5, “Understanding Propensity Scores” (Pearl, 2009a).

Imbens’s final observation: Empirical Evidence

*“Separate from the theoretical merits of the two approaches, another reason for the lack of adoption in economics is that the DAG literature has not shown much evidence of the benefits for empirical practice in settings that are important in economics. The potential outcome studies in MACE, and the chapters in [Rosenbaum, 2017], CISSB and MHE have detailed empirical examples of the various identification strategies proposed. In realistic settings they demonstrate the merits of the proposed methods and describe in detail the corresponding estimation and inference methods. In contrast in the DAG literature, TBOW, [Pearl, 2000], and [Peters, Janzing, and Schölkopf, 2017] have no substantive empirical examples, focusing largely on identification questions in what TBOW refers to as “toy” models. Compare the lack of impact of the DAG literature in economics with the recent embrace of regression discontinuity designs imported from the psychology literature, or with the current rapid spread of the machine learning methods from computer science, or the recent quick adoption of synthetic control methods [Abadie, Diamond, and Hainmueller, 2010]. All came with multiple concrete examples that highlighted their benefits over traditional methods. In the absence of such concrete examples the toy models in the DAG literature sometimes appear to be a set of solutions in search of problems, rather than a set of solutions for substantive problems previously posed in social sciences.”*

Pearl’s comments on: Empirical Evidence

There is much truth to Imbens’s observation. The PO excitement that swept natural experimentalists in the 1990s came with outright rejection of graphical models. The hundreds, if not thousands, of empirical economists who plunged into empirical work, were warned repeatedly that graphical models may be “ill-defined,” “deceptive,” and “confusing,” and structural models have no scientific underpinning (see (Pearl, 1995; 2009b)). Not a single paper in the econometric literature has acknowledged the existence of SCM as an alternative or complementary approach to PO.

The result has been the exact opposite of what has taken place in epidemiology, where DAGs became a second language to both scholars and field workers [due in part to the influential 1999 paper by Greenland, Pearl and Robins]. In contrast, PO-led economists have launched a massive array of experimental programs lacking graphical tools for guidance. I would liken it to a Phoenician armada exploring the Atlantic coast in leaky boats with no compass to guide its way.

This depiction might seem pretentious and overly critical, considering the pride that natural experimentalists take in the results of their studies (though no objective verification of validity can be undertaken). Yet looking back at the substantive empirical examples listed by Imbens, one cannot but wonder how much more credible those studies could have been with graphical tools to guide the way. These tools include a friendly language to communicate assumptions, powerful means to test their implications, and ample opportunities to uncover new natural experiments (Brito and Pearl, 2002).

Summary and Recommendation

The thrust of my reaction to Imbens’s article is simple:

*It is unreasonable to prefer an “approach” that rejects tools over one that invites and embraces tools.*

Technical comparisons of the PO and SCM approaches, using concrete examples, have been published since 1993 in dozens of articles and books in computer science, statistics, epidemiology, and social science, yet none in the econometric literature. Economics students are systematically deprived of even the most elementary graphical tools available to other researchers, for example, to determine if one variable is independent of another given a third, or if a variable is a valid IV given a set *S* of observed variables.
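To make the two elementary tasks just mentioned concrete, here is a minimal Python sketch of how they can be mechanized on a toy diagram. Everything in it is an illustrative assumption rather than anything taken from the papers under discussion: the graph (Z → X → Y with an unobserved confounder U of X and Y), the function names `ancestors_of`, `d_separated`, and `is_instrument`, and the use of the standard "moralized ancestral graph" test for d-separation. The instrument check follows one common graphical formulation (Z is separated from Y given S once the X → Y arrow is removed, and Z is not separated from X given S), and it assumes S contains no descendants of Y.

```python
from itertools import combinations


def ancestors_of(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def d_separated(parents, x, y, given=()):
    """True if x and y are d-separated by `given` in the DAG `parents`.

    `parents` maps each node to a list of its parents (hypothetical format).
    Assumes x and y are not themselves in `given`.
    """
    given = set(given)
    # Step 1: keep only x, y, the conditioning set, and their ancestors.
    keep = ancestors_of(parents, {x, y} | given)
    # Step 2: moralize (link co-parents) and drop edge directions.
    neighbors = {node: set() for node in keep}
    for child in keep:
        pars = list(parents.get(child, ()))
        for p in pars:
            neighbors[child].add(p)
            neighbors[p].add(child)
        for a, b in combinations(pars, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    # Step 3: search for a path from x to y that avoids the conditioning set.
    seen, stack = set(), [x]
    while stack:
        node = stack.pop()
        if node == y:
            return False
        if node not in seen:
            seen.add(node)
            stack.extend(neighbors[node] - given - seen)
    return True


def is_instrument(parents, z, x, y, given=()):
    """Graphical instrument check; assumes `given` has no descendants of y."""
    # Copy of the graph with the arrow x -> y removed.
    pruned = {child: [p for p in pars if not (child == y and p == x)]
              for child, pars in parents.items()}
    return (d_separated(pruned, z, y, given)
            and not d_separated(parents, z, x, given))


# Toy graph (an assumption for illustration): Z -> X -> Y, U -> X, U -> Y.
graph = {"X": ["Z", "U"], "Y": ["X", "U"]}

print(d_separated(graph, "Z", "U"))               # True: no open path
print(d_separated(graph, "Z", "U", given=["X"]))  # False: X is a collider
print(is_instrument(graph, "Z", "X", "Y"))        # True
print(is_instrument(graph, "U", "X", "Y"))        # False: U affects Y directly
```

The point of the sketch is only that both questions reduce to a few lines of graph search once the assumptions are drawn as a diagram; a production analysis would rely on an established causal-graph library rather than this hand-rolled version.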

This avoidance can no longer be justified by appealing to “We have not found this [graphical] approach to aid the drawing of causal inferences” (Imbens and Rubin, 2015, page 25).

To open an effective dialogue and a genuine comparison between the two approaches, I call on Professor Imbens to assume leadership in his capacity as Editor in Chief of *Econometrica* and invite a comprehensive survey paper on graphical methods for the front page of his Journal. This is how creative editors move their fields forward.

Brito, C. and Pearl, J. “General instrumental variables,” In A. Darwiche and N. Friedman (Eds.), Uncertainty in Artificial Intelligence, *Proceedings of the Eighteenth Conference*, Morgan Kaufmann: San Francisco, CA, 85-93, August 2002.

Bareinboim, E. and Pearl, J. “Causal inference and the data-fusion problem,” *Proceedings of the National Academy of Sciences*, 113(27): 7345-7352, 2016.

Greenland, S., Pearl, J., and Robins, J. “Causal diagrams for epidemiologic research,” *Epidemiology*, Vol. 10, No. 1, pp. 37-48, January 1999.

Pearl, J. “Causal diagrams for empirical research,” (With Discussions), *Biometrika*, 82(4): 669-710, 1995.

Pearl, J. “Understanding Propensity Scores” in J. Pearl’s *Causality: Models, Reasoning, and Inference*, Section 11.3.5, Second edition, NY: Cambridge University Press, pp. 348-352, 2009a.

Pearl, J. “Myth, confusion, and science in causal analysis,” University of California, Los Angeles, Computer Science Department, Technical Report R-348, May 2009b.

Pearl, J. “Causal inference in statistics: An overview,” *Statistics Surveys*, Vol. 3, 96–146, 2009c.

Pearl, J. “Are economists smarter than epidemiologists? (Comments on Imbens’s recent paper),” *Causal Analysis in Theory and Practice Blog*, October 27, 2014.

Pearl, J. “Trygve Haavelmo and the Emergence of Causal Calculus,” *Econometric Theory*, 31: 152-179, 2015.

Pearl, J. “Does obesity shorten life? Or is it the Soda? On non-manipulable causes,” *Journal of Causal Inference*, Causal, Casual, and Curious Section, 6(2), online, September 2018.

Pearl, J. “On the interpretation of do(x),” *Journal of Causal Inference*, Causal, Casual, and Curious Section, 7(1), online, March 2019.