Causal Analysis in Theory and Practice

July 26, 2020

Radical Empiricism and Machine Learning Research

Filed under: Causal models,Knowledge representation,Machine learning — judea @ 7:02 pm

A speaker at a lecture I attended recently summarized the philosophy of machine learning this way: “All knowledge comes from observed data, some from direct sensory experience and some from indirect experience, transmitted to us either culturally or genetically.”

The statement was taken as self-evident by the audience, and set the stage for a lecture on how the nature of “knowledge” can be analyzed by examining patterns of conditional probabilities in the data. Naturally, it invoked no notions such as “external world,” “theory,” “data generating process,” “cause and effect,” “agency,” or “mental constructs” because, ostensibly, these notions, too, should emerge from the data if needed. In other words, whatever concepts humans invoke in interpreting data, be their origin cultural, scientific or genetic, can be traced to, and re-derived from the original sensory experience that has endowed those concepts with survival value.

Viewed from an artificial intelligence perspective, this data-centric philosophy offers an attractive, if not seductive, agenda for machine learning research: in order to develop human-level intelligence, we should merely retrace the way our ancestors did it, and simulate both genetic and cultural evolution on a digital machine, taking as input all the data we can possibly collect. Taken to extremes, such an agenda may inspire fairly futuristic and highly ambitious scenarios: start with a simple neural network, resembling a primitive organism (say, an amoeba), let it interact with the environment, mutate and generate offspring; given enough time, it will eventually emerge with an Einstein-level intellect. Indeed, ruling out sacred scriptures and divine revelation, where else could Einstein have acquired his knowledge, talents and intellect if not from the stream of raw data that has impinged upon the human race since antiquity, including, of course, all the sensory inputs received by the more primitive organisms that preceded humans?

Before asking how realistic this agenda is, let us preempt the discussion with two observations:

(1) Simulated evolution, in some form or another, is indeed the leading paradigm inspiring most machine learning researchers today, especially those engaged in connectionism, deep learning and neural-network technologies, which deploy model-free, statistics-based learning strategies. The impressive success of these strategies in applications such as computer vision, voice recognition and self-driving cars has stirred up hopes in their sufficiency and unlimited potential, eroding, at the same time, interest in model-based approaches.

(2) The intellectual roots of the data-centric agenda are deeply grounded in the empiricist branch of Western philosophy, according to which sense-experience is the ultimate source of all our concepts and knowledge, with little or no role given to “innate ideas” and “reason” as sources of knowledge (Markie, 2017). Empiricist ideas can be traced to the ancient writings of Aristotle, but were given prominence by the British empiricists Roger Bacon, John Locke, George Berkeley and David Hume and, more recently, by philosophers such as Charles Sanders Peirce and William James. Modern connectionism has in fact been viewed as a triumph of radical empiricism over its rationalistic rivals (Buckner, 2018; Lipton, 2015). It can certainly be viewed as a testing ground in which philosophical theories about the balance between empiricism and innateness can be submitted to experimental evaluation on digital machines.

The merits of testing philosophical theories notwithstanding, I have three major reservations about the wisdom of pursuing a radical empiricist agenda for machine learning research.  I will present three arguments why empiricism should be balanced with the principles of model-based science (Pearl, 2019), in which learning is guided by two sources of information: (a) data and (b) man-made models of how data are generated.  

I label the three arguments: (1) Expediency, (2) Transparency and (3) Explainability, and will discuss them in turn below:

1. Expediency
Evolution is too slow a process (Turing, 1950): most mutations are useless if not harmful, and waiting for natural selection to distinguish and filter the useful from the useless is often unaffordable. The bulk of machine learning tasks requires speedy interpretation of, and quick reaction to, new and sparse data, too sparse to allow filtering by random mutations. The outbreak of the COVID-19 pandemic is a perfect example of a situation where sparse data, arriving from unreliable and heterogeneous sources, required quick interpretation and quick action, based primarily on prior models of epidemic transmission and data production. In general, machine learning technology is expected to harness the huge amount of scientific knowledge already available, combine it with whatever data can be gathered, and solve crucial societal problems in areas such as health, education, ecology and economics.

Even more importantly, scientific knowledge can speed up evolution by actively guiding the selection or filtering of data and data sources. Choosing what data to consider or what experiments to run requires hypothetical theories of what outcomes are expected from each option, and how likely they are to improve future performance. Such expectations are provided, for example, by causal models that predict both the outcomes of hypothetical manipulations as well as the consequences of counterfactual undoing of past events (Pearl, 2019).

2. Transparency
World knowledge, even if evolved spontaneously from raw data, must eventually be compiled and represented in some machine form to be of any use. The purpose of compiled knowledge is to amortize the discovery process over many inference tasks without repeating the former. The compiled representation should then facilitate efficient production of answers to a select set of decision problems, including questions on ways of gathering additional data. Some representations allow for such inferences and others do not. For example, knowledge compiled as patterns of conditional probability estimates does not allow for predicting the effect of actions or policies (Pearl, 2019).
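A minimal simulation can make this limitation concrete. In the hypothetical setup below (all numbers invented for illustration), a hidden common cause Z drives both a binary action X and an outcome Y, while X itself has no effect on Y. The conditional probability P(Y|X) compiled from observational data then diverges sharply from the interventional quantity P(Y|do(X)) that action prediction requires:

```python
import random

random.seed(0)

# Hypothetical setup: a hidden confounder Z drives both a binary
# "action" X and an outcome Y, while X has no effect on Y at all.

def draw(intervene_x=None):
    z = random.random() < 0.5                          # hidden confounder
    if intervene_x is None:
        x = random.random() < (0.9 if z else 0.1)      # observationally, X follows Z
    else:
        x = intervene_x                                # do(X = x): the Z -> X link is cut
    y = random.random() < (0.8 if z else 0.2)          # Y depends on Z only
    return x, y

N = 100_000
obs = [draw() for _ in range(N)]
p_y_given_x1 = sum(y for x, y in obs if x) / sum(x for x, _ in obs)

do = [draw(intervene_x=True) for _ in range(N)]
p_y_do_x1 = sum(y for _, y in do) / N

print(p_y_given_x1)  # ~0.74: X "predicts" Y through Z
print(p_y_do_x1)     # ~0.50: forcing X leaves Y unchanged
```

No amount of additional observational data closes the gap between the two numbers; only a model of the data-generating process (here, the knowledge that Z confounds X and Y) tells us which of them answers an action query.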

Knowledge compilation involves both abstraction and re-formatting. The former may entail information loss (as in the case of probability models), while the latter retains the information content and merely transforms some of it from implicit to explicit representations.

These considerations demand that we study the mathematical properties of compiled representations, their inherent limitations, the kind of inferences they support, and how effective they are in producing the answers they are expected to produce. In more concrete terms, machine learning researchers should engage in what is currently called “causal modelling” and use the tools and principles of causal science to guide data exploration and data interpretation processes.

3. Explainability
Regardless of how causal knowledge is accumulated, discovered or stored, the inferences enabled by that knowledge are destined to be delivered to, and to benefit, a human user. Today, these usages include policy evaluation, personal decisions, generating explanations, assigning credit and blame, and making general sense of the world around us. All inferences must therefore be cast in a language that matches the way people organize their world knowledge, namely, the language of cause and effect. It is imperative, therefore, that machine learning researchers, regardless of the methods they deploy for data fitting, be versed in this user-friendly language, its grammar, its universal laws, and the way humans interpret or misinterpret the functions that machine learning algorithms discover.

It is a mistake to equate the content of human knowledge with its sense-data origin. The format in which knowledge is stored in the mind (or on a computer) and, in particular, the balance between its implicit vs. explicit components are as important for its characterization as its content or origin.  

While radical empiricism may be a valid model of the evolutionary process, it is a bad strategy for machine learning research. It gives license to the data-centric thinking currently dominating both statistics and machine learning cultures, according to which the secret to rational decisions lies in the data alone.

A hybrid strategy, balancing “data-fitting” with “data-interpretation,” better captures the stages of knowledge compilation that the evolutionary process entails.

Buckner, C. (2018) “Deep learning: A philosophical introduction,” Philosophy Compass.

Lipton, Z. (2015) “Deep Learning and the Triumph of Empiricism,” KDnuggets News, July.

Markie, P. (2017) “Rationalism vs. Empiricism,” Stanford Encyclopedia of Philosophy.

Pearl, J. (2019) “The Seven Tools of Causal Inference with Reflections on Machine Learning,” Communications of the ACM, 62(3): 54-60, March.

Turing, A.M. (1950) “Computing Machinery and Intelligence,” Mind, LIX(236): 433-460, October.

The following email exchange with Yoshua Bengio clarifies the claims and aims of the post above.

Yoshua Bengio commented Aug 3 2020 2:21 pm

Hi Judea,

Thanks for your blog post! I have a high-level comment. I will start from your statement that “learning is guided by two sources of information: (a) data and (b) man-made models of how data are generated. ” This makes sense in the kind of setting you have often discussed in your writings, where a scientist has strong structural knowledge and wants to combine it with data in order to arrive at some structural (e.g. causal) conclusions. But there are other settings where this view leaves me wanting more. For example, think about a baby before about 3 years old, before she can gather much formal knowledge of the world (simply because her linguistic abilities are not yet developed or not enough developed, not to mention her ability to consciously reason). Or think about how a chimp develops an intuitive understanding of his environment which includes cause and effect. Or about an objective to build a robot which could learn about the world without relying on human-specified theories. Or about an AI which would have as a mission to discover new concepts and theories which go well beyond those which humans provide. In all of these cases we want to study how both statistical and causal knowledge can be (jointly) discovered. Presumably this may be from observations which include changes in distribution due to interventions (our learning agent’s or those of other agents). These observations are still data, just of a richer kind than what current purely statistical models (I mean trying to capture only joint distributions or conditional distribution) are built on. Of course, we *also* need to build learning machines which can interact with humans, understand natural language, explain their decisions (and our decisions), and take advantage of what human culture has to offer. 
Not taking advantage of knowledge when we have it may seem silly, but (a) our presumed knowledge is sometimes wrong or incomplete, (b) we still want to understand how pre-linguistic intelligence manages to make sense of the world (including of its causal structure), and (c) forcing us into this more difficult setting could also hasten the discovery of the learning principles required to achieve (a) and (b).

Cheers and thanks again for your participation in our recent CIFAR workshop on causality!

— Yoshua

Judea Pearl reply, August 4 5:53 am

Hi Yoshua,
The situation you are describing: “where a scientist has strong structural knowledge and wants to combine it with data in order to arrive at some structural (e.g. causal) conclusions” motivates only the first part of my post (labeled “expediency”). But the enterprise of causal modeling brings another resource to the table. In addition to domain specific knowledge, it brings a domain-independent “template” that houses that knowledge and which is useful for precisely the “other settings” you are aiming to handle:

“a baby before about 3 years old, before she can gather much formal knowledge of the world … Or think about how a chimp develops an intuitive understanding of his environment which includes cause and effect. Or about an objective to build a robot which could learn about the world without relying on human-specified theories.”

A baby and a chimp exposed to the same stimuli will not develop the same understanding of the world, because the former starts with a richer inborn template that permits it to organize, interpret and encode the stimuli into a more effective representation. This is the role of the “compiled representations” mentioned in the second part of my post. (And by “stimuli,” I include “playful manipulations.”)

In other words, the baby’s template has a richer set of blanks to be filled than the chimp’s template, which accounts for Alison Gopnik’s finding of a greater reward-neutral curiosity in the former.

The science of Causal Modeling proposes a concrete embodiment of that universal “template.” The mathematical properties of the template, its inherent limitations, and its inferential and algorithmic capabilities should therefore be studied by every machine learning researcher, regardless of whether she obtains it from a domain expert or discovers it on her own from invariant features of the data.

Finding a needle in a haystack is difficult, and it’s close to impossible if you haven’t seen a needle before. Most ML researchers today have not seen a needle — an educational gap that needs to be corrected in order to hasten the discovery of those learning principles you aspire to uncover.

Cheers and thanks for inviting me to participate in your CIFAR workshop on causality.

— Judea

Yoshua Bengio comment Aug. 4, 7:00 am

Agreed. What you call the ‘template’ is something I sort into the machine learning category of ‘inductive biases,’ which can be fairly general and allow us to learn efficiently (and here, to discover representations which build a causal understanding of the world).

— Yoshua

July 7, 2020

Data versus Science: Contesting the Soul of Data-Science

Filed under: Book (J Pearl),Counterfactual,Data Fusion — judea @ 1:02 pm

The post below is written for the upcoming Spanish translation of The Book of Why, which was announced today. It expresses my firm belief that the current data-fitting direction taken by “Data Science” is temporary (read my lips!), that the future of “Data Science” lies in causal data interpretation and that we should prepare ourselves for the backlash swing.

Data versus Science: Contesting the Soul of Data-Science
Much has been said about how ill-prepared our health-care system was in coping with catastrophic outbreaks like COVID-19. Yet viewed from the corner of my expertise, the ill-preparedness can also be seen as a failure of information technology to keep track of and interpret the outpouring of data that has arrived from multiple and conflicting sources, corrupted by noise and omission, some by sloppy collection and some by deliberate misreporting. AI could and should have equipped society with intelligent data-fusion technology to interpret such conflicting pieces of information and reason its way out of the confusion.

Speaking from the perspective of causal inference research, I have been part of a team that has developed a complete theoretical underpinning for such “data-fusion” problems, a development that is briefly described in Chapter 10 of The Book of Why. A system based on data-fusion principles should be able to attribute disparities between Italy and China to differences in political leadership, reliability of tests and honesty in reporting, adjust for such differences, and automatically infer behavior in countries like Spain or the US. AI is in a position to add such data-interpreting capabilities on top of the data-fitting technologies currently in use and, recognizing that data are noisy, filter the noise and outsmart the noise makers.

“Data fitting” is the name I frequently use to characterize the data-centric thinking that dominates both statistics and machine learning cultures, in contrast to the “data-interpretation” thinking that guides causal inference. The data-fitting school is driven by the faith that the secret to rational decisions lies in the data itself, if only we are sufficiently clever at data mining. In contrast, the data-interpreting school views data, not as a sole object of inquiry but as an auxiliary means for interpreting reality, and “reality” stands for the processes that generate the data.

I am not alone in this assessment. Leading researchers in the “Data Science” enterprise have come to realize that machine learning as currently practiced cannot yield the kind of understanding that intelligent decision making requires. However, what many fail to realize is that the transition from data-fitting to data-understanding involves more than a technology transfer; it entails a profound paradigm shift that is traumatic if not impossible. Researchers whose entire productive careers have committed them to the supposition that all knowledge comes from the data cannot easily transfer allegiance to a totally alien paradigm, according to which extra-data information is needed, in the form of man-made causal models of reality. Current machine learning thinking, which some describe as “statistics on steroids,” is deeply entrenched in this self-propelled ideology.

Ten years from now, historians will be asking: How could scientific leaders of the time allow society to invest almost all its educational and financial resources in data-fitting technologies and so little on data-interpretation science? The Book of Why attempts to answer this dilemma by drawing parallels to historically similar situations where ideological impediments held back scientific progress. But the true answer, and the magnitude of its ramifications, will only be unravelled by in-depth archival studies of the social, psychological and economical forces that are currently governing our scientific institutions.

A related, yet perhaps more critical, topic that came up in handling the COVID-19 pandemic is the issue of personalized care. Much of current health-care practice is guided by population data, obtained from controlled experiments or observational studies. However, the task of going from these data to the level of individual behavior requires counterfactual logic, which has been formalized and algorithmized in the past two decades (as narrated in Chapter 8 of The Book of Why) and is still a mystery to most machine learning researchers.

The immediate area where this development could have assisted the COVID-19 pandemic predicament concerns the question of prioritizing patients who are in “greatest need” for treatment, testing, or other scarce resources. “Need” is a counterfactual notion (i.e., patients who would have gotten worse had they not been treated) and cannot be captured by statistical methods alone. A recently posted blog page demonstrates in vivid colors how counterfactual analysis handles this prioritization problem.
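A toy calculation (with invented counts) shows why “need” eludes statistics alone. Suppose each patient carries two potential outcomes: survival if treated and survival if untreated; a patient in greatest need is one who would die without treatment but survive with it. Two populations can then exhibit identical experimental statistics while containing very different fractions of patients in genuine need:

```python
# Hypothetical counts, invented for illustration.
# Key: (survive_if_treated, survive_if_untreated) -> number of patients.
# "Greatest need" is the response type (1, 0): dies untreated, survives treated.
pop_a = {(1, 0): 50, (1, 1): 25, (0, 1): 0, (0, 0): 25}
pop_b = {(1, 0): 75, (1, 1): 0, (0, 1): 25, (0, 0): 0}

def stats(pop):
    n = sum(pop.values())
    p_surv_treated = sum(v for (t, u), v in pop.items() if t) / n
    p_surv_untreated = sum(v for (t, u), v in pop.items() if u) / n
    p_need = pop.get((1, 0), 0) / n
    return p_surv_treated, p_surv_untreated, p_need

# Both populations yield identical experimental statistics...
assert stats(pop_a)[:2] == stats(pop_b)[:2] == (0.75, 0.25)
# ...yet the fraction of patients in genuine need differs sharply:
print(stats(pop_a)[2], stats(pop_b)[2])  # 0.5 0.75
```

Any prioritization rule driven only by the observed survival rates must treat these two populations identically, even though the right triage decisions in them differ; recovering the need fraction requires counterfactual (structural) assumptions beyond the data.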

The entire enterprise known as “personalized medicine” and, more generally, any enterprise requiring inference from populations to individuals, rests on counterfactual analysis, and AI now holds the key theoretical tools for operationalizing this analysis.

People ask me why these capabilities are not part of the standard tool sets available for handling health-care management. The answer lies, again, in training and education. We have been rushing too eagerly to reap the low-hanging fruit of big data and data-fitting technologies, at the cost of neglecting data-interpretation technologies. Data-fitting is addictive, and building more “data-science centers” only intensifies the addiction. Society is waiting for visionary leadership to balance this over-indulgence by establishing research, educational and training centers dedicated to “causal science.”

I hope it happens soon, for we must be prepared for the next pandemic outbreak and the information confusion that will probably come in its wake.

July 6, 2020

Race, COVID Mortality, and Simpson’s Paradox (by Dana Mackenzie)

Filed under: Simpson's Paradox — judea @ 1:11 pm


This post reports on the presence of Simpson’s paradox in the latest CDC data on coronavirus. At first glance, the data may seem to support the notion that coronavirus is especially dangerous to white, non-Hispanic people. However, when we take into account the causal structure of the data, and most importantly we think about what causal question we want to answer, the conclusion is quite different. This gives us an opportunity to emphasize a point that was perhaps not stressed enough in The Book of Why, namely that formulation of the right query is just as important as constructing the right causal model.

Race, COVID Mortality, and Simpson’s Paradox

Recently I was perusing the latest data on coronavirus on the Centers for Disease Control (CDC) website. When I got to the two graphs shown below, I did a double-take.


COVID-19 Cases and Deaths by Race and Ethnicity (CDC, 6/30/2020).

This is a lot to take in, so let me point out what shocked me. The first figure shows that 35.3 percent of diagnosed COVID cases were in “white, non-Hispanic” people. But 49.5 percent of COVID deaths occurred to people in this category. In other words, whites’ share of COVID deaths is 40 percent higher than their share of diagnosed cases; whites who have been diagnosed as COVID-positive die at a substantially higher rate than non-whites or Hispanics who have been diagnosed as COVID-positive.

This, of course, is the exact opposite of what we have been hearing in the news media. (For example, Graeme Wood in The Atlantic: “Black people die of COVID-19 at a higher rate than white people do.”) Have we been victimized by a media deception? The answer is NO, but the explanation underscores the importance of understanding the causal structure of data and interrogating that data using properly phrased causal queries.

Let me explain, first, why the data above cannot be taken at face value. The elephant in the room is age, which is the single biggest risk factor for death due to COVID-19. Let’s look at the CDC mortality data again, but this time stratifying by age group.

(Table: COVID-19 case and death percentages by race/ethnicity — White, non-Hispanic vs. other groups — stratified by age group. Source: CDC.)
This table shows us that in every age category (except ages 0-4), whites have a lower case fatality rate than non-whites. That is, whites make up a lower percentage of deaths than cases. But when we aggregate all of the ages, whites have a higher fatality rate. The reason is simple: whites are older.

According to U.S. census data (not shown here), 9 percent of the white population in the United States is over age 75. By comparison, only 4 percent of Black people and 3 percent of Hispanic people have reached the three-quarter-century mark. People over age 75 are exactly the ones who are at greatest risk of dying from COVID (and by a wide margin). Thus the white population contains more than twice as many high-risk people as the Black population, and three times as many high-risk people as the Hispanic population.

People who have taken a course in statistics may recognize the phenomenon we have uncovered here as Simpson’s paradox. To put it most succinctly, and most paradoxically, if you tell me that you are white and COVID-positive, but do not tell me your age, I have to assume you have a higher risk of dying than your neighbor who is Black and COVID-positive. But if you do tell me your age, your risk of dying becomes less than your neighbor who is Black and COVID-positive and the same age. How can that be? Surely the act of telling me your age should not make any difference to your medical condition.
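The arithmetic of the paradox can be reproduced with a handful of invented numbers that mimic the structure of the CDC table: within every age stratum whites have the lower case fatality rate, yet in the aggregate they have the higher one, because far more of the white cases fall in the high-risk older stratum.

```python
# Invented counts illustrating the qualitative pattern in the CDC data.
cases = {
    ("white", "young"): 1000, ("white", "old"): 1000,
    ("nonwhite", "young"): 1800, ("nonwhite", "old"): 200,
}
deaths = {
    ("white", "young"): 5, ("white", "old"): 100,        # CFR 0.5% and 10%
    ("nonwhite", "young"): 18, ("nonwhite", "old"): 30,  # CFR 1% and 15%
}

def cfr(race, age=None):
    """Case fatality rate for a race, optionally restricted to one age stratum."""
    keys = [k for k in cases if k[0] == race and age in (None, k[1])]
    return sum(deaths[k] for k in keys) / sum(cases[k] for k in keys)

assert cfr("white", "young") < cfr("nonwhite", "young")  # whites fare better per stratum
assert cfr("white", "old") < cfr("nonwhite", "old")
assert cfr("white") > cfr("nonwhite")                    # ...yet worse in the aggregate
print(cfr("white"), cfr("nonwhite"))  # 0.0525 0.024
```

Half of the hypothetical white cases are “old” versus only a tenth of the non-white cases, and that difference in age composition alone reverses the aggregate comparison.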

In introductory statistics courses, Simpson’s paradox is usually presented as a curiosity, but the COVID data shows that it raises a fundamental question. Which is a more accurate picture of reality? The one where I look only at the aggregate data and conclude that whites are at greater risk of dying, or the one where I break the data down by age and conclude that non-whites are at greater risk?

The general answer espoused by introductory statistics textbooks is: control for everything. If you have age data, stratify by age. If you have data on underlying medical conditions, or socioeconomic status, or anything else, stratify by those variables too.

This “one-size-fits-all” approach is misguided because it ignores the causal story behind the data. In The Book of Why, we look at a fictional example of a drug that is intended to prevent heart attacks by lowering blood pressure. We can summarize the causal story in a diagram:

Here blood pressure is what we call a mediator, an intervening variable through which the intervention produces its effect. We also allow for the possibility that the drug may directly influence the chances of a heart attack in other, unknown ways, by drawing an arrow directly from “Drug” to “Heart Attack.”

The diagram tells us how to interrogate the data. Because we want to know the drug’s total effect on the patient, through the intended route as well as other, unintended routes, we should not stratify the data. That is, we should not separate the experimental data into “high-blood-pressure” and “low-blood-pressure” groups. In our book, we give (fictitious) experimental data in which the drug increases the risk of heart attack among people in the low-blood-pressure group and among people in the high-blood-pressure group (presumably because of side effects). But at the same time, and most importantly, it shifts patients from the high-risk high-blood-pressure group into the low-risk low-blood-pressure group. Thus its total effect is beneficial, even though its effect on each stratum appears to be harmful.
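A short calculation with invented probabilities, in the spirit of the book’s fictitious drug example, makes this explicit: the drug looks harmful within each blood-pressure stratum, yet its total effect is beneficial because it shifts patients into the low-risk stratum.

```python
# Hypothetical probabilities (invented for illustration).
p_high_bp = {"control": 0.8, "drug": 0.2}  # the drug lowers blood pressure
p_attack = {                               # heart-attack risk by arm and stratum
    ("control", "high"): 0.30, ("control", "low"): 0.05,
    ("drug", "high"): 0.35, ("drug", "low"): 0.08,  # side effects within each stratum
}

def total_risk(arm):
    """Overall heart-attack risk, averaging over the blood-pressure strata."""
    ph = p_high_bp[arm]
    return ph * p_attack[(arm, "high")] + (1 - ph) * p_attack[(arm, "low")]

assert p_attack[("drug", "high")] > p_attack[("control", "high")]  # harmful per stratum
assert p_attack[("drug", "low")] > p_attack[("control", "low")]
assert total_risk("drug") < total_risk("control")                  # beneficial in total
print(round(total_risk("control"), 3), round(total_risk("drug"), 3))  # 0.25 0.134
```

Stratifying by blood pressure would hide exactly the pathway through which the drug does its good work, which is why the total-effect query calls for the unstratified comparison.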

It’s interesting to compare this fictitious example to the all-too-real COVID example, which I would argue has a very similar causal structure:

The causal arrow from “race” to “age” means that your race influences your chances of living to age 75 or older. In this diagram, Age is a mediator between Race and Death from COVID; that is, it is a mechanism through which Race acts. As we saw in the data, it’s quite a potent mechanism; in fact, it accounts for why white people who are COVID-positive die more often.

Because the two causal diagrams are the same, you might think that in the second case, too, we should not stratify the data; instead we should use the aggregate data and conclude that COVID is a disease that “discriminates” against whites.

However, this argument ignores the second key ingredient I mentioned earlier: interrogating the data using correctly phrased causal queries.

What is our query in this case? It’s different from what it was in the drug example. In that case, we were looking at the drug as a preventative for a heart attack. If we were to look at the COVID data in the same way, we would ask, “What is the total lifetime effect of intervening (before birth) to change a person’s race?” And yes: if we could perform that intervention, and if our sole objective was to prevent death from COVID, we would choose to change our race from white to non-white. The “benefit” of that intervention would be that we would never live to an age where we were at high risk of dying from COVID.

I’m sure you can see, without my even explaining it, that this is not the query any reasonable person would pose. “Saving” lives from COVID by making them end earlier for other reasons is not a justifiable health policy.

Thus, the query we want to interrogate the data with is not “What is the total effect?” but “What is the direct effect?” As we explain on page 312 of The Book of Why, this is always the query we are interested in when we talk about discrimination. If we want to know whether our health-care system discriminates against a certain ethnic group, then we want to hold all other variables constant that might account for the outcome, and see what is the effect of changing Race alone. In this case, that means stratifying the data by Age, and the result is that we do see evidence of discrimination. Non-whites do worse at (almost) every age. As Wood writes, “The virus knows no race or nationality; it can’t peek at your driver’s license or census form to check whether you are black. Society checks for it, and provides the discrimination on the virus’s behalf.”

To reiterate: The causal story here is identical to the Drug-Blood Pressure-Heart Attack example. What has changed is our query. Precision is required both in formulating the causal model, and in deciding what is the question we want to ask of it.

I wanted to place special emphasis on the query because I recently was asked to referee an article about Simpson’s paradox that missed this exact point. Of course I cannot tell you more about the author or the journal. (I don’t even know who the author is.) It was a good article overall, and I hope that it will be published with a suitable revision. 

In the meantime, there is plenty of room for further exploration of the coronavirus epidemic with causal models. Undoubtedly the diagram above is too simple; unfortunately, if we make it more realistic by including more variables, we may not have any data available to interrogate. In fact, even in this case there is a huge amount of missing data: 51 percent of the COVID cases have unknown race/ethnicity, and 19 percent of the deaths. Thus, while we can learn an excellent lesson about Simpson’s paradox and some probable lessons about racial inequities, we have to present the results with some caution. Finally, I would like to draw attention to something curious in the CDC data: The case fatality rate for whites in the youngest age group, ages 0-4, is much higher than for non-whites. I don’t know how to explain this, and I would think that someone with an interest in pediatric COVID cases should investigate.

