Causal Analysis in Theory and Practice

January 4, 2023

Causal Inference (CI) − A year in review

2022 has witnessed a major upsurge in the status of CI, primarily in its general recognition as an independent and essential component in every aspect of intelligent decision making. Visible evidence of this recognition came in the form of several prestigious prizes awarded explicitly to CI-related research accomplishments. These include (1) the Nobel Prize in economics, awarded to David Card, Joshua Angrist, and Guido Imbens for their work on cause-and-effect relations in natural experiments https://www.nobelprize.org/prizes/economic-sciences/2021/press-release/; (2) the BBVA Frontiers of Knowledge Award to Judea Pearl for “laying the foundations of modern AI” https://www.eurekalert.org/news-releases/942893; and (3) the Rousseeuw Prize for Statistics to Jamie Robins, Thomas Richardson, Andrea Rotnitzky, Miguel Hernán, and Eric Tchetgen Tchetgen for their “pioneering work on Causal Inference with applications in Medicine and Public Health” https://www.rousseeuwprize.org/news/winners-2022.
My acceptance speech at the BBVA award can serve as a gentle summary of the essence of causal inference, its basic challenges and major achievements: https://www.youtube.com/watch?v=uaq389ckd5o.
 
It is not a secret that I have been critical of the approach Angrist and Imbens are taking in econometrics, for reasons elaborated here https://ucla.in/2FwdsGV, and mainly here https://ucla.in/36EoNzO. I nevertheless think that their selection to receive the Nobel Prize in economics is a positive step for CI, in that it calls public attention to the problems that CI is trying to solve and will eventually inspire curious economists to seek a more broad-minded approach to these problems, so as to leverage the full arsenal of tools that CI has developed.
 
Coupled with these highlights of recognition, 2022 has seen a substantial increase in CI activities on both the academic and commercial fronts. The number of citations to CI-related articles reached a record high of over 10,200 in 2022, https://scholar.google.com/citations?user=bAipNH8AAAAJ&hl=en , showing positive derivatives in all CI categories. Dozens, if not hundreds, of seminars, workshops and symposia have been organized at major conferences to disseminate progress in CI research. New results on individualized decision making were prominently featured in these meetings (e.g., https://ucla.in/33HSkNI). Several commercial outfits have come up with platforms for CI in their areas of specialization, ranging from healthcare to finance and marketing. (Company names such as causaLens and Vianai Systems come to mind:
https://www.causalens.com/startup-series-a-funding-round-45-million/, https://arxiv.org/ftp/arxiv/papers/2207/2207.01722.pdf). Naturally, these activities have led to increasing demand for trained researchers and educators versed in the tools of CI; job openings explicitly requiring experience in CI have become commonplace in both industry and academia.
 
I am also happy to see CI becoming an issue of contention in AI and Machine Learning (ML), increasingly recognized as an essential capability for human-level AI and, simultaneously, raising the question of whether the data-fitting methodologies of Big Data and Deep Learning could ever acquire these capabilities. In https://ucla.in/3d2c2Fi I’ve answered this question in the negative, though various attempts to dismiss CI as a species of “inductive bias” (e.g., https://www.youtube.com/watch?v=02ABljCu5Zw) or “missing data problem” (e.g., https://www.jstor.org/stable/pdf/26770992.pdf) are occasionally being proposed as conceptualizations that could potentially replace the tools of CI. The Ladder of Causation tells us what extra-data information would be required to operationalize such metaphorical aspirations.
 
Researchers seeking a gentle introduction to CI are often attracted to multi-disciplinary forums or debates, where basic principles are compiled and where differences and commonalities among various approaches are compared and analyzed by leading researchers. Not many such forums were published in 2022, perhaps because the differences and commonalities are now well understood or, as I tend to believe, because CI and its Structural Causal Model (SCM) unify and embrace all other approaches. I will describe two such forums in which I participated.
 
(1) In March of 2022, the Association for Computing Machinery (ACM) published an anthology containing highlights of my works (1980-2020) together with commentaries and critiques from two dozen authors representing several disciplines. The Table of Contents can be seen here: https://ucla.in/3hLRWkV. It includes 17 of my most popular papers, annotated for context and scope, followed by 17 contributed articles from colleagues and critics. The ones most relevant to CI in 2022 are in Chapters 21-26.
 
Among these, I consider the causal resolution of Simpson’s paradox (Chapter 22, https://ucla.in/2Jfl2VS) to be one of the crowning achievements of CI. The paradox lays bare the core differences between causal and statistical thinking, and its resolution brings an end to a century of debates and controversies among the best philosophers of our time. It is also related to Lord’s Paradox (see https://ucla.in/2YZjVFL) − a qualitative version of Simpson’s Paradox which became a focus of endless debates with statisticians and trialists throughout 2022 (on Twitter @yudapearl). I often cite Simpson’s paradox as a proof that our brain is governed by causal, not statistical, calculus.
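For readers who prefer to see the reversal in numbers, here is a minimal sketch (the counts and the Python code are hypothetical illustrations written for this post, not taken from the cited chapter) of a treatment that looks better in every stratum of a covariate Z yet worse in the aggregate, together with the back-door adjustment that resolves the conflict once Z is declared a confounder:

    # Simpson's paradox in miniature (hypothetical counts).
    # counts[z][x] = (recoveries, patients) for stratum Z=z and arm X=x.
    counts = {
        "small": {"treated": (81, 87),   "control": (234, 270)},
        "large": {"treated": (192, 263), "control": (55, 80)},
    }

    def rate(rec, n):
        return rec / n

    # Within each stratum the treated arm does better ...
    for z, arms in counts.items():
        print(f"Z={z}: treated {rate(*arms['treated']):.2f} vs control {rate(*arms['control']):.2f}")

    # ... yet in the aggregate the comparison reverses.
    agg = {x: [sum(counts[z][x][i] for z in counts) for i in (0, 1)]
           for x in ("treated", "control")}
    print("aggregate:", {x: round(rate(*v), 2) for x, v in agg.items()})

    # If Z is a confounder (Z -> X, Z -> Y), the back-door formula
    # P(y | do(x)) = sum_z P(y | x, z) P(z) tells us to trust the strata.
    total = sum(counts[z][x][1] for z in counts for x in counts[z])
    pz = {z: sum(counts[z][x][1] for x in counts[z]) / total for z in counts}
    for x in ("treated", "control"):
        p_do = sum(rate(*counts[z][x]) * pz[z] for z in counts)
        print(f"P(recovery | do({x})) = {p_do:.2f}")

The adjusted quantities agree with the stratified comparison; had Z been a mediator rather than a confounder, the same data would have dictated the opposite answer, which is precisely why no statistical criterion alone can resolve the paradox.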
 
This question − causal or statistical brain − is not a cocktail party conversation but touches on the practical question of choosing an appropriate language for casting the knowledge necessary for commencing any CI exercise. Philip Dawid − a proponent of counterfactual-free statistical languages − has written a critical essay on the topic (https://www.degruyter.com/document/doi/10.1515/jci-2020-0008/html?lang=en) and my counterfactual-based rebuttal, https://ucla.in/3bXCBy3, clarifies the issues involved.
 
(2) The second forum of inter-disciplinary discussions can be found in a special issue of the journal Observational Studies https://muse.jhu.edu/pub/56/article/867085/pdf (edited by Ian Shrier, Russell Steele, Tibor Schuster and Mireille Schnitzer) in the form of interviews with Don Rubin, Jamie Robins, James Heckman and myself.
 
In my interview, https://ftp.cs.ucla.edu/pub/stat_ser/r523.pdf, I compiled aspects of CI that I normally skip in scholarly articles. These include historical perspectives on the development of CI, its current state of affairs and, most importantly for our purposes, the lingering differences between CI and other frameworks. I believe that this interview provides a fairly concise summary of these differences, which have only intensified in 2022.
 
Most disappointing to me are the graph-avoiding frameworks of Rubin, Angrist, Imbens and Heckman, which still dominate causal analysis in economics and some circles of statistics and social science. The reasons for my disappointments are summarized in the following paragraph:
Graphs are new mathematical objects, unfamiliar to most researchers in the statistical sciences, and were of course rejected as “non-scientific ad-hockery” by top leaders in the field [Rubin, 2009]. My attempts to introduce causal diagrams to statistics [Pearl, 1995; Pearl, 2000] have taught me that inertial forces play at least as strong a role in science as they do in politics. That is the reason that non-causal mediation analysis is still practiced in certain circles of social science [Hayes, 2017], “ignorability” assumptions still dominate large islands of research [Imbens and Rubin, 2015], and graphs are still tabooed in the econometric literature [Angrist and Pischke, 2014]. While most researchers today acknowledge the merits of graphs as a transparent language for articulating scientific information, few appreciate the computational role of graphs as “reasoning engines,” namely, bringing to light the logical ramifications of the information used in their construction. Some economists even go to great pains to suppress this computational miracle [Heckman and Pinto, 2015; Pearl, 2013].
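To make the phrase “reasoning engine” concrete, here is a minimal sketch (the routine is a standard textbook implementation of d-separation by ancestral moralization, written for this post; the graph and variable names are hypothetical) of how a diagram, once drawn, mechanically yields the conditional independencies implied by its construction:

    # d-separation as a mechanical "reasoning engine" (standard algorithm):
    # restrict to the ancestral graph, marry co-parents, drop arrowheads,
    # delete the conditioning set, and check whether X and Y stay disconnected.
    from collections import deque

    def parents(dag, v):
        return {u for u, children in dag.items() if v in children}

    def ancestors(dag, nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            v = stack.pop()
            for p in parents(dag, v):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def d_separated(dag, xs, ys, zs):
        keep = ancestors(dag, set(xs) | set(ys) | set(zs))
        adj = {v: set() for v in keep}
        for v in keep:
            ps = parents(dag, v) & keep
            for p in ps:                      # undirected parent-child edges
                adj[v].add(p); adj[p].add(v)
            for p in ps:                      # "marry" co-parents
                for q in ps:
                    if p != q:
                        adj[p].add(q)
        blocked = set(zs)
        frontier = deque(set(xs) - blocked)   # breadth-first search avoiding Z
        reached = set(frontier)
        while frontier:
            v = frontier.popleft()
            for w in adj[v] - blocked:
                if w not in reached:
                    reached.add(w)
                    frontier.append(w)
        return reached.isdisjoint(ys)

    # Hypothetical model: Z -> X -> Y, Z -> Y, X -> M -> Y (node: its children)
    dag = {"Z": {"X", "Y"}, "X": {"Y", "M"}, "M": {"Y"}, "Y": set()}
    print(d_separated(dag, {"X"}, {"Y"}, {"Z"}))   # False: directed paths remain
    print(d_separated(dag, {"Z"}, {"M"}, {"X"}))   # True: X screens M off from Z

Every such verdict is a testable implication of the model, derived by the graph itself rather than by the analyst’s unaided intuition.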
 
My disagreements with Heckman go back to 2007, when he rejected the do-operator for metaphysical reasons (see https://ucla.in/2NnfGPQ#page=44), and then to 2013, when he celebrated the do-operator after renaming it “fixing” but remained in denial of d-separation (see https://ucla.in/2L8OCyl). In this denial he retreated three decades in time, stripping graphs of their inferential power. Heckman’s 2022 interview in Observational Studies continues his ongoing crusade to prove that econometrics has nothing to learn from neighboring fields. His fundamental mistake lies in assuming that the rules of do-calculus lie “outside of formal statistics”; they are in fact logically derivable from formal statistics, REGARDLESS of our modeling assumptions, and (much like theorems in geometry) once established they save us the labor of going back to the basic axioms.
 
My differences with Angrist, Imbens and Rubin go even deeper (see https://ucla.in/36EoNzO), for they involve not merely the avoidance of graphs but also the First Law of Causal Inference (https://ucla.in/2QXpkYD), and hence issues of transparency and credibility. These differences are further accentuated in Imbens’s Nobel lecture https://onlinelibrary.wiley.com/doi/pdf/10.3982/ECTA21204, which treats CI as a computer science creation, irrelevant to “credible” econometric research. In https://ucla.in/2L8OCyl, as well as in my book Causality, I present dozens of simple problems that economists need to solve but cannot, for lack of the tools of CI.
 
It is amazing to watch leading researchers, in 2022, still resisting the benefits of CI while committing their respective fields to the tyranny of outdatedness.
 
To summarize, 2022 has seen an unprecedented upsurge in CI popularity, activity and stature. The challenge of harnessing CI tools to solve critical societal problems will continue to inspire creative researchers from all fields, and the aspiration of advancing towards human-level artificial intelligence will be pursued at an accelerated pace in 2023.

Wishing you a productive new year,
Judea

 

May 17, 2022

What statisticians mean by ‘Causal Inference’: Is Gelman’s blog representative?

Andrew Gelman posted a new blog on Causal Inference https://statmodeling.stat.columbia.edu/2022/05/14/causal-is-what-we-say-when-we-dont-know-what-were-doing/#comment-2053584 which I have found to be not only strange, but wrong. Among the statements that I find objectionable is the title: “Causal” is like “error term”: it’s what we say when we’re not trying to model the process.

I have posted a couple of comments there, expressing my bewilderment, and summarized them in the following statement:

Andrew,
Re-reading your post, I pause at every line that mentions “causal inference” and I say to myself: This is not my “causal inference,” and if Andrew is right that this is what statisticians mean by “causal inference,” then there are two non-intersecting kinds of “causal inference” in the world, one used by statisticians and one by people in my culture whom, for lack of a better term, I call “Causal Inference Folks.”

I cannot go over every line, but here is a glaring one: “causal inference is all about the aggregation of individual effects into average effects, and if you have a direct model for individual effects, then you just fit it directly.”

Not in my culture. I actually go from average effects to individual effects. See https://ucla.in/3aZx2eQ and https://ucla.in/33HSkNI. Moreover, I have never seen “a direct model for individual effects” unless it is an SCM. Is that what you had in mind? If so, how does it differ from a “mechanistic model”? What would I be missing if I use SCM and never mention “mechanistic models”?
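To make “from average effects to individual effects” concrete, here is a minimal sketch, with hypothetical numbers rather than the worked examples of the papers above, of how experimental averages alone already bound an individual-level quantity, the probability (PNS) that treatment is both necessary and sufficient for recovery:

    # Frechet-style bounds on PNS = P(recovery with treatment AND no recovery
    # without it), computed from two experimental averages (hypothetical numbers).
    def pns_bounds(p_y_do_treat, p_y_do_control):
        lower = max(0.0, p_y_do_treat - p_y_do_control)
        upper = min(p_y_do_treat, 1.0 - p_y_do_control)
        return lower, upper

    # Hypothetical RCT: 70% recover under treatment, 40% under control.
    lo, hi = pns_bounds(0.70, 0.40)
    print(f"{lo:.2f} <= PNS <= {hi:.2f}")   # 0.30 <= PNS <= 0.60

Adding observational data can narrow these bounds further, which is what the individualized decision-making results cited above exploit.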

Bottom line, your post reinforces my explicit distinction between “statisticians” and “causal inference folks” to the point where I can hardly see an overlap. To make it concrete, let me ask a quantitative question: How many “statisticians” do you know who subscribe to the First Law of Causal Inference https://ucla.in/2QXpkYD, or to the Ladder of Causation https://ucla.in/2URVLZW, or to the backdoor criterion, etc.? These are foundational notions that we “causal inference folks” consider to be the DNA of our culture, without which we are back in the pre-1990 era.

For us, “Causal” is not like “error term”: it’s what we say when we ARE trying to model the process.

December 28, 2020

The Domestication of Causal Reasoning

Filed under: Causal models, Deep Learning, Deep Understanding — judea @ 10:51 pm

1. Introduction

On Wednesday December 23 I had the honor of participating in “AI Debate 2”, a symposium organized by Montreal AI, which brought together an impressive group of scholars to discuss the future of AI. I spoke on

“The Domestication of Causal Reasoning: Cultural and Methodological Implications,”

and the reading list I proposed as background material was: 

  1. “The Seven Tools of Causal Inference with Reflections on Machine Learning,” July 2018 https://ucla.in/2HI2yyx
  2. “Radical Empiricism and Machine Learning Research,” July 26, 2020 https://ucla.in/32YKcWy
  3. “Data versus Science: Contesting the Soul of Data-Science,” July 7, 2020 https://ucla.in/3iEDRVo

The debate was recorded here https://montrealartificialintelligence.com/aidebate2/ and my talk can be accessed here: https://youtu.be/gJW3nOQ4SEA

Below is an edited script of my talk.

2. What I would have said had I been given six (6), instead of three (3) minutes

This is the first time I am using the word “domestication” to describe what happened in causality-land in the past 3 decades. I’ve used other terms before: “democratization,” “mathematization,” or “algorithmization,” but Domestication sounds less provocative when I come to talk about the causal revolution.

What makes it a “revolution” is seeing dozens of practical and conceptual problems that only a few decades ago were thought to be metaphysical or unsolvable give way to simple mathematical solutions.

“DEEP UNDERSTANDING” is another term used here for the first time. It so happened that, while laboring to squeeze out results from causal inference engines, I came to realize that we are sitting on a gold mine, and that what we are dealing with is none other than:

A computational model of a mental state that deserves the title “Deep Understanding” 

“Deep Understanding” is not the nebulous concept that you probably think it is, but something that is defined formally as any system capable of covering all three levels of the causal hierarchy: What is – What if – Only if. More specifically: What if I see (prediction) – What if I do (intervention) – and What if I had acted differently (retrospection, in light of the outcomes observed).
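A toy structural causal model makes the three levels tangible. The model below is hypothetical and exists only to illustrate the distinctions: a “seeing” query is answered by passive simulation, a “doing” query by overriding an equation, and an “imagining” query by the abduction-action-prediction steps of counterfactual reasoning:

    # A toy SCM (hypothetical) exercising all three levels of the hierarchy.
    import random
    random.seed(0)

    def scm(u, x=None):
        """Structural equations: X := U (unless intervened on), Y := X xor U."""
        X = u if x is None else x
        Y = (X + u) % 2
        return X, Y

    us = [random.randint(0, 1) for _ in range(100_000)]   # exogenous background
    obs = [scm(u) for u in us]                             # passive observations

    # Level 1 (seeing):   P(Y=1 | X=1) read off the observed data
    p_see = sum(y for x, y in obs if x == 1) / sum(1 for x, y in obs if x == 1)

    # Level 2 (doing):    P(Y=1 | do(X=1)) by overriding the equation for X
    p_do = sum(scm(u, x=1)[1] for u in us) / len(us)

    # Level 3 (imagining): among units observed with X=1 and Y=0, what would
    # Y have been had X been 0? (abduction of U, then action, then prediction)
    cf = [scm(u, x=0)[1] for u, (x, y) in zip(us, obs) if x == 1 and y == 0]
    p_cf = sum(cf) / len(cf)

    print(p_see, p_do, p_cf)   # roughly 0.0, 0.5, 1.0 respectively

The three numbers differ because they answer three different questions; no amount of re-processing the observational data alone can produce the second or the third.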

This may sound like cheating – I take the capabilities of one system (i.e., a causal model) and I posit them as a general criterion for defining a general concept such as: “Deep Understanding.”

It isn’t cheating. Given that causal reasoning is so deeply woven into our day to day language, our thinking, our sense of justice, our humor and of course our scientific understanding, I think that it won’t be too presumptuous of me to propose that we take Causal Modeling as a testing ground of ideas on other modes of reasoning associated with “understanding.”

Specifically, causal models should provide an arena for theories of explanation, fairness, adaptation, imagination, humor, consciousness, free will, attention, and curiosity.

I also dare speculate that learning from the way causal reasoning was domesticated would benefit researchers in other areas of AI, including vision and NLP, and enable them to examine whether similar paths could be pursued to overcome obstacles that data-centric paradigms have imposed.

I would like now to say a few words on the Anti-Cultural implications of the Causal revolution. Here I refer you to my blog post, https://ucla.in/32YKcWy where I argue that radical empiricism is a stifling culture. It lures researchers into a data-centric paradigm, according to which Data is the source of all knowledge rather than a window through which we learn about the world around us.

What I advocate is a hybrid system that supplements data with domain knowledge, commonsense constraints, culturally transmitted concepts, and most importantly, our innate causal templates that enable toddlers to quickly acquire an understanding of their toy-world environment.

It is hard to find a needle in a haystack, and it is much harder still if you haven’t seen a needle before. The module we are using for causal inference gives us a picture of what the needle looks like and what you can do once you find one.

October 14, 2020

Causally Colored Reflections on Leo Breiman’s “Statistical Modeling: The Two Cultures” (2001) https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

Filed under: Causal models — judea @ 9:38 am

Enticed by a recent seminar on this subject, I have re-read Breiman’s influential paper and would like to share with readers a re-assessment of its contributions to the art of statistical modeling.

When the paper first appeared, in 2001, I had the impression that, although the word “cause” did not appear explicitly, Breiman was trying to distinguish data-descriptive models from models of the data-generation process, also called “causal,” “substantive,” “subject-matter,” or “structural” models. Unhappy with his over-emphasis on prediction, I was glad nevertheless that a statistician of Breiman’s standing had recognized the on-going confusion in the field, and was calling for making the distinction crisp.

Upon re-reading the paper in 2020 I realized that the two cultures contrasted by Breiman are not descriptive vs. causal but, rather, two styles of descriptive modeling, one interpretable, the other uninterpretable. The former is exemplified by predictive regression models, the latter by modern big-data algorithms such as deep learning, BART, trees and forests. The former carries the potential of being interpreted as causal; the latter leaves no room for such interpretation, since it describes the prediction process chosen by the analyst, not the data-generation process chosen by nature. Breiman’s main point is: If you want prediction, do prediction for its own sake and forget about the illusion of representing nature.

Breiman’s paper deserves its reputation as a forerunner of modern machine learning techniques, but it falls short of telling us what we should do if we want the model to do more than just prediction, say, to extract some information about how nature works, or to guide policies and interventions. For him, accurate prediction is the ultimate measure of merit for statistical models, an objective shared by the present-day machine learning enterprise, which accounts for many of its limitations (https://ucla.in/2HI2yyx).

In their comments on Breiman’s paper, David Cox and Bradley Efron noticed this deficiency and wrote:

“… fit, which is broadly related to predictive success, is not the primary basis for model choice and formal methods of model choice that take no account of the broader objectives are suspect. [The broader objectives are:] to establish data descriptions that are potentially causal.” (Cox, 2001)

And Efron concurs:

“Prediction by itself is only occasionally sufficient. … Most statistical surveys have the identification of causal factors as their ultimate goal.” (Efron, 2001)

As we read Breiman’s paper today, armed with what we know about the proper symbiosis of machine learning and causal modeling, we may say that his advocacy of algorithmic prediction was justified. Once guided by a causal model for identification and bias reduction, the predictive component of our model can safely be trusted to non-interpretable algorithms. The interpretation can be accomplished separately by the causal component of our model, as demonstrated, for example, in https://ucla.in/2HI2yyx.
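As a minimal sketch of this division of labor (the data-generating process below is hypothetical, and scikit-learn is used merely as a stand-in for any black-box learner), the causal model contributes the identification step, here a back-door adjustment over a confounder Z, while an uninterpretable algorithm is entrusted with the data-fitting step of estimating E[Y | X, Z]:

    # Back-door adjustment with a black-box predictor (hypothetical data).
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(0)
    n = 20_000
    z = rng.normal(size=n)                              # confounder Z
    x = (z + rng.normal(size=n) > 0).astype(float)      # Z -> X
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)          # X -> Y <- Z, true effect 2

    # Data-fitting step: any flexible, uninterpretable learner for E[Y | X, Z].
    model = GradientBoostingRegressor().fit(np.column_stack([x, z]), y)

    # Identification step, dictated by the causal model:
    # E[Y | do(X=x)] = E_Z [ E[Y | X=x, Z] ]   (back-door adjustment over Z)
    def mean_y_under_do(xval):
        return model.predict(np.column_stack([np.full(n, xval), z])).mean()

    ate = mean_y_under_do(1.0) - mean_y_under_do(0.0)
    naive = y[x == 1].mean() - y[x == 0].mean()
    print(f"adjusted effect ~ {ate:.2f} (truth 2.0); naive difference ~ {naive:.2f}")

The graph decides what to average over; the learner is free to fit the conditional expectation however it pleases.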

Separating data-fitting from interpretation, an idea that was rather innovative in 2001, has withstood the test of time.

Judea


ADDENDUM-1

Andrew Gelman’s reflections on Breiman’s paper can be found in this article http://www.stat.columbia.edu/~gelman/research/published/gelman_breiman.pdf.


ADDENDUM-2

The following is an email exchange between Ying Nian Wu (UCLA, Statistics) and Judea Pearl (UCLA, Computer Science/Statistics).

1. Ying Nian Wu to J. Pearl, October 12, 2020

Dear Judea,

I feel all models are about making predictions for future observations. The only difference is that a causal model is to predict p(y|do(x)) in your notation, where the testing data (after cutting off the arrows into x by your diagram surgery) come from a different distribution than the training data, i.e., we want to extrapolate from training data to testing data (in fact, extrapolation and interpolation are relative — a simple model that can interpolate a vast range is quite extrapolative). Ultimately a machine learning model also wants to achieve extrapolative prediction, such as the so-called transfer learning and meta learning, where testing data are different from training data, or the current short-term experience (small new training data) is different from the past long-term experience (big past training data).

About learning the model from data, we can learn p(y|x), but we can also learn p(y, x) = p(y) p(x|y). We may call p(y|x) predictive, and p(x|y) (or p(y, x)) generative, and both may involve hidden variables z. The generative model can learn from data where y is often unavailable (the so-called semi-supervised learning). In fact, learning a generative model p(y, z, x) = p(z) p(y, x|z) is necessary for predicting p(y|do(x)). I am not sure if this is also related to the two cultures mentioned by Breiman. I once asked him (at a workshop at Banff, while enjoying some second-hand smoking) about the two models, and he actually preferred the generative model, although in his talk he also emphasized that a non-parametric predictive model such as a forest is still interpretable in terms of assessing the influences of variables.

To digress a bit further, there is no such thing as “how nature works” according to the Copenhagen interpretation of quantum physics: there must be an observer, the observer makes a measurement, and the wave function predicts the probability distribution of the measurement. As for the question of what happens when there is no observer or the observer is not observing, the answer is that such a question is irrelevant.

Even back in the classical regime where we can ask such a question, Ptolemy’s epicycle model of planetary motion, Newton’s model of gravitation, and Einstein’s model of general relativity are not that different. Ptolemy’s model is actually more general and flexible (being a Fourier expansion, where the cycle on top of cycles is similar in style to the perceptron on top of perceptrons of a neural network). Newton’s model is simpler, while Einstein’s model fits the data better (being equally simple but more involved in calculation). They are all illusions about how nature works, learned from the data, and intended to predict future data. Newton’s illusion is action at a distance (which he himself did not believe), while Einstein’s illusion is the bending of spacetime, which is more believable, but still an illusion nonetheless (to be superseded by a deeper illusion such as a string).

So Box is still right: all models are wrong, but some are useful. Useful in terms of making predictions, especially making extrapolative predictions.
Ying Nian

2. J. Pearl to Ying Nian Wu, October 14, 2020

Dear Ying Nian,
Thanks for commenting on my “Causally Colored Reflections.” 

I will start from the end of your comment, where you concur with George Box that “All models are wrong, but some are useful.” I have always felt that this aphorism is painfully true but hardly useful. As one of the most quoted aphorisms in statistics, it ought to have given us some clue as to what makes one model more useful than another – it doesn’t.

A taxonomy that helps decide model usefulness should tell us (at the very least) whether a given model can answer the research question we have in mind, and where the information encoded in the model comes from. Lumping all models into one category, as in “all models are about making predictions for future observations,” does not provide this information. It reminds me of Don Rubin’s statement that causal inference is just a “missing data problem” which, naturally, raises the question of what problems are NOT missing data problems, say, mathematics, chess or astrology.

In contrast, the taxonomy defined by the Ladder of Causation (see https://ucla.in/2HI2yyx): 1. Association, 2. Intervention, 3. Counterfactuals, does provide such information. Merely by looking at the syntax of a model one can tell whether it can answer the target research question, and where the information supporting the model should come from, be it observational studies, experimental data, or theoretical assumptions. The main claim of the Ladder (now a theorem) is that one cannot answer questions at level i unless one has information of type i or higher. For example, there is no way to answer policy-related questions unless one has experimental data or assumptions about such data. As another example, I look at what you call a generative model p(y,z,x) = p(z)p(y, x|z) and I can tell right away that, no matter how smart we are, it is not sufficient for predicting p(y|do(x)).
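A two-variable version of the argument can even be run as a simulation. The two hypothetical models below induce exactly the same joint distribution, yet disagree on p(y|do(x)); no amount of fitting the joint, however flexibly, can decide between them:

    # Two structural models, same joint P(x, y), different P(y | do(x)).
    import random
    random.seed(0)

    def model_A(do_x=None):            # X -> Y
        x = random.randint(0, 1) if do_x is None else do_x
        y = x                          # Y := X
        return x, y

    def model_B(do_x=None):            # Y -> X
        y = random.randint(0, 1)       # Y := coin flip
        x = y if do_x is None else do_x
        return x, y

    N = 100_000
    for name, m in [("A", model_A), ("B", model_B)]:
        joint = sum(1 for _ in range(N) if m() == (1, 1)) / N
        p_do = sum(m(do_x=1)[1] for _ in range(N)) / N
        print(f"model {name}: P(x=1, y=1) ~ {joint:.2f}, P(y=1 | do(x=1)) ~ {p_do:.2f}")

    # Both report P(x=1, y=1) ~ 0.50, yet P(y=1 | do(x=1)) is ~1.00 in A and ~0.50 in B.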

If you doubt the usefulness of this taxonomy, just examine the amount of effort spent (and still being spent) by the machine learning community on the so-called “transfer learning” problem. This effort has been futile because elementary inspection of the extrapolation task tells us that it cannot be accomplished using non-experimental data, shifting or not. See https://ucla.in/2N7S0K9.

In summary, unification of research problems is helpful when it facilitates the transfer of tools across problem types. Taxonomy of research problems is helpful too, for it spares us the effort of trying the impossible and tells us where we should seek the information to support our models.

Thanks again for engaging in this conversation,
Judea

3.  Wu to J. Pearl, October 14, 2020

Dear Judea,
Thanks for the inspiring discussion. Please allow me to formulate our consensus, and I will stop here.

Unification 1: All models are for prediction.
Unification 2: All models are for the agent to plan the action.
Unification 2 is deeper than Unification 1, but Unification 1 is a good precursor.

Taxonomy 1: (a) models that predict p(y|x); (b) models that predict p(y|do(x)); (c) models that can fill in Rubin’s table.
Taxonomy 2: (a) models that fit data, not necessarily make sense, only for prediction. (b) models that understand how nature works and are interpretable.

Taxonomy 1 is deeper and more precise than Taxonomy 2, thanks to the foundational work of you and Rubin. It is based on precise, well-defined, operational mathematical language and formulation.

Taxonomy 2 is useful and is often aligned with Taxonomy 1, but we need to be aware of the limitation of Taxonomy 2, which is all I want to say in my comments. Much ink has been spilled on Taxonomy 2 because of its imprecise and non-operational nature.
Ying Nian

July 26, 2020

Radical Empiricism and Machine Learning Research

Filed under: Causal models, Knowledge representation, Machine learning — judea @ 7:02 pm

A speaker at a lecture that I have attended recently summarized the philosophy of machine learning this way: “All knowledge comes from observed data, some from direct sensory experience and some from indirect experience, transmitted to us either culturally or genetically.”

The statement was taken as self-evident by the audience, and set the stage for a lecture on how the nature of “knowledge” can be analyzed by examining patterns of conditional probabilities in the data. Naturally, it invoked no notions such as “external world,” “theory,” “data generating process,” “cause and effect,” “agency,” or “mental constructs” because, ostensibly, these notions, too, should emerge from the data if needed. In other words, whatever concepts humans invoke in interpreting data, be their origin cultural, scientific or genetic, can be traced to, and re-derived from, the original sensory experience that endowed those concepts with survival value.

Viewed from an artificial intelligence perspective, this data-centric philosophy offers an attractive, if not seductive, agenda for machine learning research: in order to develop human-level intelligence, we should merely trace the way our ancestors did it, and simulate both genetic and cultural evolution on a digital machine, taking as input all the data that we can possibly collect. Taken to extremes, such an agenda may inspire fairly futuristic and highly ambitious scenarios: start with a simple neural network, resembling a primitive organism (say an amoeba), let it interact with the environment, mutate and generate offspring; given enough time, it will eventually emerge with an Einstein-level intellect. Indeed, ruling out sacred scriptures and divine revelation, where else could Einstein have acquired his knowledge, talents and intellect if not from the stream of raw data that has impinged upon the human race since antiquity, including of course all the sensory inputs received by the more primitive organisms preceding humans.

Before asking how realistic this agenda is, let us preempt the discussion with two observations:

(1) Simulated evolution, in some form or another, is indeed the leading paradigm inspiring most machine learning researchers today, especially those engaged in connectionism, deep learning and neural network technologies, which deploy model-free, statistics-based learning strategies. The impressive success of these strategies in applications such as computer vision, voice recognition and self-driving cars has stirred up hopes in the sufficiency and unlimited potential of these strategies, eroding, at the same time, interest in model-based approaches.

(2) The intellectual roots of the data-centric agenda are deeply grounded in the empiricist branch of Western philosophy, according to which sense-experience is the ultimate source of all our concepts and knowledge, with little or no role given to “innate ideas” and “reason” as sources of knowledge (Markie, 2017). Empiricist ideas can be traced to the ancient writings of Aristotle, but were given prominence by the British empiricists Francis Bacon, John Locke, George Berkeley and David Hume and, more recently, by philosophers such as Charles Sanders Peirce and William James. Modern connectionism has in fact been viewed as a triumph of radical empiricism over its rationalistic rivals (Buckner, 2018; Lipton, 2015). It can definitely be viewed as a testing ground on which philosophical theories about the balance between empiricism and innateness can be submitted to experimental evaluation on digital machines.

The merits of testing philosophical theories notwithstanding, I have three major reservations about the wisdom of pursuing a radical empiricist agenda for machine learning research.  I will present three arguments why empiricism should be balanced with the principles of model-based science (Pearl, 2019), in which learning is guided by two sources of information: (a) data and (b) man-made models of how data are generated.  

I label the three arguments: (1) Expediency, (2) Transparency, and (3) Explainability, and will discuss them in turn below:

1. Expediency
Evolution is too slow a process (Turing, 1950), since most mutations are useless if not harmful, and waiting for natural selection to distinguish and filter the useful from the useless is often unaffordable. The bulk of machine learning tasks requires speedy interpretation of, and quick reaction to, new and sparse data, too sparse to allow filtering by random mutations. The outbreak of the COVID-19 pandemic is a perfect example of a situation where sparse data, arriving from unreliable and heterogeneous sources, required quick interpretation and quick action, based primarily on prior models of epidemic transmission and data production (https://ucla.in/3iEDRVo). In general, machine learning technology is expected to harness a huge amount of scientific knowledge already available, combine it with whatever data can be gathered, and solve crucial societal problems in areas such as health, education, ecology and economics.

Even more importantly, scientific knowledge can speed up evolution by actively guiding the selection or filtering of data and data sources. Choosing what data to consider or what experiments to run requires hypothetical theories of what outcomes are expected from each option, and how likely they are to improve future performance. Such expectations are provided, for example, by causal models that predict both the outcomes of hypothetical manipulations as well as the consequences of counterfactually undoing past events (Pearl, 2019).

2. Transparency
World knowledge, even if evolved spontaneously from raw data, must eventually be compiled and represented in some machine form to be of any use. The purpose of compiled knowledge is to amortize the discovery process over many inference tasks without repeating the former. The compiled representation should then facilitate an efficient production of answers to a select set of decision problems, including questions on ways of gathering additional data. Some representations allow for such inferences and others do not. For example, knowledge compiled as patterns of conditional probability estimates does not allow for predicting the effect of actions or policies (Pearl, 2019).

Knowledge compilation involves both abstraction and re-formatting. The former allows for information loss (as in the case of probability models), while the latter retains the information content and merely transforms some of the information from implicit to explicit representations.

These considerations demand that we study the mathematical properties of compiled representations, their inherent limitations, the kind of inferences they support, and how effective they are in producing the answers they are expected to produce. In more concrete terms, machine learning researchers should engage in what is currently called “causal modelling” and use the tools and principles of causal science to guide data exploration and data interpretation processes.

3. Explainability
Regardless of how causal knowledge is accumulated, discovered or stored, the inferences enabled by that knowledge are destined to be delivered to, and to benefit, a human user. Today, these usages include policy evaluation, personal decisions, generating explanations, assigning credit and blame, and making general sense of the world around us. All inferences must therefore be cast in a language that matches the way people organize their world knowledge, namely, the language of cause and effect. It is imperative, therefore, that machine learning researchers, regardless of the methods they deploy for data fitting, be versed in this user-friendly language, its grammar, its universal laws and the way humans interpret or misinterpret the functions that machine learning algorithms discover.

Conclusions
It is a mistake to equate the content of human knowledge with its sense-data origin. The format in which knowledge is stored in the mind (or on a computer) and, in particular, the balance between its implicit vs. explicit components are as important for its characterization as its content or origin.  

While radical empiricism may be a valid model of the evolutionary process, it is a bad strategy for machine learning research. It gives license to the data-centric thinking currently dominating both the statistics and machine learning cultures, according to which the secret to rational decisions lies in the data alone.

A hybrid strategy balancing “data-fitting” with “data-interpretation” better captures the stages of knowledge compilation that the evolutionary process entails.

References:
Buckner, C. (2018) “Deep learning: A philosophical introduction,” Philosophy Compass, https://doi.org/10.1111/phc3.12625.

Lipton, Z. (2015) “Deep Learning and the Triumph of Empiricism,” KDnuggets News, July. Retrieved from: https://www.kdnuggets.com/2015/07/deep-learning-triumph-empiricism-over-theoretical-mathematical-guarantees.html.

Markie, P. (2017) “Rationalism vs. Empiricism,” Stanford Encyclopedia of Philosophy, https://plato.stanford.edu/entries/rationalism-empiricism/.

Pearl, J. (2019) “The Seven Tools of Causal Inference with Reflections on Machine Learning,” Communications of the ACM, 62(3): 54-60, March, https://cacm.acm.org/magazines/2019/3/234929-the-seven-tools-of-causal-inference-with-reflections-on-machine-learning/fulltext.

Turing, A.M. (1950) “Computing Machinery and Intelligence,” Mind, LIX(236): 433-460, October, https://doi.org/10.1093/mind/LIX.236.433.


The following email exchange with Yoshua Bengio clarifies the claims and aims of the post above.

Yoshua Bengio commented Aug 3 2020 2:21 pm

Hi Judea,

Thanks for your blog post! I have a high-level comment. I will start from your statement that “learning is guided by two sources of information: (a) data and (b) man-made models of how data are generated. ” This makes sense in the kind of setting you have often discussed in your writings, where a scientist has strong structural knowledge and wants to combine it with data in order to arrive at some structural (e.g. causal) conclusions. But there are other settings where this view leaves me wanting more. For example, think about a baby before about 3 years old, before she can gather much formal knowledge of the world (simply because her linguistic abilities are not yet developed or not enough developed, not to mention her ability to consciously reason). Or think about how a chimp develops an intuitive understanding of his environment which includes cause and effect. Or about an objective to build a robot which could learn about the world without relying on human-specified theories. Or about an AI which would have as a mission to discover new concepts and theories which go well beyond those which humans provide. In all of these cases we want to study how both statistical and causal knowledge can be (jointly) discovered. Presumably this may be from observations which include changes in distribution due to interventions (our learning agent’s or those of other agents). These observations are still data, just of a richer kind than what current purely statistical models (I mean trying to capture only joint distributions or conditional distribution) are built on. Of course, we *also* need to build learning machines which can interact with humans, understand natural language, explain their decisions (and our decisions), and take advantage of what human culture has to offer. Not taking advantage of knowledge when we have it may seem silly, but (a) our presumed knowledge is sometimes wrong or incomplete, (b) we still want to understand how pre-linguistic intelligence manages to make sense of the world (including of its causal structure), and (c) forcing us into this more difficult setting could also hasten the discovery of the learning principles required to achieve (a) and (b).

Cheers and thanks again for your participation in our recent CIFAR workshop on causality!

— Yoshua

Judea Pearl reply, August 4 5:53 am

Hi Yoshua,
The situation you are describing: “where a scientist has strong structural knowledge and wants to combine it with data in order to arrive at some structural (e.g. causal) conclusions” motivates only the first part of my post (labeled “expediency”). But the enterprise of causal modeling brings another resource to the table. In addition to domain specific knowledge, it brings a domain-independent “template” that houses that knowledge and which is useful for precisely the “other settings” you are aiming to handle:

“a baby before about 3 years old, before she can gather much formal knowledge of the world … Or think about how a chimp develops an intuitive understanding of his environment which includes cause and effect. Or about an objective to build a robot which could learn about the world without relying on human-specified theories.”

A baby and a chimp exposed to the same stimuli will not develop the same understanding of the world, because the former starts with a richer inborn template that permits it to organize, interpret and encode the stimuli into a more effective representation. This is the role of the “compiled representations” mentioned in the second part of my post. (And by “stimuli” I include “playful manipulations.”)

In other words, the baby’s template has a richer set of blanks to be filled than the chimp’s template, which accounts for Alison Gopnik’s finding of a greater reward-neutral curiosity in the former.

The science of Causal Modeling proposes a concrete embodiment of that universal “template.” The mathematical properties of the template, its inherent limitations, and its inferential and algorithmic capabilities should therefore be studied by every machine learning researcher, regardless of whether she obtains it from a domain expert or discovers it on her own from invariant features of the data.

Finding a needle in a haystack is difficult, and it’s close to impossible if you haven’t seen a needle before. Most ML researchers today have not seen a needle — an educational gap that needs to be corrected in order to hasten the discovery of those learning principles you aspire to uncover.

Cheers and thanks for inviting me to participate in your CIFAR workshop on causality.

— Judea

Yoshua Bengio comment Aug. 4, 7:00 am

Agreed. What you call the ‘template’ is something I sort in the machine learning category of ‘inductive biases’ which can be fairly general and allow us to efficiently learn (and here discover representations which build a causal understanding of the world).

— Yoshua
