Causal Analysis in Theory and Practice

July 7, 2020

Data versus Science: Contesting the Soul of Data-Science

Filed under: Book (J Pearl), Counterfactual, Data Fusion — judea @ 1:02 pm

Summary
The post below is written for the upcoming Spanish translation of The Book of Why, which was announced today. It expresses my firm belief that the current data-fitting direction taken by “Data Science” is temporary (read my lips!), that the future of “Data Science” lies in causal data interpretation, and that we should prepare ourselves for the backlash swing.

Data versus Science: Contesting the Soul of Data-Science
Much has been said about how ill-prepared our health-care system was in coping with catastrophic outbreaks like COVID-19. Yet viewed from the corner of my expertise, the ill-preparedness can also be seen as a failure of information technology to keep track of and interpret the outpouring of data arriving from multiple and conflicting sources, corrupted by noise and omission, some by sloppy collection and some by deliberate misreporting. AI could and should have equipped society with intelligent data-fusion technology, to interpret such conflicting pieces of information and reason its way out of the confusion.

Speaking from the perspective of causal inference research, I have been part of a team that has developed a complete theoretical underpinning for such “data-fusion” problems, a development that is briefly described in Chapter 10 of The Book of Why. A system based on data-fusion principles should be able to attribute disparities between Italy and China to differences in political leadership, reliability of tests and honesty in reporting, adjust for such differences, and automatically infer behavior in countries like Spain or the US. AI is in a position to add such data-interpreting capabilities on top of the data-fitting technologies currently in use and, recognizing that data are noisy, to filter the noise and outsmart the noise makers.

“Data fitting” is the name I frequently use to characterize the data-centric thinking that dominates both statistics and machine learning cultures, in contrast to the “data-interpretation” thinking that guides causal inference. The data-fitting school is driven by the faith that the secret to rational decisions lies in the data itself, if only we are sufficiently clever at data mining. In contrast, the data-interpreting school views data not as a sole object of inquiry but as an auxiliary means for interpreting reality, where “reality” stands for the processes that generate the data.

I am not alone in this assessment. Leading researchers in the “Data Science” enterprise have come to realize that machine learning as it is currently practiced cannot yield the kind of understanding that intelligent decision making requires. However, what many fail to realize is that the transition from data-fitting to data-understanding involves more than a technology transfer; it entails a profound paradigm shift that is traumatic, if not impossible. Researchers whose entire productive careers have committed them to the supposition that all knowledge comes from the data cannot easily transfer allegiance to a totally alien paradigm, according to which extra-data information is needed, in the form of man-made causal models of reality. Current machine learning thinking, which some describe as “statistics on steroids,” is deeply entrenched in this self-propelled ideology.

Ten years from now, historians will be asking: How could the scientific leaders of the time allow society to invest almost all its educational and financial resources in data-fitting technologies and so little in data-interpretation science? The Book of Why attempts to answer this question by drawing parallels to historically similar situations where ideological impediments held back scientific progress. But the true answer, and the magnitude of its ramifications, will only be unraveled by in-depth archival studies of the social, psychological and economic forces that are currently governing our scientific institutions.

A related, yet perhaps more critical, topic that came up in handling the COVID-19 pandemic is the issue of personalized care. Most current health-care methods and procedures are guided by population data, obtained from controlled experiments or observational studies. However, the task of going from these data to the level of individual behavior requires counterfactual logic, which has been formalized and algorithmized in the past two decades (as narrated in Chapter 8 of The Book of Why) and is still a mystery to most machine learning researchers.

The immediate area where this development could have assisted in the COVID-19 predicament concerns the question of prioritizing patients who are in “greatest need” of treatment, testing, or other scarce resources. “Need” is a counterfactual notion (i.e., patients who would have gotten worse had they not been treated) and cannot be captured by statistical methods alone. A recently posted blog page, https://ucla.in/39Ey8sU, demonstrates in vivid colors how counterfactual analysis handles this prioritization problem.
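To give a flavor of what is involved, here is a minimal sketch in Python, with entirely hypothetical numbers (it is not the analysis of the linked page). The counterfactual quantity behind “need” is the probability of benefit: the probability that a patient would recover if treated and would not recover if left untreated. This quantity is not computable from any single study, but it can be bounded, and the bounds tighten when experimental and observational data are combined (Tian and Pearl, 2000).

  # Probability of benefit: P(recovery if treated AND no recovery if untreated).
  # All numbers below are hypothetical, for illustration only.

  # Experimental (RCT) findings:
  p_y_do_t = 0.70               # P(recovery | do(treatment))
  p_y_do_c = 0.50               # P(recovery | do(no treatment))

  # Observational joint distribution:
  p_t_y, p_t_yn = 0.45, 0.05    # P(treated, recovered), P(treated, not recovered)
  p_c_y, p_c_yn = 0.15, 0.35    # P(untreated, recovered), P(untreated, not recovered)
  p_y = p_t_y + p_c_y           # P(recovered) in the observed population

  # Bounds using the experimental data alone:
  lb_exp = max(0.0, p_y_do_t - p_y_do_c)
  ub_exp = min(p_y_do_t, 1.0 - p_y_do_c)

  # Tighter bounds obtained by fusing the two data sources:
  lb = max(lb_exp, p_y - p_y_do_c, p_y_do_t - p_y)
  ub = min(ub_exp, p_t_y + p_c_yn, p_y_do_t - p_y_do_c + p_t_yn + p_c_y)

  print(f"experimental data alone:      P(benefit) in [{lb_exp:.2f}, {ub_exp:.2f}]")
  print(f"experimental + observational: P(benefit) in [{lb:.2f}, {ub:.2f}]")

With these made-up numbers the bounds shrink from [0.20, 0.50] to [0.20, 0.40]. The point is that the quantity itself is defined over pairs of counterfactual outcomes for the same patient, not over any single observed distribution, which is why no amount of clever data-fitting can replace the counterfactual analysis.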

The entire enterprise known as “personalized medicine” and, more generally, any enterprise requiring inference from populations to individuals, rests on counterfactual analysis, and AI now holds the key theoretical tools for operationalizing this analysis.

People ask me why these capabilities are not part of the standard tool sets available for handling health-care management. The answer lies again in training and education. We have been rushing too eagerly to reap the low-hanging fruit of big data and data-fitting technologies, at the cost of neglecting data-interpretation technologies. Data-fitting is addictive, and building more “data-science centers” only intensifies the addiction. Society is waiting for visionary leadership to balance this over-indulgence by establishing research, educational and training centers dedicated to “causal science.”

I hope it happens soon, for we must be prepared for the next pandemic outbreak and the information confusion that will probably come in its wake.

December 11, 2019

Generalizing Experimental Results by Leveraging Knowledge of Mechanisms

Filed under: Data Fusion, Generalizability, Identification — Judea Pearl @ 8:44 pm

In a recent post (and papers), Anders Huitfeldt and co-authors have discussed ways of achieving external validity in the presence of “effect heterogeneity.” These results are not immediately inferable using a standard (non-parametric) selection diagram, which has led them to conclude that selection diagrams may not be helpful for “thinking more closely about effect heterogeneity” and, thus, might be “throwing the baby out with the bathwater.”

Taking a closer look at the analysis of Anders and co-authors, and using their very same examples, we came to quite different conclusions. In those cases, transportability is not immediately inferable in a fully nonparametric structural model for a simple reason: it relies on functional constraints on the structural equation of the outcome. Once these constraints are properly incorporated in the analysis, all results flow naturally from the structural model, and selection diagrams prove to be indispensable for thinking about heterogeneity, for extrapolating results across populations, and for protecting analysts from unwarranted generalizations. See details in the full note.

March 10, 2018

Challenging the Hegemony of Randomized Controlled Trials: Comments on Deaton and Cartwright

Filed under: Data Fusion, RCTs — Judea Pearl @ 12:20 am

I was asked to comment on a recent article by Angus Deaton and Nancy Cartwright (D&C), which touches on the foundations of causal inference. The article is titled: “Understanding and misunderstanding randomized controlled trials,” and can be viewed here: https://goo.gl/x6s4Uy

My comments are a mixture of a welcome and a puzzle; I welcome D&C’s stand on the status of randomized trials, and I am puzzled by how they choose to articulate the alternatives.

D&C’s main theme is as follows: “We argue that any special status for RCTs is unwarranted. Which method is most likely to yield a good causal inference depends on what we are trying to discover as well as on what is already known.” (Quoted from their introduction)

As a veteran challenger of the supremacy of the RCT, I welcome D&C’s challenge wholeheartedly. Indeed, “The Book of Why” (forthcoming, May 2018, http://bayes.cs.ucla.edu/WHY/) quotes me as saying:
“If our conception of causal effects had anything to do with randomized experiments, the latter would have been invented 500 years before Fisher.” In this, as well as in my other writings, I go so far as to claim that the RCT earns its legitimacy by mimicking the do-operator, not the other way around. In addition, considering the practical difficulties of conducting an ideal RCT, observational studies have a definite advantage: they interrogate populations in their natural habitats, not in artificial environments choreographed by experimental protocols.

Deaton and Cartwright’s challenge of the supremacy of the RCT consists of two parts:

  1. The first (internal validity) deals with the curse of dimensionality and argues that, in any single trial, the outcome of the RCT can be quite distant from the target causal quantity, which is usually the average treatment effect (ATE). In other words, this part concerns imbalance due to finite samples, and reflects the traditional bias-precision tradeoff in statistical analysis and machine learning.
  2. The second part (external validity) deals with biases created by inevitable disparities between the conditions and populations under study and those prevailing in the actual implementation of the treatment program or policy. Here, Deaton and Cartwright propose alternatives to the RCT, calling for the integration of a web of multiple information sources, including observational, experimental, quasi-experimental, and theoretical inputs, all collaborating towards the goal of estimating “what we are trying to discover”.

My only qualm with D&C’s proposal is that, in their passion to advocate the integration strategy, they have failed to notice that, in the past decade, a formal theory of integration strategies has emerged from the brewery of causal inference and is currently ready and available for empirical researchers to use. I am referring, of course, to the theory of Data Fusion, which formalizes the integration scheme in the language of causal diagrams and provides theoretical guarantees of feasibility and performance (see http://www.pnas.org/content/pnas/113/27/7345.full.pdf).

Let us examine closely D&C’s main motto: “Which method is most likely to yield a good causal inference depends on what we are trying to discover as well as on what is already known.” Clearly, to cast this advice in practical settings, we must devise notation, vocabulary, and logic to represent “what we are trying to discover” as well as “what is already known,” so that we can infer the former from the latter. To accomplish this nontrivial task we need tools, theorems and algorithms to assure us that what we conclude from our integrated study indeed follows from those precious pieces of knowledge that are “already known.” D&C are notably silent about the language and methodology in which their proposal should be carried out. One is left wondering, therefore, whether they intend their proposal to remain an informal, heuristic guideline, similar to Bradford Hill’s criteria of the 1960s, or to be explicated in some theoretical framework that can distinguish valid from invalid inference. If they aspire to embed their integration scheme within a coherent framework, then they should celebrate: such a framework has been worked out and is now fully developed.

To be more specific, the Data Fusion theory described in http://www.pnas.org/content/pnas/113/27/7345.full.pdf provides us with notation to characterize the nature of each data source, the nature of the population interrogated, whether the source is an observational or experimental study, and which variables are randomized and which are measured; finally, the theory tells us how to fuse all these sources together to synthesize an estimand of the target causal quantity in the target population. Moreover, if we feel uncomfortable about the assumed structure of any given data source, the theory tells us whether an alternative source can furnish the needed information and whether we can weaken any of the model’s assumptions.
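For readers who want to see the simplest instance of this machinery in action, here is a toy sketch in Python. It assumes, purely for illustration, a selection diagram in which the study and target populations differ only in the distribution of an effect modifier Z (say, age group); under that assumption the theory licenses the familiar re-weighting formula P*(y | do(x)) = sum_z P(y | do(x), z) P*(z), with the z-specific experimental findings taken from the study and the distribution of Z taken from the target.

  # Toy transport sketch; the diagram assumption and all numbers are hypothetical.
  # Z-specific experimental results from the study population:
  p_y_do_x_given_z = {"young": 0.10, "old": 0.30}    # P(Y=1 | do(X=1), Z=z)

  # Distribution of Z in the study vs. the target population:
  p_z_study  = {"young": 0.7, "old": 0.3}
  p_z_target = {"young": 0.3, "old": 0.7}

  study_risk  = sum(p_y_do_x_given_z[z] * p_z_study[z]  for z in p_z_study)
  target_risk = sum(p_y_do_x_given_z[z] * p_z_target[z] for z in p_z_target)
  print(f"risk under do(X=1): {study_risk:.2f} in the study, {target_risk:.2f} in the target")

The role of the selection diagram is precisely to tell us when this simple re-weighting is licensed, when a different estimand is called for, and when no re-weighting whatsoever can do the job.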

Those familiar with Data Fusion theory will find it difficult to understand why D&C have not utilized it as a vehicle to demonstrate the feasibility of their proposed alternatives to RCT’s. This enigma stands out in D&C’s description of how modern analysis can rectify the deficiencies of RCTs, especially those pertaining to generalizing across populations, extrapolating across settings, and controlling for selection bias.

Here is what D&C’s article says about extrapolation (Quoting from their Section 3.5, “Re-weighting and stratifying”): “Pearl and Bareinboim (2011, 2014) and Bareinboim and Pearl (2013, 2014) provide strategies for inferring information about new populations from trial results that are more general than re-weighting. They suppose we have available both causal information and probabilistic information for population A (e.g. the experimental one), while for population B (the target) we have only (some) probabilistic information, and also that we know that certain probabilistic and causal facts are shared between the two and certain ones are not. They offer theorems describing what causal conclusions about population B are thereby fixed. Their work underlines the fact that exactly what conclusions about one population can be supported by information about another depends on exactly what causal and probabilistic facts they have in common.”

The text is accurate up to this point, but then it changes gears and states: “But as Muller (2015) notes, this, like the problem with simple re-weighting, takes us back to the situation that RCTs are designed to avoid, where we need to start from a complete and correct specification of the causal structure. RCTs can avoid this in estimation which is one of their strengths, supporting their credibility but the benefit vanishes as soon as we try to carry their results to a new context.” I believe D&C miss the point about re-weighting and stratifying.

First, it is not the case that “this takes us back to the situation that RCTs are designed to avoid.” It actually takes us to a more manageable situation. RCTs are designed to neutralize the confounding of treatments, whereas our methods are designed to neutralize differences between populations. Researchers may be totally ignorant of the structure of the former and quite knowledgeable about the structure of the latter. To neutralize selection bias, for example, we need to make assumptions about the process of recruiting subjects for the trial, a process over which we have some control. There is a fundamental difference, therefore, between assumptions about the covariates that determine patients’ choice of treatment and those that govern the selection of subjects — the latter is (partially) under our control. Replacing one set of assumptions with another, more defensible set does not “take us back to the situation that RCTs are designed to avoid.” It actually takes us forward, towards the ultimate goal of causal inference — to base conclusions on scrutinizable assumptions, and to base their plausibility on scientific or substantive grounds.

Second, D&C overlook the significance of the “completeness” results established for transportability problems (see http://ftp.cs.ucla.edu/pub/stat_ser/r390-L.pdf). Completeness tells us, in essence, that one cannot do any better. In other words, it delineates precisely the minimum set of assumptions that are needed to establish a consistent estimate of causal effects in the target population. If any of those assumptions are violated, we know that we can only do worse. From a mathematical (and philosophical) viewpoint, this is the most one can expect analysis to do for us and, therefore, completeness renders the generalizability problem “solved.”

Finally, the completeness result highlights the broad implications of the Data Fusion theory, and how it brings D&C’s desiderata closer to becoming a working methodology. Completeness tells us that any envisioned strategy of study integration is either embraceable in the structure-based framework of Data Fusion, or it is not workable in any framework. This means that one cannot dismiss the conclusions of Data Fusion theory on the grounds that “its assumptions are too strong.” If a set of assumptions is deemed necessary in the Data Fusion analysis, then it is necessary, period; it cannot be avoided or relaxed unless it is supplemented by other assumptions elsewhere, and the algorithm can tell you where.

It is hard to see therefore why any of D&C’s proposed strategies would resist formalization, analysis and solution within the current logic of modern causal inference.

It took more than a dozen years for researchers to accept the notion of completeness in the context of internal validity. Utilizing the tools of the do-calculus (Pearl, 1995; Tian and Pearl, 2001; Shpitser and Pearl, 2006), completeness tells us what assumptions are absolutely needed for nonparametric identification of causal effects, how to tell if they are satisfied in any specific problem description, and how to use them to extract causal parameters from non-experimental studies. Completeness in the context of external validity is a relatively new result (see http://ftp.cs.ucla.edu/pub/stat_ser/r443.pdf), which will probably take a few more years for enlightened researchers to accept, appreciate, and fully utilize. One purpose of this post is to urge the research community, especially Deaton and Cartwright, to study the recent mathematization of external validity and to benefit from its implications.

I would be very interested in seeing other readers’ reactions to D&C’s article, as well as to my optimistic assessment of what causal inference can do for us in this day and age. I have read the reactions of Andrew Gelman (on his blog) and Stephen J. Senn (on Deborah Mayo’s blog, https://errorstatistics.com/2018/01/), but they seem to be unaware of the latest developments in Data Fusion analysis. I also invite Angus Deaton and Nancy Cartwright to share a comment or two on these issues. I hope they respond positively.

Looking forward to your comments,
Judea


Addendum to “Challenging the Hegemony of RCTs”
Upon re-reading the post above, I realized that I had assumed readers to be familiar with Data Fusion theory. This Addendum is aimed at readers who are not familiar with the theory, and who would probably be asking: “Who needs a new theory to do what statistics does so well?” “Once we recognize the importance of diverse sources of data, statistics can be helpful in making decisions and quantifying uncertainty.” [Quoted from Andrew Gelman’s blog]. The reason I question the sufficiency of statistics to manage the integration of diverse sources of data is that statistics lacks the vocabulary needed for the job. Let us demonstrate this with a couple of toy examples, taken from BP-2015 (http://ftp.cs.ucla.edu/pub/stat_ser/r450-reprint.pdf).

Example 1
Suppose we wish to estimate the average causal effect of X on Y, and we have two diverse sources of data:

  1. An RCT in which Z, not X, is randomized, and
  2. An observational study in which X, Y, and Z are measured.

What substantive assumptions are needed to facilitate a solution to our problem? Put another way, how can we be sure that, once we make those assumptions, we can solve our problem?

Example 2
Suppose we wish to estimate the average causal effect ACE of X on Y, and we have two diverse sources of data:

  1. An RCT in which the effect of X on both Y and Z is measured, but the recruited subjects had non-typical values of Z.
  2. An observational study conducted in the target population, in which both X and Z (but not Y) were measured.

What substantive assumptions would enable us to estimate ACE, and how should we combine data from the two studies so as to synthesize a consistent estimate of ACE?

The nice thing about a toy example is that the solution is known to us in advance, so we can check any proposed solution for correctness. Curious readers can find the solutions for these two examples in http://ftp.cs.ucla.edu/pub/stat_ser/r450-reprint.pdf. More ambitious readers will probably try to solve them using statistical techniques, such as meta-analysis or partial pooling. The reason I am confident that the second group will end up disappointed comes from a profound statement made by Nancy Cartwright in 1989: “No Causes In, No Causes Out”. It means not only that you need substantive assumptions to derive causal conclusions; it also means that the vocabulary of statistical analysis, since it is built entirely on properties of distribution functions, is inadequate for expressing those substantive assumptions that are needed for getting causal conclusions. In our examples, although part of the data is provided by an RCT, and hence carries causal information, one can still show that the needed assumptions must invoke causal vocabulary; distributional assumptions are insufficient. As someone versed in both graphical modeling and counterfactuals, I would go even further and state that it would be a miracle if anyone succeeded in translating the needed assumptions into a comprehensible language other than causal diagrams. (See http://ftp.cs.ucla.edu/pub/stat_ser/r452-reprint.pdf, Appendix, Scenario 3.)
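To make this “check for correctness” concrete, here is a small Python simulation for a setting like Example 1. I stress that the structure assumed below (an unobserved confounder U affecting both X and Y, with X influencing Y only through Z) is an illustrative choice, not necessarily the structure intended in the paper. Under that assumed diagram the fused estimand is P(y | do(x)) = sum_z P(z | x) P(y | do(z)), with P(z | x) taken from the observational study and P(y | do(z)) from the RCT on Z; the simulation checks this estimate against the ground truth obtained by actually intervening on X.

  import numpy as np

  rng = np.random.default_rng(0)
  N = 500_000

  def simulate(do_x=None, do_z=None):
      """Draw binary samples from the assumed structural model:
      U -> X, U -> Y, X -> Z, Z -> Y, with U unobserved."""
      u = rng.binomial(1, 0.5, N)
      x = np.full(N, do_x) if do_x is not None else rng.binomial(1, 0.2 + 0.6 * u)
      z = np.full(N, do_z) if do_z is not None else rng.binomial(1, 0.3 + 0.5 * x)
      y = rng.binomial(1, 0.1 + 0.4 * z + 0.4 * u)
      return x, z, y

  # Source 1: observational study measuring X, Y and Z (nothing randomized).
  x_obs, z_obs, _ = simulate()

  # Source 2: an RCT in which Z (not X) is randomized; it gives P(Y=1 | do(Z=z)).
  p_y_do_z = [simulate(do_z=z)[2].mean() for z in (0, 1)]

  def p_y_do_x(x):
      """Fusion estimate of P(Y=1 | do(X=x)) = sum_z P(z | x) * P(y | do(z))."""
      p_z1 = z_obs[x_obs == x].mean()                 # P(Z=1 | X=x) from obs data
      return (1 - p_z1) * p_y_do_z[0] + p_z1 * p_y_do_z[1]

  fused_ace = p_y_do_x(1) - p_y_do_x(0)

  # Ground truth, available only because this is a simulation: intervene on X.
  truth_ace = simulate(do_x=1)[2].mean() - simulate(do_x=0)[2].mean()
  print(f"fused ACE estimate: {fused_ace:.3f}   ground truth: {truth_ace:.3f}")

Note that the causal vocabulary does the heavy lifting here: the very statement “X influences Y only through Z, and Z is unconfounded with X” cannot be written as a constraint on the joint distribution of X, Y and Z alone, which is exactly Cartwright’s point.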

Armed with these examples and findings, we can go back and re-examine why D&C do not embrace the Data Fusion methodology in their quest for integrating diverse sources of data. The answer, I conjecture, is that D&C were not intimately familiar with what this methodology offers us, and how vastly different it is from previous attempts to operationalize Cartwright’s dictum: “No causes in, no causes out”.
Judea
