# Causal Analysis in Theory and Practice

## April 29, 2021

### Personalized Decision Making

Filed under: Counterfactual — Scott Mueller @ 5:53 pm

Scott Mueller and Judea Pearl

## Abstract

Personalized decision making targets the behavior of a specific individual, while population-based decision making concerns a sub-population resembling that individual. This paper clarifies the distinction between the two and explains why the former leads to more informed decisions. We further show that by combining experimental and observational studies we can obtain valuable information about individual behavior and, consequently, improve decisions over those obtained from experimental studies alone.

## Introduction

The purpose of this paper is to provide a conceptual understanding of the distinction between personalized and population-based decision making, and to demonstrate both the advantages of the former and how it could be achieved.

Formally, this distinction is captured in the following two causal effects. Personalized decision making optimizes the Individual Causal Effect (ICE):

$$\text{ICE}(u) = Y(1,u)-Y(0,u) \tag{1}$$

where $$Y(x,u)$$ stands for the outcome that individual $$u$$ would attain had decision $$x \in \{1, 0\}$$ been taken. In contrast, population-based decision making optimizes the Conditional Average Causal Effect (CACE):

$$\text{CACE}(u) = E[Y(1,u') - Y(0,u') \mid C(u') = C(u)] \tag{2}$$

where $$C(u)$$ stands for a vector of characteristics observed on individual $$u$$, and the average is taken over all units $$u'$$ that share these characteristics.

We will show in this paper that the two objective functions lead to different decision strategies and that, although $$\text{ICE}(u)$$ is in general not identifiable, informative bounds can nevertheless be obtained by combining experimental and observational studies. We will further demonstrate how these bounds can improve decisions that would otherwise be taken using $$\text{CACE}(u)$$ as an objective function.

The paper is organized as follows. Section 2 will demonstrate, using an extreme example, two rather surprising findings. First, that population data are capable of providing decisive information on individual response and, second, that non-experimental data, usually discarded as bias-prone, can add information (regarding individual response) beyond that provided by a Randomized Controlled Trial (RCT) alone. Section 3 will generalize these findings using a more realistic example, and will further demonstrate how critical decisions can be made using the information obtained, along with its ramifications for both the targeted individual and a population-minded policy maker. Section 4 casts the findings of Section 3 in a numerical setting, allowing for a quantitative appreciation of the magnitudes involved. This analysis leads to actionable policies that guarantee risk-free benefits in certain populations.

## Preliminary Semi-qualitative Example

Our target of analysis is an individual's response to a given treatment, namely, how an individual would react if given the treatment and if denied the treatment. Since no individual can be subjected to both treatment and its denial, the individual's response function must be inferred from population data, originating from one or several studies. We therefore ask: to what degree can population data inform us about an individual's response?

Before tackling this general question, we wish to address two conceptual hurdles. First, why should population data provide any information whatsoever on an individual response? And second, why should non-experimental data add any information (regarding individual response) to what we can learn from an RCT alone? The next simple example will demonstrate both points.

We conduct an RCT and find no difference between treatment (drug) and control (placebo): say $$10\%$$ die in both the treatment and control groups, while the rest ($$90\%$$) survive. This leads us to conclude that the drug is ineffective, but it also leaves us uncertain between (at least) two competing models:

• Model-1 — The drug has no effect whatsoever on any individual and
• Model-2 — The drug saves $$10\%$$ of the population and kills another $$10\%$$.

From a policy maker's viewpoint the two models may be deemed equivalent: the drug has zero average effect on the target population. But from an individual's viewpoint the two models differ substantially in the sets of risks and opportunities they offer. According to Model-1, the drug is useless but safe. According to Model-2, however, the drug may be deemed dangerous by some and a life-saver by others.

To see how such attitudes may emerge, assume, for the sake of argument, that the drug also provides temporary pain relief. Model-1 would be deemed desirable and safe by all, whereas Model-2 would scare away those who do not urgently need the pain relief, while offering a glimpse of hope to those whose suffering has become unbearable, and who would be ready to risk death for the chance ($$10\%$$) of recovery (hoping, of course, to be among the lucky beneficiaries).

This simple example will also allow us to illustrate the second theme of our paper – the crucial role of observational studies. We will now show that supplementing the RCT with an observational study on the same population (conducted, for example, by an independent survey of patients who have the option of taking or avoiding the drug) would allow us to decide between the two models, totally changing our understanding of what risks await an individual taking the drug.

Consider an extreme case where the observational study shows $$100\%$$ survival in both drug-choosing and drug-avoiding patients, as if each patient knew in advance where danger lies and managed to avoid it. Such a finding, though extreme and unlikely, immediately rules out Model-1, which claims no treatment effect on any individual. This is because the mere fact that patients succeed $$100\%$$ of the time in avoiding harm where harm does exist (revealed through the $$10\%$$ death rate in the randomized trial) means that choice makes a difference, contrary to Model-1’s claim that choice makes no difference.

The reader will surely see that the same argument applies when the probability of survival among option-having individuals is not precisely $$100\%$$ but simply higher (or lower) than the probability of survival in the RCT. Using the RCT study alone, in contrast, we were unable to rule out Model-1, or even to distinguish Model-1 from Model-2.

We now present another edge case where Model-2, rather than Model-1, is ruled out as impossible. Assume the observational study informs us that all those who chose the drug died and all who avoided the drug survived. It seems that drug-choosers were truly dumb while drug-avoiders knew precisely what’s good for them. This is perfectly feasible, but it also tells us that no one can be cured by the drug, contrary to the assertion made by Model-2, that the drug cures $$10\%$$ and kills $$10\%$$. To be cured, a person must survive if treated and die if not treated. But none of the drug-choosers were cured, because they all died, and none of the drug avoiders were cured because they all survived. Thus, Model-2 cannot explain these observational results, and must be ruled out.
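The two edge cases can be checked mechanically using two of the bounds on the probability of benefit derived by Tian and Pearl (2000). The sketch below is ours, not from the source: the survey does not report $$P(t)$$, the fraction choosing the drug, so we assume $$P(t)=0.5$$ for illustration; both conclusions hold for any $$0 < P(t) < 1$$.

```python
# Checking the two edge cases with two of the Tian-Pearl (2000) bounds on
# P(benefit) = P(survive if treated AND die if untreated):
#   lower bound: P(y) - P(y_c)        (y = survival observed in the survey)
#   upper bound: P(t, y) + P(c, y')   (t/c = chose drug / avoided drug)

P_YT = 0.90   # RCT survival under treatment
P_YC = 0.90   # RCT survival under control

def partial_bounds(p_y_given_t, p_y_given_c, p_t):
    p_c = 1.0 - p_t
    p_y = p_t * p_y_given_t + p_c * p_y_given_c          # observational P(y)
    lower = max(0.0, p_y - P_YC)                          # P(y) - P(y_c)
    upper = min(P_YT, p_t * p_y_given_t + p_c * (1.0 - p_y_given_c))
    return lower, upper

# Case 1: 100% survival among both drug-choosers and drug-avoiders.
lo1, _ = partial_bounds(1.0, 1.0, 0.5)
# lo1 = 0.10 > 0: someone must benefit, ruling out Model-1 (no effect on anyone).

# Case 2: all drug-choosers die, all drug-avoiders survive.
_, hi2 = partial_bounds(0.0, 1.0, 0.5)
# hi2 = 0: no one can be cured by the drug, ruling out Model-2 (10% cured).
```

In Case 1 the lower bound exceeds zero regardless of $$P(t)$$, since observed survival $$P(y)=1$$; in Case 2 the upper bound is zero regardless of $$P(t)$$, since no drug-chooser survives and no drug-avoider dies.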

Now that we have demonstrated conceptually how certain combinations of observational and experimental data can provide information on individual behavior that neither study alone can, we are ready to move to a more realistic motivating example which, based on the theoretical bounds derived in (Tian and Pearl, 2000), establishes individual behavior for any combination of observational and experimental data [1] and, moreover, demonstrates the critical decision-making ramifications of the information obtained.

##### Footnote
[1] The example we will work out happens to be identifiable due to the particular combination of data; in general, the data may not permit point estimates of individual causal effects.

## Motivating Numerical Example

The next example to be considered deals with the effect of a drug on two subpopulations, males and females. Unlike the extreme case considered in Section 2, the drug is found to be somewhat effective for both males and females and, in addition, deaths are found to occur in the observational study as well.

To cast the story in a realistic setting, we imagine the testing of a new drug, aimed to help patients suffering from a deadly disease. An RCT is conducted to evaluate the efficacy of the drug and finds it to be $$28\%$$ effective in both males and females. In other words, $$\text{CACE}(\text{male}) = \text{CACE}(\text{female}) = 0.28$$. The drug is approved and, after a year of use, a follow-up randomized study is conducted, yielding the same results; namely, CACE remained 0.28, and men and women remained totally indistinguishable in their responses, as shown in Table 1.

Experimental

| | $$do(\text{drug})$$ | $$do(\text{no drug})$$ | $$\text{CACE}$$ |
| --- | --- | --- | --- |
| Female Survivals | 489/1000 (49%) | 210/1000 (21%) | 28% |
| Male Survivals | 490/1000 (49%) | 210/1000 (21%) | 28% |

Table 1: Female vs male CACE

Female Data

| | Experimental $$do(\text{drug})$$ | Experimental $$do(\text{no drug})$$ | Observational $$\text{drug}$$ | Observational $$\text{no drug}$$ |
| --- | --- | --- | --- | --- |
| Survivals | 489 (49%) | 210 (21%) | 378 (27%) | 420 (70%) |
| Deaths | 511 (51%) | 790 (79%) | 1,022 (73%) | 180 (30%) |
| Total | 1,000 (50%) | 1,000 (50%) | 1,400 (70%) | 600 (30%) |

Table 2: Female survival and recovery data

Male Data

| | Experimental $$do(\text{drug})$$ | Experimental $$do(\text{no drug})$$ | Observational $$\text{drug}$$ | Observational $$\text{no drug}$$ |
| --- | --- | --- | --- | --- |
| Survivals | 490 (49%) | 210 (21%) | 980 (70%) | 420 (70%) |
| Deaths | 510 (51%) | 790 (79%) | 420 (30%) | 180 (30%) |
| Total | 1,000 (50%) | 1,000 (50%) | 1,400 (70%) | 600 (30%) |

Table 3: Male survival and recovery data
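As a small sanity check (the code and variable names are ours), the observational survival rates quoted in the text, together with the overall observational survival $$P(y)$$ used by the bounds later on, can be recomputed from the raw counts in Tables 2 and 3:

```python
# Observational counts from Tables 2 and 3: (survivals, total) per arm.
female = {"drug": (378, 1400), "no_drug": (420, 600)}
male = {"drug": (980, 1400), "no_drug": (420, 600)}

def rates(obs):
    """Survival rate per arm, plus overall observational survival P(y)."""
    per_arm = {arm: s / n for arm, (s, n) in obs.items()}
    total_s = sum(s for s, n in obs.values())
    total_n = sum(n for s, n in obs.values())
    return per_arm, total_s / total_n

f_arms, f_py = rates(female)   # rates ≈ {'drug': 0.27, 'no_drug': 0.70}, P(y) ≈ 0.399
m_arms, m_py = rates(male)     # rates ≈ {'drug': 0.70, 'no_drug': 0.70}, P(y) ≈ 0.70
```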

Let us focus on the second RCT (Table 1), since the first was used for drug approval only, and its findings are the same as the second. The RCT tells us that there was a $$28\%$$ improvement, on average, in taking the drug compared to not taking the drug. This was the case among both females and males: $$\text{CACE}(\text{female}) = \text{CACE}(\text{male}) = 0.28$$, where $$do(\text{drug})$$ and $$do(\text{no-drug})$$ are the treatment and control arms in the RCT. It thus appears reasonable to conclude that the drug has a net remedial effect on some patients and that every patient, be it male or female, should be advised to take the drug and benefit from its promise of increasing one’s chances of recovery (by $$28\%$$).

At this point, the drug manufacturer ventured to find out to what degree people actually buy the approved drug, following its recommended usage. A market survey was conducted (an observational study) and revealed that only $$70\%$$ of men and $$70\%$$ of women actually chose to take the drug; problems with side effects and rumors of unexpected deaths may have caused the other $$30\%$$ to avoid it. A careful examination of the observational study further revealed substantial differences in the survival rates of men and women who chose to use the drug (shown in Tables 2 and 3). The rate of recovery among drug-choosing men was exactly the same as that among drug-avoiding men ($$70\%$$ for each), but the rate of recovery among drug-choosing women was $$43$$ percentage points lower than among drug-avoiding women ($$0.27$$ vs $$0.70$$, in Table 2). It appears as though many women who chose the drug were already in an advanced stage of the disease, which may account for their low recovery rate of $$27\%$$.

At this point, having data from both experimental and observational studies we can estimate the individual treatment effects for both a typical man and a typical woman. Quantitative analysis shows (see Section 4) that, with the data above, the drug affects men markedly differently from the way it affects women. Whereas a woman has a $$28\%$$ chance of benefiting from the drug and no danger at all of being harmed by it, a man has a $$49\%$$ chance of benefiting from it and as much as a $$21\%$$ chance of dying because of it — a serious cause for concern. Note that based on the experimental data alone (Table 1), no difference at all can be noticed between men and women.

The ramifications of these findings for personal decision making are enormous. First, they tell us that the drug is not as safe as the RCT would have us believe; it may cause death in a sizable fraction of patients. Second, they tell us that a woman is totally clear of such dangers and should have no hesitation in taking the drug, unlike a man, who faces a decision: a $$21\%$$ chance of being harmed by the drug is cause for concern. Physicians, likewise, should be aware of the risks involved before recommending the drug to a man. Third, the data tell policy makers what the overall societal benefit would be if the drug were administered to women only: $$28\%$$ of the drug-takers would survive who would otherwise die. Finally, knowing the relative sizes of the benefiting vs harmed subpopulations swings open the door to finding the mechanisms responsible for the differences, as well as to identifying measurable markers that characterize those subpopulations.

For example:

• In the same way that our analysis identified “Sex” as an important feature, separating those who are harmed from those saved by the drug, we can leverage other measured features, say family history, a genetic marker, or a side-effect, and check whether they shrink the sizes of the susceptible subpopulations. The result would be a set of features that approximate responses at the individual level. Note again that absent observational data and a calculus for combining them with the RCT data, we would not be able to identify such informative features; a feature like “Sex” would be deemed irrelevant, since men and women were indistinguishable in our RCT studies.
• Our ability to identify relevant informative features as described above can be leveraged to amplify the potential benefits of the drug. For example, if we identify a marker characterizing men who would die only if they take the drug, and prevent those patients from taking it, the drug would cure $$62\%$$ of the male patients allowed to use it. This is because we withhold the drug from the $$21\%$$ who would have been killed by it; those patients now survive, so a total of $$70\%$$ of male patients survive thanks to this combination of marker identification and selective drug administration. This unveils an enormous potential of the drug at hand, which was totally concealed by the $$28\%$$ effectiveness estimated in the RCT studies.
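The arithmetic behind the $$62\%$$ and $$70\%$$ figures can be checked directly. Using the male benefit and harm fractions quoted in the text ($$49\%$$ and $$21\%$$), and noting that $$P(y_t) = 0.49$$ leaves no room for always-survivors, a short calculation (variable names are ours; the screening marker is hypothetical) reproduces both numbers:

```python
# Male response-type fractions implied by the analysis in the text:
# 49% benefit from the drug, 21% are harmed by it, and since
# P(y_t) = benefit + always-survivors = 0.49, there are no always-survivors.
benefit = 0.49                           # survive iff treated
harm = 0.21                              # survive iff untreated
always = 0.49 - benefit                  # = 0.0
doomed = 1.0 - benefit - harm - always   # = 0.30, die either way

# Policy: use a (hypothetical) marker to withhold the drug from the harmed
# group and administer it to everyone else.
allowed = 1.0 - harm                     # 0.79 of men may take the drug
cure_rate = benefit / allowed            # 0.49 / 0.79 ≈ 0.62
survival = benefit + always + harm       # harmed men now survive untreated: 0.70
```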

## How the Results Were Obtained

For the purpose of analysis, let us denote by $$y_t$$ recovery among the RCT treatment group and by $$y_c$$ recovery among the RCT control group. The causal effects for the treatment and control groups, $$P(y_t|\text{Gender})$$ and $$P(y_c|\text{Gender})$$, were the same [2]; no differences were noted between males and females.

In addition to the above RCT [3], we posited an observational study (survey) conducted on the same population. Let us denote $$P(y|t, \text{Gender})$$ and $$P(y|c, \text{Gender})$$ as recovery among the drug-choosers and recovery among the drug-avoiders, respectively.

With this notation at hand, our problem is to compute the probability of benefit

$$P(\text{benefit}) = P(y_t, y'_c) \tag{3}$$

from the following data sources: $$P(y_t)$$, $$P(y_c)$$, $$P(y|t)$$, $$P(y|c)$$, and $$P(t)$$. The first two denote the data obtained from the RCT and the last three, data obtained from the survey. Eq. (3) should be interpreted as the probability that an individual would both recover if assigned to the RCT treatment arm and die if assigned to control [4].

Connecting the experimental and observational data is an important assumption known as consistency (Pearl, 2009, 2010) [5]. We assume that the units selected for an observational or experimental study are drawn from the same population and that their response to treatment is purely biological, unaffected by their respective settings.

In other words, the outcome of a person choosing the drug would be the same had this person been assigned to the treatment group in an RCT study. Similarly, if we observe someone avoiding the drug, their outcome is the same as if they were in the control group of our RCT. Deviation from consistency, normally attributed to uncontrolled “placebo effects”, should be dealt with by explicitly representing such factors in the model.

In terms of our notation, consistency implies:

$$P(y_t|t)= P(y|t), \qquad P(y_c|c)= P(y|c). \tag{4}$$

In words, the probability that a drug-chooser would recover in the treatment arm of the RCT, $$P(y_t|t)$$, is the same as the probability of recovery in the observational study, $$P(y|t)$$.

Based on this assumption, and leveraging both experimental and observational data, Tian and Pearl (2000) derived the following tight bounds on the probability of benefit, as defined in (3):

$$\max\left\{\begin{array}{c} 0,\\ P(y_t) - P(y_c),\\ P(y) - P(y_c),\\ P(y_t) - P(y) \end{array}\right\} \leqslant P(\text{benefit}) \leqslant \min\left\{\begin{array}{c} P(y_t),\\ P(y'_c),\\ P(t,y) + P(c,y'),\\ P(y_t) - P(y_c)\ +\\ \ P(t, y') + P(c, y) \end{array}\right\}. \tag{5}$$

Here $$P(y'_c)$$ stands for $$1-P(y_c)$$, namely the probability of death in the control group. The same bounds hold for any subpopulation, say males or females, if every term in (5) is conditioned on the appropriate class.

Applying these expressions to the female data from Table 2 gives the following bounds on $$P(\text{benefit}|\text{female})$$:

\begin{align} \max\{0, 0.279, 0.189, 0.09\} &\leqslant P(\text{benefit}|\text{female}) \leqslant \min\{0.489, 0.79, 0.279, 1\},\nonumber\\ 0.279 &\leqslant P(\text{benefit}|\text{female}) \leqslant 0.279. \tag{6} \end{align}

Similarly, for men we get:

\begin{align} \max\{0, 0.28, 0.49, -0.21\} &\leqslant P(\text{benefit}|\text{male}) \leqslant \min\{0.49, 0.79, 0.58, 0.7\},\nonumber\\ 0.49 &\leqslant P(\text{benefit}|\text{male}) \leqslant 0.49. \tag{7} \end{align}

Thus, the bounds for both females and males, in (6) and (7), collapse to point estimates:

\begin{align*} P(\text{benefit}|\text{female}) &= 0.279,\\ P(\text{benefit}|\text{male}) &= 0.49. \end{align*}
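The point estimates above can be reproduced mechanically. The sketch below implements the Tian-Pearl bounds (5) as a small Python function (the function name `pns_bounds` and its argument names are ours, chosen for readability, not from the paper):

```python
# Tian-Pearl bounds (5) on P(benefit), computed from the five inputs
# P(y_t), P(y_c) (RCT) and P(y|t), P(y|c), P(t) (observational study).
def pns_bounds(p_yt, p_yc, p_y_t, p_y_c, p_t):
    p_c = 1.0 - p_t
    p_ty = p_t * p_y_t            # P(t, y)
    p_cy = p_c * p_y_c            # P(c, y)
    p_ty_ = p_t * (1.0 - p_y_t)   # P(t, y')
    p_cy_ = p_c * (1.0 - p_y_c)   # P(c, y')
    p_y = p_ty + p_cy             # observational P(y)
    lower = max(0.0, p_yt - p_yc, p_y - p_yc, p_yt - p_y)
    upper = min(p_yt, 1.0 - p_yc, p_ty + p_cy_, p_yt - p_yc + p_ty_ + p_cy)
    return lower, upper

female = pns_bounds(0.489, 0.21, 0.27, 0.70, 0.7)   # ≈ (0.279, 0.279)
male = pns_bounds(0.49, 0.21, 0.70, 0.70, 0.7)      # ≈ (0.49, 0.49)
```

Both calls collapse to the point estimates quoted above, reproducing $$P(\text{benefit}|\text{female}) = 0.279$$ and $$P(\text{benefit}|\text{male}) = 0.49$$.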

We are not always so fortunate as to have a complete set of observational and experimental data at our disposal. When some of the data are absent, we simply discard the arguments to $$\max$$ or $$\min$$ in (5) that depend on them. For example, if we lack all experimental data, the only applicable lower bound in (5) is $$0$$ and the only applicable upper bound is $$P(t, y) + P(c, y')$$:

$$0 \leqslant P(\text{benefit}) \leqslant P(t, y) + P(c, y'). \tag{8}$$

Applying these bounds, based on observational data alone, to females and males yields:

\begin{align*} 0 &\leqslant P(\text{benefit}|\text{female}) \leqslant 0.279,\\ 0 &\leqslant P(\text{benefit}|\text{male}) \leqslant 0.58. \end{align*}

Naturally, these are far looser than the point estimates obtained when experimental and observational data are both fully available. Let us similarly examine what can be computed from purely experimental data. Without observational data, only the first two arguments of the $$\max$$ in the lower bound and of the $$\min$$ in the upper bound of $$P(\text{benefit})$$ in (5) are applicable:

$$\max\{0,P(y_t)-P(y_c)\} \leqslant P(\text{benefit}) \leqslant \min\{P(y_t),P(y'_c)\}. \tag{9}$$

Applying these bounds, based on experimental data alone, to females and males yields:

\begin{align*} 0.279 &\leqslant P(\text{benefit}|\text{female}) \leqslant 0.489,\\ 0.28 &\leqslant P(\text{benefit}|\text{male}) \leqslant 0.49. \end{align*}
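The two partial-data bounds (8) and (9), and their intersection, can be computed the same way (a self-contained sketch; helper names are ours):

```python
# Partial-data bounds on P(benefit) for the female and male data of
# Tables 2 and 3, and their intersection.

def obs_only(p_y_t, p_y_c, p_t):
    """Observational data only: 0 <= P(benefit) <= P(t,y) + P(c,y')."""
    return 0.0, p_t * p_y_t + (1.0 - p_t) * (1.0 - p_y_c)

def exp_only(p_yt, p_yc):
    """Experimental only: max{0, P(y_t)-P(y_c)} <= P(benefit) <= min{P(y_t), P(y'_c)}."""
    return max(0.0, p_yt - p_yc), min(p_yt, 1.0 - p_yc)

def intersect(a, b):
    """Intersection of two intervals (lower, upper)."""
    return max(a[0], b[0]), min(a[1], b[1])

female = intersect(obs_only(0.27, 0.70, 0.7), exp_only(0.489, 0.21))
male = intersect(obs_only(0.70, 0.70, 0.7), exp_only(0.49, 0.21))
# The female interval collapses to the point estimate (0.279, 0.279),
# while the male interval stays at (0.28, 0.49).
```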

Again, these are fairly loose bounds, especially when compared to the point estimates obtained with the combined data. Notice that the intersection of the female bounds using observational data, $$0 \leqslant P(\text{benefit}|\text{female}) \leqslant 0.279$$, and the female bounds using experimental data, $$0.279 \leqslant P(\text{benefit}|\text{female}) \leqslant 0.489$$, is the point estimate $$P(\text{benefit}|\text{female}) = 0.279$$; the more comprehensive Tian-Pearl bounds formula (5) wasn’t necessary. However, the intersection of the male bounds using observational data, $$0 \leqslant P(\text{benefit}|\text{male}) \leqslant 0.58$$, and the male bounds using experimental data, $$0.28 \leqslant P(\text{benefit}|\text{male}) \leqslant 0.49$$, does not narrow the bounds. For males, the comprehensive Tian-Pearl bounds in (5) were necessary to obtain narrow bounds (in this case, a point estimate).

Having seen this mechanism of combining observational and experimental data in (5) work so well, the reader may ask what lies behind it. The intuition comes from the fact that observational data incorporate individuals’ whims, and whimsy is a proxy for much deeper behavior. This leads to confounding, which is ordinarily problematic for causal inference and leads to spurious conclusions, sometimes completely reversing a treatment’s effect (Pearl, 2014); confounding then needs to be adjusted for. Here, however, confounding helps us: the whims and desires behind treatment choice serve as proxies for the underlying mechanisms that govern outcomes.

Finally, as noted in Section 3, knowing the relative sizes of the benefiting vs harmed subpopulations demands investment in finding mechanisms responsible for the differences as well as characterizations of those subpopulations. For example, women above a certain age may be affected differently by the drug, to be detected by how age affects the bounds on the individual response. Such characteristics can potentially be narrowed repeatedly until the drug’s efficacy can be predicted for an individual with certainty or the underlying mechanisms of the drug can be fully understood.

None of this was possible with only the RCT. Yet, remarkably, an observational study, however sloppy and uncontrolled, provides a deeper perspective on a treatment’s effectiveness. It incorporates individuals’ whims and desires that govern behavior under free-choice settings. And, since such whims and desires are often proxies for factors that also affect outcomes and treatments (i.e., confounders), we gain additional insight hidden by RCTs.

##### Footnotes
[2] $$P(y_t|\text{female})$$ was rounded up from $$48.9\%$$ to $$49\%$$. The $$0.001$$ difference between $$P(y_t|\text{female})$$ and $$P(y_t|\text{male})$$ wasn’t necessary, but was constructed to allow for clean point estimates.
[3] To simplify matters, we treat each experimental study as an ideal RCT, with $$100\%$$ compliance and none of the selection or other biases that often plague RCTs.
[4] Tian and Pearl (2000) called $$P(\text{benefit})$$ the “Probability of Necessity and Sufficiency” (PNS). The relationship between PNS and ICE (1) is explicated in the Annotated Bibliography below.
[5] Consistency is a property imposed at the individual level, often written as $$Y = X \cdot Y(1) + (1-X) \cdot Y(0)$$ for binary $$X$$ and $$Y$$. Rubin (1974) considered consistency an assumption within SUTVA, which defines the potential outcome (PO) framework. Pearl (2010) considered consistency a theorem of Structural Equation Models.

## Annotated Bibliography for Related Works

The following is a list of papers that analyze probabilities of causation and lead to the results reported above.

• Chapter 9 of Causality (Pearl, 2009) derives bounds on individual-level probabilities of causation and discusses their ramifications in legal settings. It also demonstrates how the bounds collapse to point estimates under certain combinations of observational and experimental data.
• (Tian and Pearl, 2000) develops bounds on individual-level causation by combining data from experimental and observational studies. This includes the Probability of Sufficiency (PS), Probability of Necessity (PN), and Probability of Necessity and Sufficiency (PNS). PNS is equivalent to $$P(\text{benefit})$$ above. $$\text{PNS}(u) = P(\text{benefit}|u)$$, the probability that individual $$U=u$$ survives if treated and does not survive if not treated, is related to $$\text{ICE}(u)$$ (1) via the equation: $$\text{PNS}(u) = P(\text{ICE}(u') > 0 \mid C(u') = C(u)). \tag{10}$$

In words, $$\text{PNS}(u)$$ equals the proportion of units $$u'$$ sharing the characteristics of $$u$$ that would positively benefit from the treatment. The reason is as follows. Recall that (for binary variables) $$\text{ICE}(u)$$ is $$1$$ when the individual benefits from the treatment, $$0$$ when the individual responds the same to either treatment, and $$-1$$ when the individual is harmed by the treatment. Thus, for any given population, $$\text{PNS} = P(\text{ICE}(u) > 0)$$. Focusing on the sub-population of individuals $$u'$$ that share the characteristics of $$u$$, $$C(u') = C(u)$$, we obtain (10). In words, $$\text{PNS}(u)$$ is the fraction of indistinguishable individuals that would benefit from treatment. Note that whereas (2) can be estimated by controlled experiments over the population $$C(u')=C(u)$$, (10) is defined counterfactually and hence cannot be estimated solely by such experiments; it requires the additional ingredients described above.

• (Mueller and Pearl, 2020) provides an interactive visualization of individual level causation, allowing readers to observe the dynamics of the bounds as one changes the available data.
• (Li and Pearl, 2019) optimizes societal benefit of selecting a unit $$u$$, when provided costs associated with the four different types of individuals, benefiting, harmed, always surviving, and doomed.
• (Mueller et al., 2021) takes into account the causal graph to obtain narrower bounds on PNS. The hypothetical study in this article was able to calculate point estimates of PNS, but often the best we can get are bounds.
• (Pearl, 2015) demonstrates how combining observational and experimental data can be informative for determining Causes of Effects, namely, assessing the probability PN that one event was a necessary cause of an observed outcome.
• (Dawid and Musio, 2022) analyze Causes of Effects (CoE), defined by PN, the probability that a given intervention is a necessary cause for an observed outcome. Dawid and Musio further analyze whether bounds on PN can be narrowed with data on mediators.
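The relation $$\text{PNS} = P(\text{ICE}(u) > 0)$$ invoked above can be illustrated by enumerating the four binary response types; the fractions in this sketch (ours, for illustration) use the male decomposition implied by Section 4: 49% benefit, 21% harmed, no always-survivors, 30% doomed.

```python
# Illustrating PNS = P(ICE > 0) by enumerating the four binary response types.
population = [
    # (fraction, Y(1), Y(0))  ->  ICE = Y(1) - Y(0)
    (0.49, 1, 0),   # benefit:        ICE = +1
    (0.21, 0, 1),   # harmed:         ICE = -1
    (0.00, 1, 1),   # always-survive: ICE =  0
    (0.30, 0, 0),   # doomed:         ICE =  0
]

pns = sum(f for f, y1, y0 in population if y1 - y0 > 0)   # 0.49
cace = sum(f * (y1 - y0) for f, y1, y0 in population)     # 0.49 - 0.21 = 0.28
# PNS recovers the 49% benefit fraction, while the experimentally
# identifiable CACE reveals only the 28% average effect.
```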

## Conclusion

One of the least disputed mantras of causal inference is that we cannot access individual causal effects: we can observe an individual’s response to treatment or to no-treatment, but never both. However, our theoretical results show that we can obtain bounds on individual causal effects, which can sometimes be quite narrow and allow us to make accurate personalized decisions. We therefore project that these theoretical results are key to next-generation personalized decision making.

## References

1. Dawid, A. P., & Musio, M. (2022). Effects of causes and causes of effects. Annual Review of Statistics and its Application. https://arxiv.org/pdf/2104.00119.pdf
2. Li, A., & Pearl, J. (2019). Unit selection based on counterfactual logic. Proceedings of the 28th International Joint Conference on Artificial Intelligence, 1793-1799.
3. Mueller, S., Li, A., & Pearl, J. (2021). Causes of effects: Learning individual responses from population data. http://ftp.cs.ucla.edu/pub/statser/r505.pdf
4. Mueller, S., & Pearl, J. (2020). Which Patients are in Greater Need: A counterfactual analysis with reflections on COVID-19 [https://ucla.in/39Ey8sU+].
5. Pearl, J. (2009). Causality (2nd ed.). Cambridge University Press.
6. Pearl, J. (2010). On the consistency rule in causal inference: An axiom, definition, assumption, or a theorem? Epidemiology, 21(6), 872-875. https://ftp.cs.ucla.edu/pub/stat_ser/r358-reprint.pdf
7. Pearl, J. (2014). Understanding Simpson’s paradox. The American Statistician, 68(1), 8-13. http://ftp.cs.ucla.edu/pub/stat_ser/r414-reprint.pdf
8. Pearl, J. (2015). Causes of effects and effects of causes. Journal of Sociological Methods and Research, 44(1), 149-164. http://ftp.cs.ucla.edu/pub/stat_ser/r431-reprint.pdf
9. Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688-701. https://doi.org/10.1037/h0037350
10. Tian, J., & Pearl, J. (2000). Probabilities of causation: Bounds and identification. Annals of Mathematics and Artificial Intelligence, 28(1-4), 287-313. http://ftp.cs.ucla.edu/pub/stat_ser/r271-A.pdf

## July 7, 2020

### Data versus Science: Contesting the Soul of Data-Science

Filed under: Book (J Pearl), Counterfactual, Data Fusion — judea @ 1:02 pm

Summary
The post below is written for the upcoming Spanish translation of The Book of Why, which was announced today. It expresses my firm belief that the current data-fitting direction taken by “Data Science” is temporary (read my lips!), that the future of “Data Science” lies in causal data interpretation and that we should prepare ourselves for the backlash swing.

Data versus Science: Contesting the Soul of Data-Science
Much has been said about how ill-prepared our health-care system was in coping with catastrophic outbreaks like COVID-19. Yet, viewed from the corner of my expertise, the ill-preparedness can also be seen as a failure of information technology to keep track of and interpret the outpour of data that has arrived from multiple and conflicting sources, corrupted by noise and omission, some by sloppy collection and some by deliberate misreporting. AI could and should have equipped society with intelligent data-fusion technology to interpret such conflicting pieces of information and reason its way out of the confusion.

Speaking from the perspective of causal inference research, I have been part of a team that has developed a complete theoretical underpinning for such “data-fusion” problems; a development that is briefly described in Chapter 10 of The Book of Why. A system based on data-fusion principles should be able to attribute disparities between Italy and China to differences in political leadership, reliability of tests, and honesty in reporting, adjust for such differences, and automatically infer behavior in countries like Spain or the US. AI is in a position to add such data-interpreting capabilities on top of the data-fitting technologies currently in use and, recognizing that data are noisy, filter the noise and outsmart the noise makers.

“Data fitting” is the name I frequently use to characterize the data-centric thinking that dominates both statistics and machine learning cultures, in contrast to the “data-interpretation” thinking that guides causal inference. The data-fitting school is driven by the faith that the secret to rational decisions lies in the data itself, if only we are sufficiently clever at data mining. In contrast, the data-interpreting school views data, not as a sole object of inquiry but as an auxiliary means for interpreting reality, and “reality” stands for the processes that generate the data.

I am not alone in this assessment. Leading researchers in the “Data Science” enterprise have come to realize that machine learning as it is currently practiced cannot yield the kind of understanding that intelligent decision making requires. However, what many fail to realize is that the transition from data-fitting to data-understanding involves more than a technology transfer; it entails a profound paradigm shift that is traumatic if not impossible. Researchers whose entire productive careers have committed them to the supposition that all knowledge comes from the data cannot easily transfer allegiance to a totally alien paradigm, according to which extra-data information is needed in the form of man-made causal models of reality. Current machine learning thinking, which some describe as “statistics on steroids,” is deeply entrenched in this self-propelled ideology.

Ten years from now, historians will be asking: How could scientific leaders of the time allow society to invest almost all its educational and financial resources in data-fitting technologies and so little in data-interpretation science? The Book of Why attempts to answer this question by drawing parallels to historically similar situations where ideological impediments held back scientific progress. But the true answer, and the magnitude of its ramifications, will only be unravelled by in-depth archival studies of the social, psychological, and economic forces that are currently governing our scientific institutions.

A related, yet perhaps more critical, topic that came up in handling the COVID-19 pandemic is the issue of personalized care. Many of the current health-care methods and procedures are guided by population data, obtained from controlled experiments or observational studies. However, the task of going from these data to the level of individual behavior requires counterfactual logic, which has been formalized and algorithmized in the past two decades (as narrated in Chapter 8 of The Book of Why), and is still a mystery to most machine learning researchers.

The immediate area where this development could have assisted in the COVID-19 predicament concerns the question of prioritizing patients in “greatest need” of treatment, testing, or other scarce resources. “Need” is a counterfactual notion (i.e., patients who would have gotten worse had they not been treated) and cannot be captured by statistical methods alone. A recently posted blog page, https://ucla.in/39Ey8sU, demonstrates in vivid colors how counterfactual analysis handles this prioritization problem.
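The counterfactual nature of “need” can be made concrete. Below is a minimal sketch (the function name and all numbers are ours, purely illustrative) of the Tian–Pearl bounds on the probability of benefit, i.e., the probability that a patient would recover if treated and fail to recover if untreated, obtained by combining experimental and observational data as the abstract above advocates:

```python
def pns_bounds(p_y_do_t, p_y_do_c, p_ty, p_ty_not, p_cy, p_cy_not):
    """Tian-Pearl bounds on the probability of benefit,
    PNS = P(y_t, y'_c): recovery under treatment AND no recovery without it.

    p_y_do_t, p_y_do_c : experimental P(y|do(treat)), P(y|do(control))
    p_ty, p_ty_not     : observational P(treated, y), P(treated, y')
    p_cy, p_cy_not     : observational P(untreated, y), P(untreated, y')
    """
    p_y = p_ty + p_cy  # observational P(y)
    lower = max(0.0,
                p_y_do_t - p_y_do_c,  # experimental-only lower bound
                p_y - p_y_do_c,       # tightened by observational data
                p_y_do_t - p_y)
    upper = min(p_y_do_t,             # experimental-only upper bounds
                1.0 - p_y_do_c,       # = P(y'|do(control))
                p_ty + p_cy_not,      # tightened by observational data
                p_y_do_t - p_y_do_c + p_ty_not + p_cy)
    return lower, upper
```

For instance, with experimental recovery rates P(y|do(treat)) = 0.49 and P(y|do(control)) = 0.21, the experimental data alone bound the probability of benefit by [0.28, 0.49]; adding the observational cells P(t, y) = 0.27, P(t, y′) = 0.03, P(c, y) = 0.07, P(c, y′) = 0.63 narrows the upper bound to 0.38. Patients (or subgroups) can then be prioritized by these bounds rather than by the average causal effect alone.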

The entire enterprise known as “personalized medicine” and, more generally, any enterprise requiring inference from populations to individuals, rests on counterfactual analysis, and AI now holds the key theoretical tools for operationalizing this analysis.

People ask me why these capabilities are not part of the standard tool sets available for handling health-care management. The answer lies again in training and education. We have been rushing too eagerly to reap the low-hanging fruit of big data and data-fitting technologies, at the cost of neglecting data-interpretation technologies. Data-fitting is addictive, and building more “data-science centers” only intensifies the addiction. Society is waiting for visionary leadership to balance this over-indulgence by establishing research, educational and training centers dedicated to “causal science.”

I hope it happens soon, for we must be prepared for the next pandemic outbreak and the information confusion that will probably come in its wake.

## January 29, 2020

### On Imbens’s Comparison of Two Approaches to Empirical Economics

Filed under: Counterfactual,d-separation,DAGs,do-calculus,Imbens — judea @ 11:00 pm

Many readers have asked for my reaction to Guido Imbens’s recent paper, titled, “Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics,” arXiv.19071v1 [stat.ME] 16 Jul 2019.

The note below offers brief comments on Imbens’s five major claims regarding the superiority of potential outcomes [PO] vis a vis directed acyclic graphs [DAGs].

These five claims are articulated in Imbens’s introduction (pages 1-3). [Quoting]:

“… there are five features of the PO framework that may be behind its current popularity in economics.”

I will address them sequentially, first quoting Imbens’s claims, then offering my counterclaims.

I will end with a comment on Imbens’s final observation, concerning the absence of empirical evidence in a “realistic setting” to demonstrate the merits of the DAG approach.

Before we start, however, let me clarify that there is no such thing as a “DAG approach.” Researchers using DAGs follow an approach called Structural Causal Model (SCM), which consists of functional relationships among variables of interest, and of which DAGs are merely a qualitative abstraction, spelling out the arguments in each function. The resulting graph can then be used to support inference tools such as d-separation and do-calculus. Potential outcomes are relationships derived from the structural model, and several of their properties can be elucidated using DAGs. These interesting relationships are summarized in chapter 7 of (Pearl, 2009a) and in a Statistics Surveys overview (Pearl, 2009c).

Imbens’s Claim # 1
“First, there are some assumptions that are easily captured in the PO framework relative to the DAG approach, and these assumptions are critical in many identification strategies in economics. Such assumptions include
monotonicity ([Imbens and Angrist, 1994]) and other shape restrictions such as convexity or concavity ([Matzkin et al.,1991, Chetverikov, Santos, and Shaikh, 2018, Chen, Chernozhukov, Fernández-Val, Kostyshak, and Luo, 2018]). The instrumental variables setting is a prominent example, and I will discuss it in detail in Section 4.2.”

Pearl’s Counterclaim # 1
It is logically impossible for an assumption to be “easily captured in the PO framework” and not simultaneously be “easily captured” in the “DAG approach.” The reason is simply that the latter embraces the former and merely enriches it with graph-based tools. Specifically, SCM embraces the counterfactual notation Yx that PO deploys, and does not exclude any concept or relationship definable in the PO approach.

Take monotonicity, for example. In PO, monotonicity is expressed as

$$Y_x(u) \geq Y_{x'}(u) \quad \text{for all } u \text{ and all } x > x'$$

In the DAG approach it is expressed as:

$$Y_x(u) \geq Y_{x'}(u) \quad \text{for all } u \text{ and all } x > x'$$

(Taken from Causality pages 291, 294, 398.)

The two are identical, of course, which may seem surprising to PO folks, but not to DAG folks who know how to derive the counterfactuals Yx from structural models. In fact, the derivation of counterfactuals in terms of structural equations (Balke and Pearl, 1994) is considered one of the fundamental laws of causation in the SCM framework; see (Pearl, 2015).
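For readers unfamiliar with that derivation, here is a compact sketch (a paraphrase of the SCM definition, not a quote from either author). The first law of counterfactuals defines

```latex
% First Law of Counterfactuals: the counterfactual Y_x(u) is the
% solution for Y in the submodel M_x, obtained from the model M by
% replacing the equation that determines X with the constant X = x.
Y_x(u) \triangleq Y_{M_x}(u)
```

In particular, if the structural equation for Y is Y = f_Y(x, u), then monotonicity amounts to the requirement that f_Y be nondecreasing in x for every u, which is why the PO expression and its SCM derivation coincide term for term.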

Imbens’s Claim # 2
“Second, the potential outcomes in the PO framework connect easily to traditional approaches to economic models such as supply and demand settings where potential outcome functions are the natural primitives. Related to this, the insistence of the PO approach on manipulability of the causes, and its attendant distinction between non-causal attributes and causal variables has resonated well with the focus in empirical work on policy relevance ([Angrist and Pischke, 2008, Manski, 2013]).”

Pearl’s Counterclaim #2
Not so. The term “potential outcome” is a latecomer to the economics literature of the 20th century, whose native vocabulary and natural primitives were functional relationships among variables, not potential outcomes. The latter are defined in terms of a “treatment assignment” and hypothetical outcome, while the former invoke only observable variables like “supply” and “demand.” Don Rubin cited this fundamental difference as sufficient reason for shunning structural equation models, which he labeled “bad science.”

While it is possible to give a PO interpretation to structural equations, the interpretation is both artificial and convoluted, especially in view of PO’s insistence on manipulability of causes. Haavelmo, Koopmans and Marschak would not hesitate for a moment to write the structural equation:

Damage = f (earthquake intensity, other factors).

PO researchers, on the other hand, would spend weeks debating whether earthquakes have “treatment assignments” and whether we can legitimately estimate the “causal effects” of earthquakes. Thus, what Imbens perceives as a helpful distinction is, in fact, an unnecessary restriction that suppresses natural scientific discourse. See also (Pearl, 2018; 2019).

Imbens’s Claim #3
“Third, many of the currently popular identification strategies focus on models with relatively few (sets of) variables, where identification questions have been worked out once and for all.”

Pearl’s Counterclaim #3

First, I would argue that this claim is actually false. Most IV strategies that economists use are valid only “conditional on controls” (see examples listed in Imbens (2014)), and the criterion that distinguishes “good controls” from “bad controls” is not trivial to articulate without the help of graphs (see “A Crash Course in Good and Bad Control”). It can certainly not be discerned “once and for all.”

Second, even if economists are lucky enough to guess “good controls,” it is still unclear whether they focus on relatively few variables because, lacking graphs, they cannot handle more, or whether they refrain from using graphs to hide the opportunities missed by focusing on a few pre-fabricated, “once and for all” identification strategies.

I believe both apprehensions play a role in perpetuating the graph-avoiding subculture among economists. I have elaborated on this question here: (Pearl, 2014).

Imbens’s Claim # 4
“Fourth, the PO framework lends itself well to accounting for treatment effect heterogeneity in estimands ([Imbens and Angrist, 1994, Sekhon and Shem-Tov, 2017]) and incorporating such heterogeneity in estimation and the design of optimal policy functions ([Athey and Wager, 2017, Athey, Tibshirani, Wager, et al., 2019, Kitagawa and Tetenov, 2015]).”

Pearl’s Counterclaim #4
Indeed, in the early 1990s, economists felt ecstatic liberating themselves from the linear tradition of structural equation models and finding a framework (PO) that allowed them to model treatment effect heterogeneity.

However, whatever role treatment heterogeneity played in this excitement should have been amplified ten-fold in 1995, when completely nonparametric structural equation models came into being, in which non-linear interactions and heterogeneity were assumed a priori. Indeed, the tools developed in the econometric literature cover only a fraction of the treatment-heterogeneity tasks that are currently managed by SCM. In particular, the latter includes such problems as “necessary and sufficient” causation, mediation, external validity, selection bias and more.

Speaking more generally, I find it odd for a discipline to prefer an “approach” that rejects tools over one that invites and embraces tools.

Imbens’s claim #5
“Fifth, the PO approach has traditionally connected well with design, estimation, and inference questions. From the outset Rubin and his coauthors provided much guidance to researchers and policy makers for practical implementation including inference, with the work on the propensity score ([Rosenbaum and Rubin, 1983b]) an influential example.”

Pearl’s Counterclaim #5
The initial work of Rubin and his co-authors has indeed provided much needed guidance to researchers and policy makers who were in a state of desperation, having no other mathematical notation to express causal questions of interest. That happened because economists were not aware of the counterfactual content of structural equation models, and of the non-parametric extension of those models.

Unfortunately, the clumsy and opaque notation introduced in this initial work has become a ritual in the PO framework that has prevailed, and the refusal to commence the analysis with meaningful assumptions has led to several blunders and misconceptions. One such misconception has been propensity score analysis, which researchers have taken as a tool for reducing confounding bias. I have elaborated on this misguidance in Causality, Section 11.3.5, “Understanding Propensity Scores” (Pearl, 2009a).

Imbens’s final observation: Empirical Evidence
“Separate from the theoretical merits of the two approaches, another reason for the lack of adoption in economics is that the DAG literature has not shown much evidence of the benefits for empirical practice in settings that are important in economics. The potential outcome studies in MACE, and the chapters in [Rosenbaum, 2017], CISSB and MHE have detailed empirical examples of the various identification strategies proposed. In realistic settings they demonstrate the merits of the proposed methods and describe in detail the corresponding estimation and inference methods. In contrast in the DAG literature, TBOW, [Pearl, 2000], and [Peters, Janzing, and Schölkopf, 2017] have no substantive empirical examples, focusing largely on identification questions in what TBOW refers to as “toy” models. Compare the lack of impact of the DAG literature in economics with the recent embrace of regression discontinuity designs imported from the psychology literature, or with the current rapid spread of the machine learning methods from computer science, or the recent quick adoption of synthetic control methods [Abadie, Diamond, and Hainmueller, 2010]. All came with multiple concrete examples that highlighted their benefits over traditional methods. In the absence of such concrete examples the toy models in the DAG literature sometimes appear to be a set of solutions in search of problems, rather than a set of solutions for substantive problems previously posed in social sciences.”

Pearl’s comments on: Empirical Evidence
There is much truth to Imbens’s observation. The PO excitement that swept natural experimentalists in the 1990s came with outright rejection of graphical models. The hundreds, if not thousands, of empirical economists who plunged into empirical work were warned repeatedly that graphical models may be “ill-defined,” “deceptive,” and “confusing,” and that structural models have no scientific underpinning (see (Pearl, 1995; 2009b)). Not a single paper in the econometric literature has acknowledged the existence of SCM as an alternative or complementary approach to PO.

The result has been the exact opposite of what has taken place in epidemiology, where DAGs became a second language to both scholars and field workers. [Due in part to the influential 1999 paper by Greenland, Pearl and Robins.] In contrast, PO-led economists have launched a massive array of experimental programs lacking graphical tools for guidance. I would liken it to a Phoenician armada exploring the Atlantic coast in leaky boats, with no compass to guide its way.

This depiction might seem pretentious and overly critical, considering the pride natural experimentalists take in the results of their studies (though no objective verification of validity can be undertaken). Yet, looking back at the substantive empirical examples listed by Imbens, one cannot but wonder how much more credible those studies could have been with graphical tools to guide the way. These include a friendly language to communicate assumptions, powerful means to test their implications, and ample opportunities to uncover new natural experiments (Brito and Pearl, 2002).

Summary and Recommendation

The thrust of my reaction to Imbens’s article is simple:

It is unreasonable to prefer an “approach” that rejects tools over one that invites and embraces tools.

Technical comparisons of the PO and SCM approaches, using concrete examples, have been published since 1993 in dozens of articles and books in computer science, statistics, epidemiology, and social science, yet none in the econometric literature. Economics students are systematically deprived of even the most elementary graphical tools available to other researchers, for example, to determine if one variable is independent of another given a third, or if a variable is a valid IV given a set S of observed variables.

This avoidance can no longer be justified by appealing to “We have not found this [graphical] approach to aid the drawing of causal inferences” (Imbens and Rubin, 2015, page 25).

To open an effective dialogue and a genuine comparison between the two approaches, I call on Professor Imbens to assume leadership in his capacity as Editor in Chief of Econometrica and invite a comprehensive survey paper on graphical methods for the front page of his Journal. This is how creative editors move their fields forward.

Imbens, G. “Potential Outcome and Directed Acyclic Graph Approaches to Causality: Relevance for Empirical Practice in Economics,” arXiv.19071v1 [stat.ME] 16 Jul 2019.

Imbens, G. and Rubin, D. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge, MA: Cambridge University Press; 2015.

Imbens, Guido W. Instrumental Variables: An Econometrician’s Perspective. Statist. Sci. 29 (2014), no. 3, 323–358. doi:10.1214/14-STS480. https://projecteuclid.org/euclid.ss/1411437513

Pearl, J. “Causal inference in statistics: An overview”  Statistics Surveys, Vol. 3, 96–146, 2009c.

## August 2, 2017

### 2017 Mid-Summer Update

Filed under: Counterfactual,Discussion,Epidemiology — Judea Pearl @ 12:55 am

Dear friends in causality research,

Welcome to the 2017 Mid-summer greeting from the UCLA Causality Blog.

This greeting discusses the following topics:

1. “The Eight Pillars of Causal Wisdom” and the WCE 2017 Virtual Conference Website,
2. A discussion panel: “Advances in Deep Neural Networks”,
3. Comments on “The Tale Wagged by the DAG”,
4. A new book: “The Book of Why”,
5. A new paper: Disjunctive Counterfactuals,
6. The ASA Causality in Statistics Education Award,
7. News on “Causal Inference: A Primer”.

1. “The Eight Pillars of Causal Wisdom”

The tenth annual West Coast Experiments Conference was held at UCLA on April 24-25, 2017, preceded by a training workshop on April 23.

You will be pleased to know that the WCE 2017 Virtual Conference Website is now available here:
http://spp.ucr.edu/wce2017/
It provides videos of the talks as well as some of the papers and presentations.

The conference brought together scholars and graduate students in economics, political science and other social sciences who share an interest in causal analysis. Speakers included:

1. Angus Deaton, on understanding and misunderstanding randomized controlled trials.
2. Chris Auld, on the ongoing confusion between regression and structural equations in the econometric literature.
3. Clark Glymour, on explanatory research vs. confirmatory research.
4. Elias Bareinboim, on the solution to the external validity problem.
5. Adam Glynn, on front-door approaches to causal inference.
6. Karthika Mohan, on missing data from a causal modeling perspective.
7. Judea Pearl, on “The Eight Pillars of Causal Wisdom.”
8. Adnan Darwiche, on model-based vs. model-blind approaches to artificial intelligence.
9. Niall Cardin, on causal inference for machine learning.
10. Karim Chalak, on measurement error without exclusion.
11. Ed Leamer, on “Causality Complexities Example: Supply and Demand.”
12. Rosa Matzkin, on identification in simultaneous equations.
13. Rodrigo Pinto, on randomized biased-controlled trials.

The video of my lecture “The Eight Pillars of Causal Wisdom” can be watched here:
https://www.youtube.com/watch?v=8nHVUFqI0zk
A transcript of the talk can be found here:
http://spp.ucr.edu/wce2017/Papers/eight_pillars_of.pdf

2. “Advances in Deep Neural Networks”

As part of its celebration of 50 years of the Turing Award, the ACM has organized several discussion sessions on selected topics in computer science. I participated in a panel discussion on “Advances in Deep Neural Networks,” which gave me an opportunity to share thoughts on whether learning methods based solely on data fitting can ever achieve human-level intelligence. The discussion video can be viewed here:
https://www.youtube.com/watch?v=mFYM9j8bGtg
A position paper that defends these thoughts is available here:
web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf

3. The Tale Wagged by the DAG

An article by this title, authored by Nancy Krieger and George Davey Smith has appeared in the International Journal of Epidemiology, IJE 2016 45(6) 1787-1808.
https://academic.oup.com/ije/issue/45/6#250304-2617148
It is part of a special IJE issue on causal analysis which, for the reasons outlined below, should be of interest to readers of this blog.

As the title tell-tales us, the authors are unhappy with the direction that modern epidemiology has taken, a direction they see as too wedded to a two-language framework:
(1) Graphical models (DAGs) — to express what we know, and
(2) Counterfactuals (or potential outcomes) — to express what we wish to know.

The specific reasons for the authors’ unhappiness are still puzzling to me, because the article does not demonstrate concrete alternatives to current methodologies. I can only speculate, however, that it is the dazzling speed with which epidemiology has modernized its tools that lies behind the authors’ discomfort. If so, it would be safe for us to assume that the discomfort will subside as soon as researchers gain greater familiarity with the capabilities and flexibility of these new tools. I nevertheless recommend that the article, and the entire special issue of IJE, be studied by our readers, because they reflect an interesting soul-searching attempt by a forward-looking discipline to assess its progress in the wake of a profound paradigm shift.

Epidemiology, as I have written on several occasions, has been a pioneer in accepting the DAG-counterfactuals symbiosis as a ruling paradigm — way ahead of mainstream statistics and its other satellites. (The social sciences, for example, are almost there, with the exception of the model-blind branch of econometrics. See Feb. 22 2017 posting)

In examining the specific limitations that Krieger and Davey Smith perceive in DAGs, readers will be amused to note that these limitations coincide precisely with the strengths for which DAGs are praised.

For example, the article complains that DAGs provide no information about variables that investigators chose not to include in the model. In their words: “the DAG does not provide a comprehensive picture. For example, it does not include paternal factors, ethnicity, respiratory infections or socioeconomic position…” (taken from the Editorial introduction). I have never considered this to be a limitation of DAGs or of any other scientific modelling. Quite the contrary. It would be a disaster if models were permitted to provide information unintended by the modeller. Instead, I have learned to admire the ease with which DAGs enable researchers to incorporate knowledge about new variables, or new mechanisms, which the modeller wishes to embrace.

Model misspecification, after all, is a problem that plagues every exercise in causal inference, no matter what framework one chooses to adopt. It can only be cured by careful model-building strategies, and by enhancing the modeller’s knowledge. Yet, when it comes to minimizing misspecification errors, DAGs have no match. The transparency with which DAGs display the causal assumptions in the model, and the ease with which the DAG identifies the testable implications of those assumptions, are incomparable; these facilitate speedy model diagnosis and repair.

Or, to take another example, the authors call repeatedly for an ostensibly unavailable methodology which they label “causal triangulation” (it appears 19 times in the article). In their words: “In our field, involving dynamic populations of people in dynamic societies and ecosystems, methodical triangulation of diverse types of evidence from diverse types of study settings and involving diverse populations is essential.” Ironically, however, the task of treating “diverse types of evidence from diverse populations” has been accomplished quite successfully within the DAG-counterfactual framework. See, for example, the formal and complete results of (Bareinboim and Pearl, 2016, http://ftp.cs.ucla.edu/pub/stat_ser/r450-reprint.pdf), which emerged from the DAG-based perspective and invoke the do-calculus. It is inconceivable to me that anyone could pool data from two different designs (say, experimental and observational) without resorting to DAGs or (equivalently) potential outcomes; if it can be done, I am open to learn.

Another conceptual paradigm which the authors hope would liberate us from the tyranny of DAGs and counterfactuals is Lipton’s (2004) romantic aspiration for “Inference to the Best Explanation.” It is a compelling, century-old mantra, going back at least to Charles Peirce’s theory of abduction (Pragmatism and Pragmaticism, 1870) which, unfortunately, has never operationalized its key terms: “explanation,” “best” and “inference to.” Again, I know of only one framework in which this aspiration has been explicated with sufficient precision to produce tangible results — it is the structural framework of DAGs and counterfactuals. See, for example, “Causes of Effects and Effects of Causes”
http://ftp.cs.ucla.edu/pub/stat_ser/r431-reprint.pdf
and Halpern and Pearl (2005) “Causes and explanations: A structural-model approach”
http://ftp.cs.ucla.edu/pub/stat_ser/r266-part1.pdf

In summary, what Krieger and Davey Smith aspire to achieve by abandoning the structural framework has already been accomplished with the help and grace of that very framework.
More generally, what we learn from these examples is that the DAG-counterfactual symbiosis is far from being a narrow “ONE approach to causal inference” which “may potentially lead to spurious causal inference” (their words). It is in fact a broad and flexible framework within which a plurality of tasks and aspirations can be formulated, analyzed and implemented. The quest for metaphysical alternatives is not warranted.

I was pleased to note that, by and large, commentators on Krieger and Davey Smith’s paper seemed to be aware of the powers and generality of the DAG-counterfactual framework, albeit not exactly for the reasons that I have described here. [Footnote: I have many disagreements with the other commentators as well, but I wish to focus here on “The Tale Wagged by the DAG,” where the problems appear more glaring.] My talk on “The Eight Pillars of Causal Wisdom” provides a concise summary of those reasons and explains why I take the poetic liberty of calling these pillars “The Causal Revolution”:
http://spp.ucr.edu/wce2017/Papers/eight_pillars_of.pdf

All in all, I believe that epidemiologists should be commended for the incredible progress they have made in the past two decades. They will no doubt continue to develop and benefit from the new tools that the DAG-counterfactual symbiosis has spawned. At the same time, I hope that the discomfort that Krieger and Davey Smith have expressed will be temporary, and that it will inspire a greater understanding of the modern tools of causal inference.

Comments on this special issue of IJE are invited on this blog.

4. The Book of WHY

As some of you know, I am co-authoring another book, titled “The Book of Why: The New Science of Cause and Effect.” It will attempt to present the eight pillars of causal wisdom to the general public, using words, intuition and examples to replace equations. My co-author is science writer Dana MacKenzie (danamackenzie.com) and our publishing house is Basic Books. If all goes well, the book will be on your shelves by March 2018. Selected sections will appear periodically on this blog.

5. Disjunctive Counterfactuals

The structural interpretation of counterfactuals, as formulated in Balke and Pearl (1994), excludes disjunctive conditionals, such as “had X been x1 or x2,” as well as disjunctive actions such as do(X = x1 or X = x2). In contrast, the closest-world interpretation of Lewis (1973) assigns truth values to all counterfactual sentences, regardless of the logical form of the antecedent. The next issue of the Journal of Causal Inference will include a paper that extends the vocabulary of structural counterfactuals with disjunctions, and clarifies the assumptions needed for the extension. An advance copy can be viewed here:
http://ftp.cs.ucla.edu/pub/stat_ser/r459.pdf

6.  ASA Causality in Statistics Education Award

Congratulations go to Ilya Shpitser, Professor of Computer Science at Johns Hopkins University, who is the 2017 recipient of the ASA Causality in Statistics Education Award. Funded by Microsoft Research and Google, the \$5,000 Award will be presented to Shpitser at the 2017 Joint Statistical Meetings (JSM 2017) in Baltimore.

Professor Shpitser has developed Masters level graduate course material that takes causal inference from the ivory towers of research to the level of students with a machine learning and data science background. It combines techniques of graphical and counterfactual models and provides both an accessible coverage of the field and excellent conceptual, computational and project-oriented exercises for students.

These winning materials and those of the previous Causality in Statistics Education Award winners are available to download online at http://www.amstat.org/education/causalityprize/

Information concerning nominations, criteria and previous winners can be viewed here:
http://www.amstat.org/ASA/Your-Career/Awards/Causality-in-Statistics-Education-Award.aspx
and here:
http://magazine.amstat.org/blog/2012/11/01/pearl/

7. News on “Causal Inference: A Primer”

Wiley, the publisher of our latest book “Causal Inference in Statistics: A Primer” (2016, Pearl, Glymour and Jewell), informs us that the book is now in its 4th printing, corrected for all the errors we (and others) caught since the first publication. To buy a corrected copy, make sure you get the 4th printing. The trick is to look at the copyright page and make sure the last line reads: 10 9 8 7 6 5 4

If you already have a copy, look up our errata page,
http://web.cs.ucla.edu/~kaoru/BIB5/pearl-etal-2016-primer-errata-pages-may2017.pdf
where all corrections are marked in red. The publisher also tells us that the Kindle version is much improved. I hope you concur.

Happy Summer-end, and may all your causes
produce healthy effects.
Judea

## July 9, 2016

### The Three Layer Causal Hierarchy

Filed under: Causal Effect,Counterfactual,Discussion,structural equations — bryantc @ 8:57 pm

Recent discussions concerning causal mediation gave me the impression that many researchers in the field are not familiar with the ramifications of the Causal Hierarchy, as articulated in Chapter 1 of Causality (2000, 2009). This note presents the Causal Hierarchy in table form (Fig. 1) and discusses the distinctions between its three layers: 1. Association, 2. Intervention, 3. Counterfactuals.

Judea

## June 28, 2016

### On the Classification and Subsumption of Causal Models

Filed under: Causal Effect,Counterfactual,structural equations — bryantc @ 5:32 pm

From Christos Dimitrakakis:

>> To be honest, there is such a plethora of causal models, that it is not entirely clear what subsumes what, and which one is equivalent to what. Is there a simple taxonomy somewhere? I thought that influence diagrams were sufficient for all causal questions, for example, but one of Pearl’s papers asserts that this is not the case.

Reply from J. Pearl:

Dear Christos,

From my perspective, I do not see a plethora of causal models at all, so it is hard for me to answer your question in specific terms. What I do see is a symbiosis of all causal models in one framework, called Structural Causal Model (SCM) which unifies structural equations, potential outcomes, and graphical models. So, for me, the world appears simple, well organized, and smiling. Perhaps you can tell us what models lured your attention and caused you to see a plethora of models lacking subsumption taxonomy.

The taxonomy that has helped me immensely is the three-level hierarchy described in chapter 1 of my book Causality: 1. association, 2. intervention, and 3. counterfactuals. It is a useful hierarchy because it has an objective criterion for the classification: you cannot answer questions at level i unless you have assumptions from level i or higher.

As to influence diagrams, the relations between them and SCM are discussed in Section 11.6 of my book Causality (2009). Influence diagrams belong to the 2nd layer of the causal hierarchy, together with Causal Bayesian Networks. They lack, however, two facilities:

1. The ability to process counterfactuals.
2. The ability to handle novel actions.

To elaborate,

1. Counterfactual sentences (e.g., “Given what I see, I should have acted differently”) require functional models. Influence diagrams are built on conditional and interventional probabilities, that is, p(y|x) or p(y|do(x)). There is no interpretation of E(Y_x | x’) in this framework.

2. The probabilities that annotate links emanating from Action Nodes are of the interventional type, p(y|do(x)), and must be assessed judgmentally by the user. No facility is provided for deriving these probabilities from data together with the structure of the graph. Such a derivation is developed in chapter 3 of Causality, in the context of Causal Bayesian Networks, where every node can turn into an action node.
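To make the first missing facility concrete, here is a minimal sketch (the model, a single confounder U with X := U and Y := X XOR U, is ours and purely illustrative) of the abduction-action-prediction steps that give E(Y_x | x’) its meaning in a functional model:

```python
# Toy SCM: exogenous U ~ Bernoulli(0.5); structural equations X := U, Y := X XOR U.
P_U = {0: 0.5, 1: 0.5}
f_X = lambda u: u
f_Y = lambda x, u: x ^ u

def counterfactual_E_Yx_given_evidence(x_do, x_obs):
    """E[Y_x | X = x_obs], computed by abduction-action-prediction."""
    # Abduction: posterior over U given the evidence X = x_obs
    weights = {u: p for u, p in P_U.items() if f_X(u) == x_obs}
    total = sum(weights.values())
    posterior = {u: p / total for u, p in weights.items()}
    # Action: replace the equation for X with X := x_do;
    # Prediction: average Y over the posterior in the modified model
    return sum(p * f_Y(x_do, u) for u, p in posterior.items())

def interventional_E_Y(x_do):
    """E[Y | do(X = x_do)]: no conditioning on pre-action evidence."""
    return sum(p * f_Y(x_do, u) for u, p in P_U.items())
```

In this toy model E(Y_{X=1} | X = 0) = 1, while E(Y | do(X = 1)) = 0.5: the evidence X = 0 reveals U = 0, and the counterfactual then depends on the functions themselves, not merely on the interventional probabilities that annotate an influence diagram.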

Using the causal hierarchy, the 1st Law of Counterfactuals, and the unification provided by SCM, the space of causal models should shine in clarity and simplicity. Try it, and let us know of any questions remaining.

Judea

## June 20, 2016

### Recollections from the WCE conference at Stanford

Filed under: Counterfactual,General,Mediated Effects,structural equations — bryantc @ 7:45 am

On May 21, Kosuke Imai and I participated in a panel on Mediation at the annual meeting of the West Coast Experiments Conference, organized by Stanford Graduate School of Business http://www.gsb.stanford.edu/facseminars/conferences/west-coast-experiments-conference. The following are some of my recollections from that panel.

1.
We began the discussion by reviewing causal mediation analysis and summarizing the exchange we had on the pages of Psychological Methods (2014)
http://ftp.cs.ucla.edu/pub/stat_ser/r389-imai-etal-commentary-r421-reprint.pdf

My slides for the panel can be viewed here:
http://web.cs.ucla.edu/~kaoru/stanford-may2016-bw.pdf

We ended with a consensus regarding the importance of causal mediation and the conditions for identifying Natural Direct and Indirect Effects, from randomized as well as observational studies.

2.
We proceeded to discuss the symbiosis between the structural and the counterfactual languages. Here I focused on slides 4-6 (page 3), and remarked that only those who are willing to solve a toy problem from beginning to end, using both potential outcomes and DAGs, can understand the tradeoff between the two. Such a toy problem (and its solution) was presented in slide 5 (page 3), titled “Formulating a problem in Three Languages,” and the questions that I asked the audience are still ringing in my ears. Please have a good look at these two sets of assumptions and ask yourself:

a. Have we forgotten any assumption?
b. Are these assumptions consistent?
c. Is any of the assumptions redundant (i.e. does it follow logically from the others)?
d. Do they have testable implications?
e. Do these assumptions permit the identification of causal effects?
f. Are these assumptions plausible in the context of the scenario given?

As I was discussing these questions over slide 5, the audience seemed to be in general agreement with the conclusion that, despite their logical equivalence, the graphical language enables us to answer these questions immediately while the potential outcome language remains silent on all of them.

I consider this example to be pivotal to the comparison of the two frameworks. I hope that questions a,b,c,d,e,f will be remembered, and that speakers from both camps will be asked to address them squarely and explicitly.

The fact that graduate students made up the majority of the participants gives me the hope that questions a,b,c,d,e,f will finally receive the attention they deserve.

3.
As we discussed the virtues of graphs, I found it necessary to reiterate the observation that DAGs are more than just a “natural and convenient way to express assumptions about causal structures” (Imbens and Rubin, 2013, p. 25). Praising their transparency while ignoring their inferential power misses the main role that graphs play in causal analysis. The power of graphs lies in computing complex implications of causal assumptions (i.e., the “science”) no matter in what language they are expressed. Typical implications are: conditional independencies among variables and counterfactuals, what covariates need be controlled to remove confounding or selection bias, whether effects can be identified, and more. These implications could, in principle, be derived from any equivalent representation of the causal assumptions, not necessarily graphical, but not before incurring a prohibitive computational cost. See, for example, what happens when economists try to replace d-separation with graphoid axioms http://ftp.cs.ucla.edu/pub/stat_ser/r420.pdf.

4.
Following the discussion of representations, we addressed questions posed to us by the audience, in particular, five questions submitted by Professor Jon Krosnick (Political Science, Stanford).

I summarize them in the following slide:

Krosnick’s Questions to Panel
———————————————-
1) Do you think an experiment has any value without mediational analysis?
2) Is a separate study directly manipulating the mediator useful? How is the second study any different from the first one?
3) Imai’s correlated residuals test seems valuable for distinguishing fake from genuine mediation. Is that so? And how is it related to the traditional mediational test?
4) Why isn’t it easy to test whether participants who show the largest increases in the posited mediator show the largest changes in the outcome?
5) Why is mediational analysis any “worse” than any other method of investigation?
———————————————-
My answers focused on questions 2, 4, and 5, which I summarize below:

2)
Q. Is a separate study directly manipulating the mediator useful?
Answer: Yes, it is useful if physically feasible but, still, it cannot give us an answer to the basic mediation question: “What percentage of the observed response is due to mediation?” The concept of mediation is necessarily counterfactual, i.e., sitting on the top layer of the causal hierarchy (see Causality, chapter 1). It cannot, therefore, be defined in terms of population experiments, however clever. Mediation can be evaluated with the help of counterfactual assumptions such as “conditional ignorability” or “no interaction,” but these assumptions cannot be verified in population experiments.

4)
Q. Why isn’t it easy to test whether participants who show the largest increases in the posited mediator show the largest changes in the outcome?
Answer: Translating the question into counterfactual notation, the test suggested requires the existence of a monotonic function f_m such that, for every individual, we have Y_1 − Y_0 = f_m(M_1 − M_0).

This condition expresses a feature we expect to find in mediation, but it cannot be taken as a DEFINITION of mediation. This condition is essentially the way indirect effects are defined in the Principal Strata framework (Frangakis and Rubin, 2002), the deficiencies of which are well known. See http://ftp.cs.ucla.edu/pub/stat_ser/r382.pdf.

In particular, imagine a switch S controlling two light bulbs L1 and L2. Positive correlation between L1 and L2 does not mean that L1 mediates between the switch and L2. Many examples of incompatibility are demonstrated in the paper above.
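
The switch-and-bulbs example is easy to simulate (a sketch of my own, not from the post): the switch S drives both bulbs, so L1 and L2 agree perfectly in observational data, yet forcing L1 on or off leaves L2 unchanged:

```python
import random

random.seed(0)

def sample(do_l1=None):
    """One draw from the model S -> L1, S -> L2.
    do_l1 forces bulb L1 on/off without touching the switch."""
    s = random.randint(0, 1)           # switch position
    l1 = s if do_l1 is None else do_l1
    l2 = s                             # L2 responds to S only
    return l1, l2

n = 100_000
obs = [sample() for _ in range(n)]
agree = sum(l1 == l2 for l1, l2 in obs) / n
print(f"P(L1 == L2) observed: {agree:.3f}")   # 1.000: perfect correlation

# Yet intervening on L1 does nothing to L2:
p_l2_do1 = sum(l2 for _, l2 in (sample(do_l1=1) for _ in range(n))) / n
p_l2_do0 = sum(l2 for _, l2 in (sample(do_l1=0) for _ in range(n))) / n
print(f"P(L2=1 | do(L1=1)) = {p_l2_do1:.3f}, P(L2=1 | do(L1=0)) = {p_l2_do0:.3f}")
```

Both interventional probabilities come out near 0.5, confirming that the perfect observational agreement between the bulbs carries no mediation whatsoever.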

The conventional mediation tests (in the Baron and Kenny tradition) suffer from the same problem; they test features of mediation that are common in linear systems, but not the essence of mediation, which is universal to all systems: linear and nonlinear, with continuous as well as categorical variables.

5)
Q. Why is mediational analysis any “worse” than any other method of investigation?
Answer: The answer is closely related to the one given to question 2). Mediation is not a “method” but a property of the population which is defined counterfactually, and it therefore requires counterfactual assumptions for evaluation. Experiments are not sufficient; and in this sense mediation is “worse” than other properties under investigation, e.g., causal effects, which can be estimated entirely from experiments.

About the only thing we can ascertain experimentally is whether the (controlled) direct effect differs from the total effect, but we cannot evaluate the extent of mediation.

Another way to appreciate why stronger assumptions are needed for mediation is to note that non-confoundedness is not the same as ignorability. For non-binary variables one can construct examples where X and Y are not confounded (i.e., P(y|do(x)) = P(y|x)) and yet they are not ignorable (i.e., Y_x is not independent of X). Mediation requires ignorability in addition to non-confoundedness.
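
Such a construction can be verified numerically. In the sketch below (illustrative numbers of my own, not from the post), X = U takes three values, and each row of the table q of conditional probabilities P(Y_x = 1 | U = u) is built so that its mean equals its diagonal entry; this makes X and Y unconfounded while Y_x remains dependent on X:

```python
# X = U takes values 0, 1, 2 with equal probability; Y is binary with
# P(Y_x = 1 | U = u) = q[x][u].  Each row's mean equals its diagonal entry,
# which forces P(y|do(x)) = P(y|x), yet Y_x stays dependent on X.
# (Illustrative numbers of my own, not from the post.)
q = [[0.5, 0.3, 0.7],
     [0.3, 0.5, 0.7],
     [0.4, 0.6, 0.5]]

for x in range(3):
    p_do = sum(q[x]) / 3            # P(Y=1 | do(X=x)) averages over U
    p_obs = q[x][x]                 # P(Y=1 | X=x): observing X=x reveals U=x
    print(f"x={x}: P(Y=1|do(x))={p_do:.2f}  P(Y=1|X=x)={p_obs:.2f}")

# Non-confoundedness holds (each pair above agrees), yet ignorability fails:
print(f"P(Y_0=1) = {sum(q[0]) / 3:.2f}, but P(Y_0=1 | X=1) = {q[0][1]:.2f}")
```

The last line exhibits the failure of ignorability: the distribution of the potential outcome Y_0 shifts once we condition on X = 1, even though P(y|do(x)) and P(y|x) coincide for every x.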

Summary
Overall, the panel was illuminating, primarily due to the active participation of curious students. It gave me good reasons to believe that Political Science is destined to become a bastion of modern causal analysis. I wish economists would follow suit, despite the hurdles they face in bringing causal analysis into economics education.
http://ftp.cs.ucla.edu/pub/stat_ser/r391.pdf
http://ftp.cs.ucla.edu/pub/stat_ser/r395.pdf

Judea

## August 11, 2015

### Mid-Summer Greeting from the UCLA Causality Blog

Filed under: Announcement,Causal Effect,Counterfactual,General — moderator @ 6:09 pm

Friends in causality research,

This mid-summer greeting of UCLA Causality blog contains:
A. News items concerning causality research
B. Discussions and scientific results

1. The next issue of the Journal of Causal Inference is scheduled to appear this month, and the table of contents can be viewed here.

2. A new digital journal “Observational Studies” is out this month (link) and its first issue is dedicated to the legacy of William Cochran (1909-1980).

My contribution to this issue can be viewed here:
http://ftp.cs.ucla.edu/pub/stat_ser/r456.pdf

See also comment 1 below.

3. A video recording of my Cassel Lecture at the SER conference, June 2015, Denver, CO, can be viewed here:
https://epiresearch.org/about-us/archives/video-archives-2/the-scientific-approach-to-causal-inference/

4. A video of a conversation with Robert Gould concerning the teaching of causality can be viewed on Wiley’s Statistics Views, link (2 parts, scroll down).

5. We are informed of the upcoming publication of a new book, Rex Kline’s “Principles and Practice of Structural Equation Modeling,” Fourth Edition (link). Judging by the chapters I read, this book promises to be unique; it treats structural equation models for what they are: carriers of causal assumptions and tools for causal inference. Kudos, Rex.

6. We are informed of another book on causal inference: Imbens, Guido W.; Rubin, Donald B. “Causal Inference in Statistics, Social, and Biomedical Sciences: An Introduction” Cambridge University Press (2015). Readers will quickly realize that the ideas, methods, and tools discussed on this blog were kept out of this book. Omissions include: Control of confounding, testable implications of causal assumptions, visualization of causal assumptions, generalized instrumental variables, mediation analysis, moderation, interaction, attribution, external validity, explanation, representation of scientific knowledge and, most importantly, the unification of potential outcomes and structural models.

Given that the book is advertised as describing “the leading analysis methods” of causal inference, unsuspecting readers will get the impression that the field as a whole is facing fundamental obstacles, and that we are still lacking the tools to cope with basic causal tasks such as confounding control and model testing. I do not believe mainstream methods of causal inference are in such state of helplessness.

The authors’ motivation and rationale for this exclusion were discussed at length on this blog. See
“Are economists smarter than epidemiologists”
http://causality.cs.ucla.edu/blog/?p=1241

and “On the First Law of Causal Inference”
http://causality.cs.ucla.edu/blog/?m=201411

As most of you know, I have spent many hours trying to explain to leaders of the potential outcome school what insights and tools their students would be missing if not given exposure to a broader intellectual environment, one that embraces model-based inferences side by side with potential outcomes.

This book confirms my concerns, and its insularity-based impediments are likely to evoke interesting public discussions on the subject. For example, educators will undoubtedly wish to ask:

(1) Is there any guidance we can give students on how to select covariates for matching or adjustment?

(2) Are there any tools available to help students judge the plausibility of ignorability-type assumptions?

(3) Aren’t there any methods for deciding whether identifying assumptions have testable implications?

I believe that if such questions are asked often enough, they will eventually evoke non-ignorable answers.

7. The ASA issued a press release yesterday, recognizing Tyler VanderWeele’s new book “Explanation in Causal Inference,” winner of the 2015 Causality in Statistics Education Award
http://www.amstat.org/newsroom/pressreleases/JSM2015-CausalityinStatisticsEducationAward.pdf

Congratulations, Tyler.

Information on nominations for the 2016 Award will soon be announced.

8. Since our last Greetings (Spring, 2015) we have had a few lively discussions posted on this blog. I summarize them below:

8.1. Indirect Confounding and Causal Calculus
(How getting too anxious to criticize do-calculus may cause you to miss an easy solution to a problem you thought was hard).
July 23, 2015
http://causality.cs.ucla.edu/blog/?p=1545

8.2. Does Obesity Shorten Life? Or is it the Soda?
(Discusses whether it was the earth that caused the apple to fall, or the gravitational field created by the earth.)
May 27, 2015
http://causality.cs.ucla.edu/blog/?p=1534

8.3. Causation without Manipulation
(Asks whether anyone takes this mantra seriously nowadays, and whether we need manipulations to store scientific knowledge)
May 14, 2015
http://causality.cs.ucla.edu/blog/?p=1518

8.4. David Freedman, Statistics, and Structural Equation Models
(On why Freedman invented the “response schedule”)
May 6, 2015
http://causality.cs.ucla.edu/blog/?p=1502

8.5. We also had a few breakthroughs posted on our technical report page
http://bayes.cs.ucla.edu/csl_papers.html

My favorites this summer are these two:
http://ftp.cs.ucla.edu/pub/stat_ser/r452.pdf
http://ftp.cs.ucla.edu/pub/stat_ser/r450.pdf
because they deal with the tough and long-standing problem:
“How generalizable are empirical studies?”

Enjoy the rest of the summer
Judea

## May 6, 2015

### David Freedman, Statistics, and Structural Equation Models

Filed under: Causal Effect,Counterfactual,Definition,structural equations — moderator @ 12:40 am

(Re-edited: 5/6/15, 4 pm)

Michael A Lewis (Hunter College) sent us the following query:

Dear Judea,
I was reading a book by the late statistician David Freedman, and in it he uses the term “response schedule” to refer to an equation which represents a causal relationship between variables. It appears that he’s using that term as a synonym for “structural equation,” the term you use. In your view, am I correct in regarding these as synonyms? Also, Freedman seemed to be of the belief that response schedules only make sense if the causal variable can be regarded as amenable to manipulation. So variables like race, gender, maybe even socioeconomic status, etc., cannot sensibly be regarded as causes, since they can’t be manipulated. I’m wondering what your view is of this manipulation perspective.
Michael

My answer is: Yes. Freedman’s “response schedule” is a synonym for “structural equation.” The reason why Freedman did not say so explicitly has to do with his long and rather bumpy journey from statistical to causal thinking. Freedman, like most statisticians in the 1980’s, could not make sense of the Structural Equation Models (SEM) that social scientists (e.g., Duncan) and econometricians (e.g., Goldberger) had adopted for representing causal relations. As a result, he criticized and ridiculed this enterprise relentlessly. In his (1987) paper “As others see us,” for example, he went as far as “proving” that the entire enterprise is grounded in logical contradictions. The fact that SEM researchers at that time could not defend their enterprise effectively (they were as confused about SEM as the statisticians, judging by the way they responded to his paper) only intensified Freedman’s criticism. It continued well into the 1990’s, with renewed attacks on anything connected with causality, including the causal search program of Spirtes, Glymour and Scheines.

I have had a long and friendly correspondence with Freedman since 1993 and, going over a file of over 200 emails, it appears that it was around 1994 when he began to convert to causal thinking. First through the do-operator (by his own admission) and, later, by realizing that structural equations offer a neat way of encoding counterfactuals.

I speculate that the reason Freedman could not say plainly that causality is based on structural equations was that it would have been too hard for him to admit that he was in error criticizing a model that he misunderstood, one that is so simple to understand. This oversight was not entirely his fault; for someone trying to understand the world from a statistical viewpoint, structural equations do not make any sense; the asymmetric nature of the equations and those slippery “error terms” stand outside the prism of the statistical paradigm. Indeed, even today, very few statisticians feel comfortable in the company of structural equations. (How many statistics textbooks do we know that discuss structural equations?)

So, what do you do when you come to realize that a concept you ridiculed for 20 years is the key to understanding causation? Freedman decided not to say “I erred,” but to argue that the concept was not rigorous enough for statisticians to understand. He thus formalized “response schedule” and treated it as a novel mathematical object. The fact is, however, that if we strip “response schedule” of its superlatives, we find that it is just what you and I call a “function,” i.e., a mapping from the states of one variable onto the states of another. Some of Freedman’s disciples admire this invention (see R. Berk’s 2004 book on regression), but most people that I know just look at it and say: this is what a structural equation is.

The story of David Freedman is the story of statistical science itself and the painful journey the field has taken through the causal reformation. Starting with the structural equations of Sewall Wright (1921), and going through Freedman’s “response schedule,” the field still can’t swallow the fundamental building block of scientific thinking, in which Nature is encoded as a society of sensing and responding variables. Funny, econometrics has yet to start its reformation, though it has been housing SEM since Haavelmo (1943). (How many econometrics textbooks do we know that teach students how to read counterfactuals from structural equations?)

I now go to your second question, concerning the mantra “no causation without manipulation.” I do not believe anyone takes this slogan as a restriction nowadays, including its authors, Holland and Rubin. It will remain a relic of an era when statisticians tried to define causation with the only mental tool available to them: the randomized controlled trial (RCT).

I summed it up in Causality, 2009, p. 361: “To suppress talk about how gender causes the many biological, social, and psychological distinctions between males and females is to suppress 90% of our knowledge about gender differences.”

I further elaborated on this issue in (Bollen and Pearl, 2014, p. 313), saying:

“Pearl (2011) further shows that this restriction has led to harmful consequences by forcing investigators to compromise their research questions only to avoid the manipulability restriction. The essential ingredient of causation, as argued in Pearl (2009: 361), is responsiveness, namely, the capacity of some variables to respond to variations in other variables, regardless of how those variations came about.”

In (Causality, 2009, p. 361) I also find this paragraph: “It is for that reason, perhaps, that scientists invented counterfactuals; it permits them to state and conceive the realization of antecedent conditions without specifying the physical means by which these conditions are established.”

All in all, you have touched on one of the most fascinating chapters in the history of science, featuring a respectable scientific community that clings desperately to an outdated dogma, while resisting, adamantly, the light that shines around it. This chapter deserves a major headline in Kuhn’s book on scientific revolutions. As I once wrote: “It is easier to teach Copernicus in the Vatican than discuss causation with a statistician.” But this was in the 1990’s, before causal inference became fashionable. Today, after a vicious 100-year war of reformation, things are beginning to change (see http://www.nasonline.org/programs/sackler-colloquia/completed_colloquia/Big-data.html). I hope your upcoming book further accelerates the transition.

## April 24, 2015

### Flowers of the First Law of Causal Inference (3)

Flower 3 — Generalizing experimental findings

Continuing our examination of “the flowers of the First Law” (see previous flowers here and here) this posting looks at one of the most crucial questions in causal inference: “How generalizable are our randomized clinical trials?” Readers of this blog would be delighted to learn that one of our flowers provides an elegant and rather general answer to this question. I will describe this answer in the context of transportability theory, and compare it to the way researchers have attempted to tackle the problem using the language of ignorability. We will see that ignorability-type assumptions are fairly limited, both in their ability to define conditions that permit generalizations, and in our ability to justify them in specific applications.

1. Transportability and Selection Bias
The problem of generalizing experimental findings from the trial sample to the population as a whole, also known as the problem of “sample selection-bias” (Heckman, 1979; Bareinboim et al., 2014), has received wide attention lately, as more researchers come to recognize this bias as a major threat to the validity of experimental findings in both the health sciences (Stuart et al., 2015) and social policy making (Manski, 2013).

Since participation in a randomized trial cannot be mandated, we cannot guarantee that the study population would be the same as the population of interest. For example, the study population may consist of volunteers who respond to financial and medical incentives offered by pharmaceutical firms or experimental teams, so the distribution of outcomes in the study may differ substantially from the distribution of outcomes under the policy of interest.

Another impediment to the validity of experimental findings is that the types of individuals in the target population may change over time. For example, as more individuals become eligible for health insurance, the types of individuals seeking services would no longer match the types of individuals that were sampled for the study. A similar change would occur as more individuals become aware of the efficacy of the treatment. The result is an inherent disparity between the target population and the population under study.

The problem of generalizing across disparate populations has received a formal treatment in (Pearl and Bareinboim, 2014), where it was labeled “transportability,” and where necessary and sufficient conditions for valid generalization were established (see also Bareinboim and Pearl, 2013). The problem of selection bias, though it has some unique features, can also be viewed as a variant of the transportability problem, thus inheriting all the theoretical results established in (Pearl and Bareinboim, 2014) that guarantee valid generalizations. We will describe the two problems side by side and then return to the distinction between the types of assumptions that are needed for enabling generalizations.

The transportability problem concerns two dissimilar populations, Π and Π*, and requires us to estimate the average causal effect P*(y_x) (explicitly, P*(y_x) ≡ P*(Y = y | do(X = x))) in the target population Π*, based on experimental studies conducted on the source population Π. Formally, we assume that all differences between Π and Π* can be attributed to a set of factors S that produce disparities between the two, so that P*(y_x) = P(y_x | S = 1). The information available to us consists of two parts: first, treatment effects estimated from experimental studies in Π and, second, observational information extracted from both Π and Π*. The former can be written P(y | do(x), z), where Z is a set of covariates measured in the experimental study, and the latter are written P*(x, y, z) = P(x, y, z | S = 1) and P(x, y, z), respectively. In addition to this information, we are also equipped with a qualitative causal model M that encodes causal relationships in Π and Π*, with the help of which we need to identify the query P*(y_x). Mathematically, identification amounts to transforming the query expression

P*(y_x) = P(y | do(x), S = 1)

into a form derivable from the available information I_TR, where

I_TR = { P(y | do(x), z),  P(x, y, z | S = 1),  P(x, y, z) }.

The selection bias problem is slightly different. Here the aim is to estimate the average causal effect P(y_x) in the population Π, while the experimental information available to us, I_SB, comes from a preferentially selected sample, S = 1, and is given by P(y | do(x), z, S = 1). Thus, the selection bias problem calls for transforming the query P(y_x) into a form derivable from the information set:

I_SB = { P(y | do(x), z, S = 1),  P(x, y, z | S = 1),  P(x, y, z) }.

In the Appendix, we demonstrate how transportability problems and selection bias problems are solved using the transformations described above.

The analysis reported in (Pearl and Bareinboim, 2014) has resulted in an algorithmic criterion (Bareinboim and Pearl, 2013) for deciding whether transportability is feasible and, when confirmed, the algorithm produces an estimand for the desired effects. The algorithm is complete, in the sense that, when it fails, a consistent estimate of the target effect does not exist (unless one strengthens the assumptions encoded in M).

There are several lessons to be learned from this analysis when considering selection bias problems.

1. The graphical criteria that authorize transportability are applicable to selection bias problems as well, provided that the graph structures for the two problems are identical. This means that whenever a selection bias problem is characterized by a graph for which transportability is feasible, recovery from selection bias is feasible by the same algorithm. (The Appendix demonstrates this correspondence.)

2. The graphical criteria for transportability are more involved than the ones usually invoked in testing treatment assignment ignorability (e.g., through the back-door test). They may require several d-separation tests on several sub-graphs. It is utterly unimaginable, therefore, that such criteria could be managed by unaided human judgment, no matter how ingenious. (See discussions with Guido Imbens regarding computational barriers to graph-free causal inference, click here.) Graph avoiders should reckon with this predicament.

3. In general, problems associated with external validity cannot be handled by balancing disparities between distributions. The same disparity between P(x, y, z) and P*(x, y, z) may demand different adjustments, depending on the location of S in the causal structure. A simple example of this phenomenon is demonstrated in Fig. 3(b) of (Pearl and Bareinboim, 2014), where a disparity in the average reading ability of two cities requires two different treatments, depending on what causes the disparity. If the disparity emanates from age differences, adjustment is necessary, because age is likely to affect the potential outcomes. If, on the other hand, the disparity emanates from differences in educational programs, no adjustment is needed, since education, in itself, does not modify response to treatment. The distinction is made formal and vivid in causal graphs.

4. In many instances, generalizations can be achieved by conditioning on post-treatment variables, an operation that is frowned upon in the potential-outcome framework (Rosenbaum, 2002, pp. 73–74; Rubin, 2004; Sekhon, 2009) but has become extremely useful in graphical analysis. The difference between the conditioning operators used in these two frameworks is echoed in the difference between Qc and Qdo, the two z-specific effects discussed in a previous posting on this blog (link). The latter defines information that is estimable from experimental studies, whereas the former invokes a retrospective counterfactual that may or may not be estimable empirically.

In the next section we discuss the benefit of leveraging the do-operator in problems concerning generalization.

2. Ignorability versus Admissibility in the Pursuit of Generalization

A key assumption in almost all conventional analyses of generalization (from sample to population) is S-ignorability, written Y_x ⊥ S | Z, where Y_x is the potential outcome predicated on the intervention X = x, S is a selection indicator (with S = 1 standing for selection into the sample) and Z is a set of observed covariates. This condition, sometimes written as a difference, Y_1 − Y_0 ⊥ S | Z, and sometimes as a conjunction, {Y_1, Y_0} ⊥ S | Z, appears in Hotz et al. (2005); Cole and Stuart (2010); Tipton et al. (2014); Hartman et al. (2015), and possibly other researchers committed to potential-outcome analysis. This assumption says: if we succeed in finding a set Z of pre-treatment covariates such that cross-population differences disappear in every stratum Z = z, then the problem can be solved by averaging over those strata. (Lacking a procedure for finding Z, this solution avoids the harder part of the problem and, in this sense, it somewhat borders on the circular. It amounts to saying: if we can solve the problem in every stratum Z = z, then the problem is solved; hardly an informative statement.)

In graphical analysis, on the other hand, the problem of generalization has been studied using another condition, labeled S-admissibility (Pearl and Bareinboim, 2014), which is defined by:

P(y | do(x), z) = P(y | do(x), z, s)

or, using counterfactual notation,

P(y_x | z_x) = P(y_x | z_x, s_x)

It states that in every treatment regime X = x, the observed outcome Y is conditionally independent of the selection mechanism S, given Z, all evaluated at that same treatment regime.

Clearly, S-admissibility coincides with S-ignorability for pre-treatment S and Z; the two notions differ, however, for treatment-dependent covariates. The Appendix presents scenarios (Fig. 1(a) and (b)) in which post-treatment covariates Z do not satisfy S-ignorability, but satisfy S-admissibility and, thus, enable generalization to take place. We also present scenarios where both S-ignorability and S-admissibility hold and, yet, experimental findings are not generalizable by standard procedures of post-stratification. Rather, the correct procedure is uncovered naturally from the graph structure.

One of the reasons that S-admissibility has received greater attention in the graph-based literature is that it has a very simple graphical representation: Z and X should separate Y from S in a mutilated graph, from which all arrows entering X have been removed. Such a graph depicts conditional independencies among observed variables in the population under experimental conditions, i.e., where X is randomized.

In contrast, S-ignorability has not been given a simple graphical interpretation, but it can be verified from either twin networks (Causality, pp. 213-4) or from counterfactually augmented graphs (Causality, p. 341), as we have demonstrated in an earlier posting on this blog (link). Using either representation, it is easy to see that S-ignorability is rarely satisfied in transportability problems in which Z is a post-treatment variable. This is because, whenever S is a proxy to an ancestor of Z, Z cannot separate Yx from S.

The simplest result of both PO and graph-based approaches is the re-calibration or post-stratification formula. It states that, if Z is a set of pre-treatment covariates satisfying S-ignorability (or S-admissibility), then the causal effect in the population at large can be recovered from a selection-biased sample by a simple re-calibration process. Specifically, if P(y_x | S = 1, Z = z) is the z-specific probability distribution of Y_x in the sample, then the distribution of Y_x in the population at large is given by

P(yx) = ∑z P(yx | S = 1, z) P(z)    (*)

where P(z) is the probability of Z = z in the target population (where S = 0). Equation (*) follows from S-ignorability by conditioning on z and adding S = 1 to the conditioning set – a one-line proof. The proof fails, however, when Z is treatment dependent, because the counterfactual factor P(yx|S = 1,z) is not normally estimable in the experimental study. (See the Qc vs. Qdo discussion here.)
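To make Eq. (*) concrete, here is a minimal numeric sketch; all the probabilities below are invented for illustration, with a binary covariate Z.

```python
# Hypothetical z-specific counterfactual probabilities estimated in
# the selection-biased sample (S = 1), for a binary Z:
p_yx_given_s1_z = {0: 0.30, 1: 0.70}   # P(y_x | S = 1, z)
# Distribution of Z in the target population:
p_z_target      = {0: 0.80, 1: 0.20}   # P(z)

# Re-calibration / post-stratification, Eq. (*):
#   P(y_x) = sum_z P(y_x | S = 1, z) P(z)
p_yx = sum(p_yx_given_s1_z[z] * p_z_target[z] for z in p_z_target)
print(p_yx)  # 0.38
```

Note how the answer is pulled toward the z = 0 stratum, which dominates the target population even if it was under-represented in the sample.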

As noted in (Keiding, 1987), this re-calibration formula goes back to 18th-century demographers (Dale, 1777; Tetens, 1786) facing the task of predicting overall mortality (across populations) from age-specific data. Their reasoning was probably as follows: if the source and target populations differ in distribution by a set of attributes Z, then to correct for these differences we need to weight samples by a factor that would restore similarity to the two distributions. Some researchers view Eq. (*) as a version of Horvitz and Thompson's (1952) post-stratification method of estimating the mean of a super-population from un-representative stratified samples. The essential difference between survey sampling calibration and the calibration required in Eq. (*) is that the calibrating covariates Z are not just any set by which the distributions differ; they must satisfy the S-ignorability (or S-admissibility) condition, which is a causal, not a statistical, condition. It is therefore not discernible from distributions over observed variables. In other words, the re-calibration formula should depend on disparities between the causal models of the two populations, not merely on distributional disparities. This is demonstrated explicitly in Fig. 4(c) of (Pearl and Bareinboim, 2014), which is also treated in the Appendix (Fig. 1(a)).
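The demographers' weighting argument can be checked by simulation. The sketch below (all numbers illustrative, and assuming Z is a valid calibrating set) draws a biased sample and re-weights each unit by the factor P(z) / P(z | S = 1) that restores similarity of the two distributions; the weighted average approaches the post-stratified value of Eq. (*).

```python
import random
random.seed(1)

# Hypothetical study sample in which the z = 1 stratum is
# over-represented relative to the target population:
p_z_sample  = {0: 0.5, 1: 0.5}    # P(z | S = 1) in the sample
p_z_target  = {0: 0.8, 1: 0.2}    # P(z) in the target population
p_y_given_z = {0: 0.30, 1: 0.70}  # stratum-specific outcome rates

# Post-stratified value, as in Eq. (*):
target_mean = sum(p_y_given_z[z] * p_z_target[z] for z in p_z_target)

# Simulate the biased sample, then weight each unit by the
# distribution-restoring factor P(z) / P(z | S = 1):
n = 200_000
num = den = 0.0
for _ in range(n):
    z = 1 if random.random() < p_z_sample[1] else 0
    y = 1 if random.random() < p_y_given_z[z] else 0
    w = p_z_target[z] / p_z_sample[z]
    num += w * y
    den += w
print(round(target_mean, 2), round(num / den, 2))  # both approx. 0.38
```

The agreement is purely arithmetic here; what the causal condition buys is the license to interpret the re-weighted average as P(yx) in the target population.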

While S-ignorability and S-admissibility are both sufficient for re-calibrating pre-treatment covariates Z, S-admissibility goes further and permits generalizations in cases where Z consists of post-treatment covariates. A simple example is the bio-marker model shown in Fig. 4(c) (Example 3) of (Pearl and Bareinboim, 2014), which is also discussed in the Appendix.

## Conclusions

1. Many opportunities for generalization are opened up through the use of post-treatment variables. These opportunities remain inaccessible to ignorability-based analysis, partly because S-ignorability does not always hold for such variables but, mainly, because ignorability analysis requires information in the form of z-specific counterfactuals, which is often not estimable from experimental studies.

2. Most of these opportunities have been charted through the completeness results for transportability (Bareinboim et al., 2014); others can be revealed by simple derivations in do-calculus, as shown in the Appendix.

3. There is still the issue of assisting researchers in judging whether S-ignorability (or S-admissibility) is plausible in any given application. Graphs excel in this dimension because they match the format in which people store scientific knowledge. Some researchers prefer to do it by direct appeal to intuition; they do so at their own peril.

For references and appendix, click here.
