Causal Analysis in Theory and Practice

July 9, 2016

The Three Layer Causal Hierarchy

Filed under: Causal Effect,Counterfactual,Discussion,structural equations — bryantc @ 8:57 pm

Recent discussions concerning causal mediation gave me the impression that many researchers in the field are not familiar with the ramifications of the Causal Hierarchy, as articulated in Chapter 1 of Causality (2000, 2009). This note presents the Causal Hierarchy in table form (Fig. 1) and discusses the distinctions between its three layers: 1. Association, 2. Intervention, 3. Counterfactuals.

Judea

June 28, 2016

On the Classification and Subsumption of Causal Models

Filed under: Causal Effect,Counterfactual,structural equations — bryantc @ 5:32 pm

From Christos Dimitrakakis:

>> To be honest, there is such a plethora of causal models, that it is not entirely clear what subsumes what, and which one is equivalent to what. Is there a simple taxonomy somewhere? I thought that influence diagrams were sufficient for all causal questions, for example, but one of Pearl’s papers asserts that this is not the case.

Reply from J. Pearl:

Dear Christos,

From my perspective, I do not see a plethora of causal models at all, so it is hard for me to answer your question in specific terms. What I do see is a symbiosis of all causal models in one framework, called the Structural Causal Model (SCM), which unifies structural equations, potential outcomes, and graphical models. So, for me, the world appears simple, well organized, and smiling. Perhaps you can tell us which models lured your attention and caused you to see a plethora of models lacking a subsumption taxonomy.

The taxonomy that has helped me immensely is the three-level hierarchy described in chapter 1 of my book Causality: 1. association, 2. intervention, and 3. counterfactuals. It is a useful hierarchy because it has an objective criterion for the classification: you cannot answer questions at level i unless you have assumptions from level i or higher.
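To make the three layers concrete, here is a small numerical sketch (my own illustration, not from Causality), using an assumed linear SCM with a confounder Z. Each layer answers a question the layer below cannot: the regression slope (association) differs from the interventional effect, and the unit-level counterfactual requires the functional model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assumed toy structural causal model:
#   Z := U_z                (confounder)
#   X := Z + U_x
#   Y := 2*X + 3*Z + U_y
U_z, U_x, U_y = rng.normal(size=(3, n))
Z = U_z
X = Z + U_x
Y = 2 * X + 3 * Z + U_y

# Layer 1 (association): the regression slope of Y on X mixes the
# structural coefficient (2) with confounding through Z.
assoc_slope = np.cov(X, Y)[0, 1] / np.var(X)   # ~ 2 + 3*cov(X,Z)/var(X) = 3.5

# Layer 2 (intervention): under do(X = x), Z is left untouched, so the
# interventional contrast recovers the structural coefficient exactly.
Y_do0 = 2 * 0 + 3 * Z + U_y
Y_do1 = 2 * 1 + 3 * Z + U_y
interv_slope = (Y_do1 - Y_do0).mean()          # = 2

# Layer 3 (counterfactual): for one individual with observed (z, x, y),
# abduce the noise, act, predict -- what Y would have been had X been x+1.
z, x, y = Z[0], X[0], Y[0]
u_y = y - 2 * x - 3 * z              # abduction
y_cf = 2 * (x + 1) + 3 * z + u_y     # action + prediction
# y_cf - y equals 2: a unit-level claim unavailable at layers 1 and 2.
```

The three quantities are computed from progressively stronger inputs: data alone, data plus the graph, and the full functional model.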

As to influence diagrams, the relation between them and SCM is discussed in Section 11.6 of my book Causality (2009). Influence diagrams belong to the 2nd layer of the causal hierarchy, together with Causal Bayesian Networks. They lack, however, two facilities:

1. The ability to process counterfactuals.
2. The ability to handle novel actions.

To elaborate,

1. Counterfactual sentences (e.g., “Given what I see, I should have acted differently”) require functional models. Influence diagrams are built on conditional and interventional probabilities, that is, p(y|x) or p(y|do(x)). There is no interpretation of E(Y_x | x’) in this framework.

2. The probabilities that annotate links emanating from Action Nodes are of the interventional type, p(y|do(x)), and must be assessed judgmentally by the user. No facility is provided for deriving these probabilities from data together with the structure of the graph. Such a derivation is developed in chapter 3 of Causality, in the context of Causal Bayes Networks, where every node can turn into an action node.
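The derivation alluded to in point 2 can be sketched on a toy example (my own, with arbitrary numbers): a three-node Causal Bayes Network with Z -> X, Z -> Y, and X -> Y. The interventional P(y|do(x)) obtained from the truncated product agrees with the back-door adjustment computed from the observational distribution plus the graph, while the naive conditional P(y|x) does not.

```python
# Assumed conditional probability tables for the toy network (all binary).
P_z = {0: 0.6, 1: 0.4}                                    # P(z)
P_x_given_z = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # P(x|z)
P_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.5,
                 (1, 0): 0.6, (1, 1): 0.9}                # P(Y=1|x,z)

def p_joint(z, x, y):
    """Observational joint P(z, x, y) via the chain rule of the network."""
    py1 = P_y1_given_xz[(x, z)]
    return P_z[z] * P_x_given_z[z][x] * (py1 if y == 1 else 1 - py1)

def p_y1_do(x):
    """P(Y=1|do(x)) by the truncated product: drop P(x|z), keep the rest."""
    return sum(P_z[z] * P_y1_given_xz[(x, z)] for z in (0, 1))

def p_y1_adjust(x):
    """Same quantity derived from observational data alone, using the
    back-door adjustment licensed by the graph: sum_z P(y|x,z) P(z)."""
    total = 0.0
    for z in (0, 1):
        p_xz = sum(p_joint(z, x, y) for y in (0, 1))
        total += (p_joint(z, x, 1) / p_xz) * P_z[z]
    return total

def p_y1_given_x(x):
    """The naive conditional P(Y=1|x), which remains confounded by Z."""
    num = sum(p_joint(z, x, 1) for z in (0, 1))
    den = sum(p_joint(z, x, y) for z in (0, 1) for y in (0, 1))
    return num / den
```

With these numbers, p_y1_do(1) = 0.72 = p_y1_adjust(1), whereas p_y1_given_x(1) = 0.792: the graph, not judgment, licenses the derivation of the interventional probability from data.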

Using the causal hierarchy, the 1st Law of Counterfactuals and the unification provided by SCM, the space of causal models should shine in clarity and simplicity. Try it, and let us know of any questions remaining.

Judea

June 20, 2016

Recollections from the WCE conference at Stanford

Filed under: Counterfactual,General,Mediated Effects,structural equations — bryantc @ 7:45 am

On May 21, Kosuke Imai and I participated in a panel on Mediation, at the annual meeting of the West Coast Experiment Conference, organized by Stanford Graduate School of Business http://www.gsb.stanford.edu/facseminars/conferences/west-coast-experiments-conference. The following are some of my recollections from that panel.

1.
We began the discussion by reviewing causal mediation analysis and summarizing the exchange we had on the pages of Psychological Methods (2014)
http://ftp.cs.ucla.edu/pub/stat_ser/r389-imai-etal-commentary-r421-reprint.pdf

My slides for the panel can be viewed here:
http://web.cs.ucla.edu/~kaoru/stanford-may2016-bw.pdf

We ended with a consensus regarding the importance of causal mediation and the conditions for the identification of Natural Direct and Indirect Effects, from randomized as well as observational studies.

2.
We proceeded to discuss the symbiosis between the structural and the counterfactual languages. Here I focused on slides 4-6 (page 3), and remarked that only those who are willing to solve a toy problem from beginning to end, using both potential outcomes and DAGs, can understand the tradeoff between the two. Such a toy problem (and its solution) was presented in slide 5 (page 3), titled “Formulating a problem in Three Languages,” and the questions that I asked the audience are still ringing in my ears. Please have a good look at these two sets of assumptions and ask yourself:

a. Have we forgotten any assumption?
b. Are these assumptions consistent?
c. Is any of the assumptions redundant (i.e. does it follow logically from the others)?
d. Do they have testable implications?
e. Do these assumptions permit the identification of causal effects?
f. Are these assumptions plausible in the context of the scenario given?

As I was discussing these questions over slide 5, the audience seemed to be in general agreement with the conclusion that, despite their logical equivalence, the graphical language enables us to answer these questions immediately, while the potential outcome language remains silent on all of them.

I consider this example to be pivotal to the comparison of the two frameworks. I hope that questions a, b, c, d, e, f will be remembered, and that speakers from both camps will be asked to address them squarely and explicitly.

The fact that graduate students made up the majority of the participants gives me hope that questions a, b, c, d, e, f will finally receive the attention they deserve.

3.
As we discussed the virtues of graphs, I found it necessary to reiterate the observation that DAGs are more than just a “natural and convenient way to express assumptions about causal structures” (Imbens and Rubin, 2013, p. 25). Praising their transparency while ignoring their inferential power misses the main role that graphs play in causal analysis. The power of graphs lies in computing complex implications of causal assumptions (i.e., the “science”) no matter in what language they are expressed. Typical implications are: conditional independencies among variables and counterfactuals, what covariates need be controlled to remove confounding or selection bias, whether effects can be identified, and more. These implications could, in principle, be derived from any equivalent representation of the causal assumptions, not necessarily graphical, but not before incurring a prohibitive computational cost. See, for example, what happens when economists try to replace d-separation with graphoid axioms http://ftp.cs.ucla.edu/pub/stat_ser/r420.pdf.
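As a toy illustration of the kind of implication a graph delivers (my own sketch, not part of the post): the chain X -> M -> Y implies, by d-separation, that X and Y are dependent marginally but independent given M. In a linear-Gaussian model that independence shows up as a vanishing partial correlation, which a simulation can corroborate.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Assumed linear-Gaussian SCM for the chain X -> M -> Y.
X = rng.normal(size=n)
M = 0.8 * X + rng.normal(size=n)
Y = 0.5 * M + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyfit(c, a, 1)[0] * c
    rb = b - np.polyfit(c, b, 1)[0] * c
    return np.corrcoef(ra, rb)[0, 1]

# d-separation predicts: X and Y dependent marginally (path through M is
# open), but independent given M (conditioning blocks the chain).
marginal_corr = np.corrcoef(X, Y)[0, 1]   # clearly nonzero (~0.34 here)
corr_given_m = partial_corr(X, Y, M)      # ~ 0
```

The point is that this testable implication is read off the graph in constant time; deriving it from an equivalent non-graphical encoding of the same assumptions requires axiomatic manipulation.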

4.
Following the discussion of representations, we addressed questions posed to us by the audience, in particular, five questions submitted by Professor Jon Krosnick (Political Science, Stanford).

I summarize them in the following slide:

Krosnick’s Questions to Panel
———————————————-
1) Do you think an experiment has any value without mediational analysis?
2) Is a separate study directly manipulating the mediator useful? How is the second study any different from the first one?
3) Imai’s correlated residuals test seems valuable for distinguishing fake from genuine mediation. Is that so? And how is it related to the traditional mediational test?
4) Why isn’t it easy to test whether participants who show the largest increases in the posited mediator show the largest changes in the outcome?
5) Why is mediational analysis any “worse” than any other method of investigation?
———————————————-
My answers focused on questions 2, 4, and 5, which I summarize below:

2)
Q. Is a separate study directly manipulating the mediator useful?
Answer: Yes, it is useful if physically feasible but, still, it cannot give us an answer to the basic mediation question: “What percentage of the observed response is due to mediation?” The concept of mediation is necessarily counterfactual, i.e., it sits at the top layer of the causal hierarchy (see Causality, chapter 1). It therefore cannot be defined in terms of population experiments, however clever. Mediation can be evaluated with the help of counterfactual assumptions such as “conditional ignorability” or “no interaction,” but these assumptions cannot be verified in population experiments.
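For readers who want the counterfactual definitions spelled out, here is a sketch (my own, with assumed coefficients) of the total, natural direct, and natural indirect effects in a toy linear SCM. The “percentage due to mediation” is NIE/TE, and each quantity is defined by nesting counterfactuals such as Y(1, M(0)), not by any population experiment.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Assumed linear SCM: treatment X, mediator M, outcome Y, no confounding.
#   M := a*X + U_m,    Y := c*X + b*M + U_y
a, b, c = 1.5, 2.0, 4.0
U_m, U_y = rng.normal(size=(2, n))

def M_of(x):       # potential mediator M_x (shares each unit's noise U_m)
    return a * x + U_m

def Y_of(x, m):    # potential outcome Y_{x,m}
    return c * x + b * m + U_y

# Counterfactual definitions (top layer of the hierarchy):
TE  = (Y_of(1, M_of(1)) - Y_of(0, M_of(0))).mean()   # total effect = c + a*b
NDE = (Y_of(1, M_of(0)) - Y_of(0, M_of(0))).mean()   # natural direct = c
NIE = (Y_of(0, M_of(1)) - Y_of(0, M_of(0))).mean()   # natural indirect = a*b

fraction_mediated = NIE / TE    # here 3/7 of the response goes through M
```

Note that NDE and NIE require evaluating Y at the mediator value M would have taken under the *other* treatment: exactly the cross-world quantity that no randomized design, however clever, manipulates directly.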

4)
Q. Why isn’t it easy to test whether participants who show the largest increases in the posited mediator show the largest changes in the outcome?
Answer: Translating the question into counterfactual notation, the suggested test requires the existence of a monotonic function f_m such that, for every individual, we have Y_1 – Y_0 = f_m(M_1 – M_0).

This condition expresses a feature we expect to find in mediation, but it cannot be taken as a DEFINITION of mediation. This condition is essentially the way indirect effects are defined in the Principal Strata framework (Frangakis and Rubin, 2002), the deficiencies of which are well known. See http://ftp.cs.ucla.edu/pub/stat_ser/r382.pdf.

In particular, imagine a switch S controlling two light bulbs L1 and L2. Positive correlation between L1 and L2 does not mean that L1 mediates between the switch and L2. Many examples of incompatibility are demonstrated in the paper above.
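The switch-and-bulbs point can be checked mechanically (my own sketch of the scenario just described): with S driving L1 and L2 through separate wires, L1 and L2 are perfectly correlated, yet intervening on L1 changes nothing about L2, so L1 mediates nothing.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Assumed structural model: one switch S drives two bulbs independently.
#   L1 := S,   L2 := S      (no arrow from L1 to L2)
S = rng.integers(0, 2, size=n)
L1 = S
L2 = S

# The bulbs are perfectly correlated in observational data...
corr = np.corrcoef(L1, L2)[0, 1]    # = 1.0

# ...yet holding L1 fixed by intervention leaves the effect of S on L2
# untouched, because L2's structural equation ignores L1 entirely.
def L2_after_do(s, l1):
    return s            # do(L1 = l1) has no influence on L2

controlled_direct = L2_after_do(1, 0) - L2_after_do(0, 0)   # the full effect
```

A test built on the correlation between increments of the mediator and increments of the outcome would declare mediation here, which is exactly the incompatibility the example is meant to expose.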

The conventional mediation tests (in the Baron and Kenny tradition) suffer from the same problem; they test features of mediation that are common in linear systems, but not the essence of mediation, which is universal to all systems: linear and nonlinear, with continuous as well as categorical variables.

5)
Q. Why is mediational analysis any “worse” than any other method of investigation?
Answer: The answer is closely related to the one given to question 2). Mediation is not a “method” but a property of the population, which is defined counterfactually and therefore requires counterfactual assumptions for evaluation. Experiments are not sufficient; in this sense mediation is “worse” than other properties under investigation, e.g., causal effects, which can be estimated entirely from experiments.

About the only thing we can ascertain experimentally is whether the (controlled) direct effect differs from the total effect, but we cannot evaluate the extent of mediation.

Another way to appreciate why stronger assumptions are needed for mediation is to note that non-confoundedness is not the same as ignorability. For non-binary variables one can construct examples where X and Y are not confounded (i.e., P(y|do(x)) = P(y|x)) and yet are not ignorable (i.e., Y_x is not independent of X). Mediation requires ignorability in addition to non-confoundedness.
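Such an example can be made fully explicit. The following construction is my own (not from the post): U is a fair coin, X is ternary with a distribution that depends on U, and the potential outcomes are Y_0 = U, Y_1 = 0, Y_2 = 0. Exact enumeration confirms that P(y|do(x)) = P(y|x) for every x (non-confoundedness), while Y_0 = U is plainly dependent on X (ignorability fails).

```python
# Assumed construction: fair coin U, ternary treatment X.
P_u = {0: 0.5, 1: 0.5}
P_x_given_u = {0: {0: 0.25, 1: 0.25, 2: 0.50},
               1: {0: 0.25, 1: 0.50, 2: 0.25}}   # X depends on U via 1 vs 2

def Y_pot(x, u):
    """Potential outcome Y_x for a unit with U = u: Y_0 = U, Y_1 = Y_2 = 0."""
    return u if x == 0 else 0

def p_y1_do(x):
    """P(Y=1|do(x)): average the potential outcome over the population."""
    return sum(P_u[u] * Y_pot(x, u) for u in (0, 1))

def p_y1_given_x(x):
    """Observational P(Y=1|x), which conditions U on the observed X."""
    num = sum(P_u[u] * P_x_given_u[u][x] * Y_pot(x, u) for u in (0, 1))
    den = sum(P_u[u] * P_x_given_u[u][x] for u in (0, 1))
    return num / den

def p_u1_given_x(x):
    """P(U=1|x); since Y_0 = U, this measures dependence of Y_0 on X."""
    den = sum(P_u[u] * P_x_given_u[u][x] for u in (0, 1))
    return P_u[1] * P_x_given_u[1][x] / den
```

Here P(U=1|X=0) = 1/2, so the x = 0 arm is unconfounded, and Y_1, Y_2 are constants, so the other arms are trivially so; yet P(U=1|X=1) = 2/3 differs from P(U=1) = 1/2, so Y_0 is not independent of X.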

Summary
Overall, the panel was illuminating, primarily due to the active participation of curious students. It gave me good reason to believe that Political Science is destined to become a bastion of modern causal analysis. I wish economists would follow suit, despite the hurdles they face in bringing causal analysis into economics education.
http://ftp.cs.ucla.edu/pub/stat_ser/r391.pdf
http://ftp.cs.ucla.edu/pub/stat_ser/r395.pdf

Judea

February 12, 2016

Winter Greeting from the UCLA Causality Blog

Friends in causality research,
This greeting from the UCLA Causality blog contains:

A. An introduction to our newly published book, Causal Inference in Statistics – A Primer, Wiley 2016 (with M. Glymour and N. Jewell)
B. Comments on two other books: (1) R. Kline’s Structural Equation Modeling and (2) L. Pereira and A. Saptawijaya’s book on Machine Ethics.
C. News, Journals, awards and other frills.

A.
Our publisher (Wiley) has informed us that the book “Causal Inference in Statistics – A Primer” by J. Pearl, M. Glymour and N. Jewell is already available on Kindle, and will be available in print Feb. 26, 2016.
http://www.amazon.com/Causality-A-Primer-Judea-Pearl/dp/1119186846
http://www.amazon.com/Causal-Inference-Statistics-Judea-Pearl-ebook/dp/B01B3P6NJM/ref=mt_kindle?_encoding=UTF8&me=

This book introduces core elements of causal inference into undergraduate and lower-division graduate classes in statistics and data-intensive sciences. The aim is to provide students with the understanding of how data are generated and interpreted at the earliest stage of their statistics education. To that end, the book empowers students with models and tools that answer nontrivial causal questions using vivid examples and simple mathematics. Topics include: causal models, model testing, effects of interventions, mediation and counterfactuals, in both linear and nonparametric systems.

The Table of Contents, Preface and excerpts from the four chapters can be viewed here:
http://bayes.cs.ucla.edu/PRIMER/
A book website providing answers to homework problems and interactive computer programs for simulation and analysis (using dagitty) is currently under construction.

B1
We are in receipt of the fourth edition of Rex Kline’s book “Principles and Practice of Structural Equation Modeling”, http://psychology.concordia.ca/fac/kline/books/nta.pdf

This book is unique in that it treats structural equation models (SEMs) as carriers of causal assumptions and tools for causal inference. Gone are the inhibitions and trepidation that characterize most SEM texts in their treatments of causation.

To the best of my knowledge, Chapter 8 in Kline’s book is the first SEM text to introduce graphical criteria for parameter identification, a long overdue tool in a field that depends on identifiability for model “fitting”. Overall, the book elevates SEM education to new heights and promises to usher in a renaissance for a field that, five decades ago, pioneered causal analysis in the behavioral sciences.

B2
Much has been written lately on computer ethics, morality, and free will. The new book “Programming Machine Ethics” by Luis Moniz Pereira and Ari Saptawijaya formalizes these concepts in the language of logic programming. See book announcement http://www.springer.com/gp/book/9783319293530. As a novice to the literature on ethics and morality, I was happy to find a comprehensive compilation of the many philosophical works on these topics, articulated in a language that even a layman can comprehend. I was also happy to see the critical role that the logic of counterfactuals plays in moral reasoning. The book is a refreshing reminder that there is more to counterfactual reasoning than “average treatment effects”.

C. News, Journals, awards and other frills.
C1.
Nominations are Invited for the Causality in Statistics Education Award (Deadline is February 15, 2016).

The ASA Causality in Statistics Education Award is aimed at encouraging the teaching of basic causal inference in introductory statistics courses. Co-sponsored by Microsoft Research and Google, the prize is motivated by the growing importance of introducing core elements of causal inference into undergraduate and lower-division graduate classes in statistics. For more information, please see http://www.amstat.org/education/causalityprize/ .

Nominations and questions should be sent to the ASA office at educinfo@amstat.org . The nomination deadline is February 15, 2016.

C.2.
Issue 4.1 of the Journal of Causal Inference is scheduled to appear March 2016, with articles covering all aspects of causal analysis. For mission, policy, and submission information please see: http://degruyter.com/view/j/jci

C.3
Finally, enjoy new results and new insights posted on our technical report page: http://bayes.cs.ucla.edu/csl_papers.html

Judea

May 14, 2015

Causation without Manipulation

The second part of our latest post “David Freedman, Statistics, and Structural Equation Models” (May 6, 2015) has stimulated a lively email discussion among colleagues from several disciplines. In what follows, I will be sharing the highlights of the discussion, together with my own position on the issue of manipulability.

Many of the discussants noted that manipulability is strongly associated (if not equated) with “comfort of interpretation”. For example, we feel more comfortable interpreting sentences of the type “If we do A, then B would be more likely” compared with sentences of the type “If A were true, then B would be more likely”. Some attribute this association to the fact that empirical researchers (say epidemiologists) are interested exclusively in interventions and preventions, not in hypothetical speculations about possible states of the world. The question was raised as to why we get this sense of comfort. Reference was made to the new book by Tyler VanderWeele, where this question is answered quite eloquently:

“It is easier to imagine the rest of the universe being just as it is if a patient took pill A rather than pill B than it is trying to imagine what else in the universe would have had to be different if the temperature yesterday had been 30 degrees rather than 40. It may be the case that human actions seem sufficiently free that we have an easier time imagining only one specific action being different, and nothing else.”
(T. VanderWeele, Explanation in Causal Inference, pp. 453-455)

This sensation of discomfort with non-manipulable causation stands in contrast to the practice of SEM analysis, in which causes are represented as relations among interacting variables, free of external manipulation. To explain this contrast, I note that we should not overlook the purpose for which SEM was created: the representation of scientific knowledge. Even if we agree with the notion that the ultimate purpose of all knowledge is to guide actions and policies, not to engage in hypothetical speculations, the question still remains: How do we encode this knowledge in the mind (or in textbooks) so that it can be accessed, communicated, updated and used to guide actions and policies? By “how” I am concerned with the code, the notation, its syntax, and its format.

There was a time when empirical scientists could dismiss questions of this sort (i.e., “how do we encode”) as psychological curiosa, residing outside the province of “objective” science. But now that we have entered the enterprise of causal inference, and we express concerns over the comfort and discomfort of interpreting counterfactual utterances, we no longer have the luxury of ignoring those questions; we must ask: how do scientists encode knowledge, because this question holds the key to the distinction between the comfortable and the uncomfortable, the clear versus the ambiguous.

The reason I prefer the SEM specification of knowledge over a manipulation-restricted specification comes from the realization that SEM matches the format in which humans store scientific knowledge. (Recall, by “SEM” we mean a manipulation-free society of variables, each listening to the others and each responding to what it hears.) In support of this realization, I would like to copy below a paragraph from Wikipedia’s entry on Cholesterol, section on “Clinical Significance.” (It is about 20 lines long, but worth a serious linguistic analysis.)

——————–from Wikipedia, dated 5/10/15 —————
According to the lipid hypothesis, abnormal cholesterol levels (hypercholesterolemia) or, more properly, higher concentrations of LDL particles and lower concentrations of functional HDL particles are strongly associated with cardiovascular disease because these promote atheroma development in arteries (atherosclerosis). This disease process leads to myocardial infarction (heart attack), stroke, and peripheral vascular disease. Since higher blood LDL, especially higher LDL particle concentrations and smaller LDL particle size, contribute to this process more than the cholesterol content of the HDL particles, LDL particles are often termed “bad cholesterol” because they have been linked to atheroma formation. On the other hand, high concentrations of functional HDL, which can remove cholesterol from cells and atheroma, offer protection and are sometimes referred to as “good cholesterol”. These balances are mostly genetically determined, but can be changed by body build, medications, food choices, and other factors. [54] Resistin, a protein secreted by fat tissue, has been shown to increase the production of LDL in human liver cells and also degrades LDL receptors in the liver. As a result, the liver is less able to clear cholesterol from the bloodstream. Resistin accelerates the accumulation of LDL in arteries, increasing the risk of heart disease. Resistin also adversely impacts the effects of statins, the main cholesterol-reducing drug used in the treatment and prevention of cardiovascular disease.
————-end of quote ——————

My point in quoting this paragraph is to show that, even in “clinical significance” sections, most of the relationships are predicated upon states of variables, as opposed to manipulations of variables. They talk about being “present” or “absent”, being at high concentration or low concentration, smaller particles or larger particles; they talk about variables “enabling,” “disabling,” “promoting,” “leading to,” “contributing to,” etc. Only two of the sentences refer directly to exogenous manipulations, as in “can be changed by body build, medications, food choices…”

This manipulation-free society of sensors and responders that we call “scientific knowledge” is not oblivious to the world of actions and interventions; it was actually created to (1) guide future actions and (2) learn from interventions.

(1) The first frontier is well known. Given a fully specified SEM, we can predict the effect of compound interventions, both static and time varying, pre-planned or dynamic. Moreover, given a partially specified SEM (e.g., a DAG) we can often use data to fill in the missing parts and predict the effect of such interventions. These require, however, that the interventions be specified by “setting” the values of one or several variables. When the action of interest is more complex, say a disjunctive action like “paint the wall green or blue” or “practice at least 15 minutes a day,” a more elaborate machinery is needed to infer its effects from the atomic actions and counterfactuals that the model encodes (see http://ftp.cs.ucla.edu/pub/stat_ser/r359.pdf and Hernán et al., 2011). Such derivations are nevertheless feasible from SEM without enumerating the effects of all disjunctive actions of the form “do A or B” (which is obviously infeasible).

(2) The second frontier, learning from interventions, is less developed. We can of course check, using the methods above, whether a given SEM is compatible with the results of experimental studies (Causality, Def. 1.3.1). We can also determine the structure of an SEM from a systematic sequence of experimental studies. What we are still lacking, though, are methods of incremental updating: given an SEM M and an experimental study that is incompatible with M, modify M so as to match the new study without violating previous studies, even though only their ramifications, not the studies themselves, are encoded in M.

Going back to the sensation of discomfort that people usually express vis-à-vis non-manipulable causes: should such discomfort bother users of SEM when confronting non-manipulable causes in their models? More concretely, should the difficulty of imagining “what else in the universe would have had to be different if the temperature yesterday had been 30 degrees rather than 40” be a reason for misinterpreting a model that contains variables labeled “temperature” (the cause) and “sweating” (the effect)? My answer is: No. At the deductive phase of the analysis, when we have a fully specified model before us, the model tells us precisely what else would be different if the temperature yesterday had been 30 degrees rather than 40.

Consider the sentence “Mary would not have gotten pregnant had she been a man”. I believe most of us would agree with the truth of this sentence despite the fact that we may not have a clue what else in the universe would have had to be different had Mary been a man. And if the model is any good, it would imply that regardless of other things being different (e.g. Mary’s education, income, self esteem etc.) she would not have gotten pregnant. Therefore, the phrase “had she been a man” should not be automatically rejected by interventionists as meaningless — it is quite meaningful.

Now consider the sentence: “If Mary were a man, her salary would be higher.” Here the discomfort is usually greater, presumably because not only can we not imagine what else in the universe would have had to be different had Mary been a man, but those things (education, self esteem, etc.) now make a difference in the outcome (salary). Are we justified now in declaring discomfort? Not when we are reading our model. Given a fully specified SEM, in which gender, education, income, and self esteem are bona fide variables, one can compute precisely how those factors would be affected by a gender change. Complaints of “how do we know” are legitimate at the model construction phase, but not when we assume having a fully specified model before us and merely ask for its ramifications.

To summarize, I believe the discomfort with non-manipulated causes represents a confusion between model utilization and model construction. In the former phase counterfactual sentences are well defined regardless of whether the antecedent is manipulable. It is only when we are asked to evaluate a counterfactual sentence by intuitive, unaided judgment, that we feel discomfort and we are provoked to question whether the counterfactual is “well defined”. Counterfactuals are always well defined relative to a given model, regardless of whether the antecedent is manipulable or not.

This takes us to the key question of whether our models should be informed by the manipulability restriction, and how. Interventionists attempt to convince us that the very concept of causation hinges on manipulability and, hence, that a causal model void of manipulability information is incomplete, if not meaningless. We saw above that SEM, as a representation of scientific knowledge, manages quite well without the manipulability restriction. I would therefore be eager to hear from interventionists what their conception is of “scientific knowledge”, and whether they can envision an alternative to SEM which is informed by the manipulability restriction, and yet provides a parsimonious account of that which we know about the world.

My appeal to interventionists to provide alternatives to SEM has so far not been successful. Perhaps readers care to suggest some? The comment section below is open for suggestions, disputations and clarifications.

May 6, 2015

David Freedman, Statistics, and Structural Equation Models

Filed under: Causal Effect,Counterfactual,Definition,structural equations — moderator @ 12:40 am

(Re-edited: 5/6/15, 4 pm)

Michael A Lewis (Hunter College) sent us the following query:

Dear Judea,
I was reading a book by the late statistician David Freedman and in it he uses the term “response schedule” to refer to an equation which represents a causal relationship between variables. It appears that he’s using that term as a synonym for “structural equation” the one you use. In your view, am I correct in regarding these as synonyms? Also, Freedman seemed to be of the belief that response schedules only make sense if the causal variable can be regarded as amenable to manipulation. So variables like race, gender, maybe even socioeconomic status, etc. cannot sensibly be regarded as causes since they can’t be manipulated. I’m wondering what your view is of this manipulation perspective.
Michael


My answer is: Yes. Freedman’s “response schedule” is a synonym for “structural equation.” The reason Freedman did not say so explicitly has to do with his long and rather bumpy journey from statistical to causal thinking. Freedman, like most statisticians in the 1980’s, could not make sense of the Structural Equation Models (SEM) that social scientists (e.g., Duncan) and econometricians (e.g., Goldberger) had adopted for representing causal relations. As a result, he criticized and ridiculed this enterprise relentlessly. In his (1987) paper “As others see us,” for example, he went as far as “proving” that the entire enterprise is grounded in logical contradictions. The fact that SEM researchers at that time could not defend their enterprise effectively (they were as confused about SEM as statisticians, judging by the way they responded to his paper) only intensified Freedman’s criticism. It continued well into the 1990’s, with renewed attacks on anything connected with causality, including the causal search program of Spirtes, Glymour, and Scheines.

I have had a long and friendly correspondence with Freedman since 1993 and, going over a file of over 200 emails, it appears that it was around 1994 when he began to convert to causal thinking. First through the do-operator (by his own admission) and, later, by realizing that structural equations offer a neat way of encoding counterfactuals.

I speculate that the reason Freedman could not say plainly that causality is based on structural equations was that it would have been too hard for him to admit that he had erred in criticizing a model that he misunderstood, and one that is so simple to understand. This oversight was not entirely his fault; for someone trying to understand the world from a statistical viewpoint, structural equations do not make any sense; the asymmetric nature of the equations and those slippery “error terms” stand outside the prism of the statistical paradigm. Indeed, even today, very few statisticians feel comfortable in the company of structural equations. (How many statistics textbooks do we know that discuss structural equations?)

So, what do you do when you come to realize that a concept you ridiculed for 20 years is the key to understanding causation? Freedman decided not to say “I erred,” but to argue that the concept was not rigorous enough for statisticians to understand. He thus formalized “response schedule” and treated it as a novel mathematical object. The fact is, however, that if we strip “response schedule” of its superlatives, we find that it is just what you and I call a “function,” i.e., a mapping from the states of one variable to the states of another. Some of Freedman’s disciples admire this invention (see R. Berk’s 2004 book on regression), but most people that I know just look at it and say: this is what a structural equation is.

The story of David Freedman is the story of statistical science itself and the painful journey the field has taken through the causal reformation. Starting with the structural equations of Sewall Wright (1921), and going through Freedman’s “response schedule,” the field still can’t swallow the fundamental building block of scientific thinking, in which Nature is encoded as a society of sensing and responding variables. Funny, econometrics has yet to start its reformation, though it has housed SEM since Haavelmo (1943). (How many econometrics textbooks do we know that teach students how to read counterfactuals from structural equations?)


I now go to your second question, concerning the mantra “no causation without manipulation.” I do not believe anyone takes this slogan as a restriction nowadays, including its authors, Holland and Rubin. It will remain a relic of an era when statisticians tried to define causation with the only mental tool available to them: the randomized controlled trial (RCT).

I summed it up in Causality, 2009, p. 361: “To suppress talk about how gender causes the many biological, social, and psychological distinctions between males and females is to suppress 90% of our knowledge about gender differences.”

I further elaborated on this issue in (Bollen and Pearl 2014 p. 313) saying:

"Pearl (2011) further shows that this restriction has led to harmful consequences by forcing investigators to compromise their research questions only to avoid the manipulability restriction. The essential ingredient of causation, as argued in Pearl (2009: 361), is responsiveness, namely, the capacity of some variables to respond to variations in other variables, regardless of how those variations came about."

In Causality (2009, p. 361) I also find this paragraph: "It is for that reason, perhaps, that scientists invented counterfactuals; it permits them to state and conceive the realization of antecedent conditions without specifying the physical means by which these conditions are established."

All in all, you have touched on one of the most fascinating chapters in the history of science, featuring a respectable scientific community that clings desperately to an outdated dogma, while resisting, adamantly, the light that shines around it. This chapter deserves a major headline in Kuhn's book on scientific revolutions. As I once wrote: "It is easier to teach Copernicus in the Vatican than discuss causation with a statistician." But this was in the 1990s, before causal inference became fashionable. Today, after a vicious 100-year war of reformation, things are beginning to change (See http://www.nasonline.org/programs/sackler-colloquia/completed_colloquia/Big-data.html). I hope your upcoming book further accelerates the transition.

April 24, 2015

Flowers of the First Law of Causal Inference (3)

Flower 3 — Generalizing experimental findings

Continuing our examination of “the flowers of the First Law” (see previous flowers here and here) this posting looks at one of the most crucial questions in causal inference: “How generalizable are our randomized clinical trials?” Readers of this blog would be delighted to learn that one of our flowers provides an elegant and rather general answer to this question. I will describe this answer in the context of transportability theory, and compare it to the way researchers have attempted to tackle the problem using the language of ignorability. We will see that ignorability-type assumptions are fairly limited, both in their ability to define conditions that permit generalizations, and in our ability to justify them in specific applications.

1. Transportability and Selection Bias
The problem of generalizing experimental findings from the trial sample to the population as a whole, also known as the problem of “sample selection-bias” (Heckman, 1979; Bareinboim et al., 2014), has received wide attention lately, as more researchers come to recognize this bias as a major threat to the validity of experimental findings in both the health sciences (Stuart et al., 2015) and social policy making (Manski, 2013).

Since participation in a randomized trial cannot be mandated, we cannot guarantee that the study population would be the same as the population of interest. For example, the study population may consist of volunteers who respond to financial and medical incentives offered by pharmaceutical firms or experimental teams, so the distribution of outcomes in the study may differ substantially from the distribution of outcomes under the policy of interest.

Another impediment to the validity of experimental findings is that the types of individuals in the target population may change over time. For example, as more individuals become eligible for health insurance, the types of individuals seeking services would no longer match the types of individuals that were sampled for the study. A similar change would occur as more individuals become aware of the efficacy of the treatment. The result is an inherent disparity between the target population and the population under study.

The problem of generalizing across disparate populations has received a formal treatment in (Pearl and Bareinboim, 2014), where it was labeled "transportability," and where necessary and sufficient conditions for valid generalization were established (see also Bareinboim and Pearl, 2013). The problem of selection bias, though it has some unique features, can also be viewed as a nuance of the transportability problem, thus inheriting all the theoretical results established in (Pearl and Bareinboim, 2014) that guarantee valid generalizations. We will describe the two problems side by side and then return to the distinction between the types of assumptions that are needed for enabling generalizations.

The transportability problem concerns two dissimilar populations, Π and Π*, and requires us to estimate the average causal effect P*(yx) (explicitly: P*(yx) ≡ P*(Y = y|do(X = x))) in the target population Π*, based on experimental studies conducted on the source population Π. Formally, we assume that all differences between Π and Π* can be attributed to a set of factors S that produce disparities between the two, so that P*(yx) = P(yx|S = 1). The information available to us consists of two parts: first, treatment effects estimated from experimental studies in Π and, second, observational information extracted from both Π and Π*. The former can be written P(y|do(x), z), where Z is a set of covariates measured in the experimental study, and the latter are written P*(x, y, z) = P(x, y, z|S = 1) and P(x, y, z), respectively. In addition to this information, we are also equipped with a qualitative causal model M that encodes causal relationships in Π and Π*, with the help of which we need to identify the query P*(yx). Mathematically, identification amounts to transforming the query expression

P*(yx) = P(y|do(x), S = 1)

into a form derivable from the available information ITR, where

ITR = { P(y|do(x),z),  P(x,y,z|S = 1),   P(x,y,z) }.

The selection bias problem is slightly different. Here the aim is to estimate the average causal effect P(yx) in the Π population, while the experimental information available to us, ISB, comes from a preferentially selected sample, S = 1, and is given by P(y|do(x), z, S = 1). Thus, the selection bias problem calls for transforming the query P(yx) to a form derivable from the information set:

ISB = { P(y|do(x),z,S = 1), P(x,y,z|S = 1), P(x,y,z) }.

In the Appendix section, we demonstrate how transportability problems and selection bias problems are solved using the transformations described above.

The analysis reported in (Pearl and Bareinboim, 2014) has resulted in an algorithmic criterion (Bareinboim and Pearl, 2013) for deciding whether transportability is feasible and, when confirmed, the algorithm produces an estimand for the desired effects. The algorithm is complete, in the sense that, when it fails, a consistent estimate of the target effect does not exist (unless one strengthens the assumptions encoded in M).

There are several lessons to be learned from this analysis when considering selection bias problems.

1. The graphical criteria that authorize transportability are applicable to selection bias problems as well, provided that the graph structures for the two problems are identical. This means that whenever a selection bias problem is characterized by a graph for which transportability is feasible, recovery from selection bias is feasible by the same algorithm. (The Appendix demonstrates this correspondence.)

2. The graphical criteria for transportability are more involved than the ones usually invoked in testing treatment assignment ignorability (e.g., through the back-door test). They may require several d-separation tests on several sub-graphs. It is utterly unimaginable, therefore, that such criteria could be managed by unaided human judgment, no matter how ingenious. (See discussions with Guido Imbens regarding computational barriers to graph-free causal inference, click here). Graph avoiders should reckon with this predicament.

3. In general, problems associated with external validity cannot be handled by balancing disparities between distributions. The same disparity between P(x, y, z) and P*(x, y, z) may demand different adjustments, depending on the location of S in the causal structure. A simple example of this phenomenon is demonstrated in Fig. 3(b) of (Pearl and Bareinboim, 2014), where a disparity in the average reading ability of two cities requires two different treatments, depending on what causes the disparity. If the disparity emanates from age differences, adjustment is necessary, because age is likely to affect the potential outcomes. If, on the other hand, the disparity emanates from differences in educational programs, no adjustment is needed, since education, in itself, does not modify response to treatment. The distinction is made formal and vivid in causal graphs.

4. In many instances, generalizations can be achieved by conditioning on post-treatment variables, an operation that is frowned upon in the potential-outcome framework (Rosenbaum, 2002, pp. 73–74; Rubin, 2004; Sekhon, 2009) but has become extremely useful in graphical analysis. The difference between the conditioning operators used in these two frameworks is echoed in the difference between Qc and Qdo, the two z-specific effects discussed in a previous posting on this blog (link). The latter defines information that is estimable from experimental studies, whereas the former invokes a retrospective counterfactual that may or may not be estimable empirically.

In the next Section we will discuss the benefit of leveraging the do-operator in problems concerning generalization.

2. Ignorability versus Admissibility in the Pursuit of Generalization

A key assumption in almost all conventional analyses of generalization (from sample to population) is S-ignorability, written Yx ⊥ S|Z, where Yx is the potential outcome predicated on the intervention X = x, S is a selection indicator (with S = 1 standing for selection into the sample), and Z is a set of observed covariates. This condition, sometimes written as a difference Y1 − Y0 ⊥ S|Z, and sometimes as a conjunction {Y1, Y0} ⊥ S|Z, appears in Hotz et al. (2005), Cole and Stuart (2010), Tipton et al. (2014), Hartman et al. (2015), and possibly other works committed to potential-outcome analysis. This assumption says: If we succeed in finding a set Z of pre-treatment covariates such that cross-population differences disappear in every stratum Z = z, then the problem can be solved by averaging over those strata. (Lacking a procedure for finding Z, this solution avoids the harder part of the problem and, in this sense, somewhat borders on the circular. It amounts to saying: If we can solve the problem in every stratum Z = z, then the problem is solved; hardly an informative statement.)

In graphical analysis, on the other hand, the problem of generalization has been studied using another condition, labeled S-admissibility (Pearl and Bareinboim, 2014), which is defined by:

P(y|do(x), z) = P(y|do(x), z, s)

or, using counterfactual notation,

P(yx|zx) = P(yx|zx, sx)

It states that in every treatment regime X = x, the observed outcome Y is conditionally independent of the selection mechanism S, given Z, all evaluated at that same treatment regime.

Clearly, S-admissibility coincides with S-ignorability for pre-treatment S and Z; the two notions differ, however, for treatment-dependent covariates. The Appendix presents scenarios (Fig. 1(a) and (b)) in which post-treatment covariates Z do not satisfy S-ignorability, but satisfy S-admissibility and, thus, enable generalization to take place. We also present scenarios where both S-ignorability and S-admissibility hold and, yet, experimental findings are not generalizable by standard procedures of post-stratification. Rather, the correct procedure is uncovered naturally from the graph structure.

One of the reasons that S-admissibility has received greater attention in the graph-based literature is that it has a very simple graphical representation: Z and X should separate Y from S in a mutilated graph, from which all arrows entering X have been removed. Such a graph depicts conditional independencies among observed variables in the population under experimental conditions, i.e., where X is randomized.
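This mutilated-graph separation test lends itself to a mechanical check. Below is a minimal sketch, not taken from the post: the moralization-based d-separation routine, the dictionary graph encoding, and the toy model S → Z ← X → Y with Z → Y are my own illustrative choices. It verifies that {X, Z} separates Y from S once all arrows entering X are removed (here X is already parentless, as under randomization).

```python
# Sketch of the graphical test for S-admissibility: check that {Z, X}
# d-separates Y from S in the mutilated graph (arrows entering X removed).
# Graph encoding and node names are illustrative assumptions.

def ancestors(parents, nodes):
    """All ancestors of `nodes` in the DAG, including the nodes themselves."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents.get(n, ()))
    return seen

def d_separated(parents, xs, ys, zs):
    """True iff xs are d-separated from ys given zs in the DAG {node: parent-set}."""
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    # Moralize the ancestral subgraph: undirected parent-child edges,
    # plus edges between co-parents of a common child.
    adj = {n: set() for n in keep}
    for child in keep:
        ps = [p for p in parents.get(child, ()) if p in keep]
        for p in ps:
            adj[p].add(child); adj[child].add(p)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # Delete conditioning nodes, then test connectivity.
    blocked = set(zs)
    seen, stack = set(), [x for x in xs if x not in blocked]
    while stack:
        n = stack.pop()
        if n in seen or n in blocked:
            continue
        seen.add(n)
        stack.extend(adj[n] - blocked)
    return not (seen & set(ys))

# Toy mutilated graph: S -> Z, X -> Z, Z -> Y, X -> Y; no arrows enter X.
parents = {"Z": {"S", "X"}, "Y": {"X", "Z"}, "X": set(), "S": set()}
print(d_separated(parents, ["Y"], ["S"], ["X", "Z"]))  # True
```

Dropping Z from the conditioning set reopens the path S → Z → Y, and the same routine returns False, which is the kind of check the criterion calls for.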

In contrast, S-ignorability has not been given a simple graphical interpretation, but it can be verified from either twin networks (Causality, pp. 213-4) or from counterfactually augmented graphs (Causality, p. 341), as we have demonstrated in an earlier posting on this blog (link). Using either representation, it is easy to see that S-ignorability is rarely satisfied in transportability problems in which Z is a post-treatment variable. This is because, whenever S is a proxy to an ancestor of Z, Z cannot separate Yx from S.

The simplest result of both PO and graph-based approaches is the re-calibration or post-stratification formula. It states that, if Z is a set of pre-treatment covariates satisfying S-ignorability (or S-admissibility), then the causal effect in the population at large can be recovered from a selection-biased sample by a simple re-calibration process. Specifically, if P(yx|S = 1,Z = z) is the z-specific probability distribution of Yx in the sample, then the distribution of Yx in the population at large is given by

P(yx) = ∑z  P(yx|S = 1,z)   P(z)  (*)

where P(z) is the probability of Z = z in the target population (where S = 0). Equation (*) follows from S-ignorability by conditioning on z and adding S = 1 to the conditioning set (a one-line proof). The proof fails, however, when Z is treatment dependent, because the counterfactual factor P(yx|S = 1, z) is not normally estimable in the experimental study. (See the Qc vs. Qdo discussion here.)
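For concreteness, here is a minimal numeric sketch of Eq. (*). The z-specific trial results and the target distribution of the binary pre-treatment covariate Z are made-up numbers, chosen only to show the re-calibration arithmetic.

```python
# Minimal numeric sketch of the re-calibration formula (*).
# All numbers are hypothetical, for illustration only.
p_yx_given_s1_z = {0: 0.30, 1: 0.70}   # P(y_x | S=1, Z=z), estimated in the sample
p_z_target = {0: 0.80, 1: 0.20}        # P(z) in the target population

# P(y_x) = sum_z P(y_x | S=1, z) * P(z)
p_yx = sum(p_yx_given_s1_z[z] * p_z_target[z] for z in p_z_target)
print(round(p_yx, 2))  # 0.38, i.e. 0.30*0.80 + 0.70*0.20
```

The sample-only average would be weighted by the sample's own distribution of Z; the formula replaces those weights with the target population's, which is all "re-calibration" means here.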

As noted in (Keiding, 1987), this re-calibration formula goes back to 18th-century demographers (Dale, 1777; Tetens, 1786) facing the task of predicting overall mortality (across populations) from age-specific data. Their reasoning was probably as follows: If the source and target populations differ in distribution by a set of attributes Z, then to correct for these differences we need to weight samples by a factor that would restore similarity to the two distributions. Some researchers view Eq. (*) as a version of Horvitz and Thompson's (1952) post-stratification method of estimating the mean of a super-population from unrepresentative stratified samples. The essential difference between survey sampling calibration and the calibration required in Eq. (*) is that the calibrating covariates Z are not just any set by which the distributions differ; they must satisfy the S-ignorability (or admissibility) condition, which is a causal, not a statistical, condition. It is therefore not discernible from distributions over observed variables. In other words, the re-calibration formula should depend on disparities between the causal models of the two populations, not merely on distributional disparities. This is demonstrated explicitly in Fig. 4(c) of (Pearl and Bareinboim, 2014), which is also treated in the Appendix (Fig. 1(a)).

While S-ignorability and S-admissibility are both sufficient for re-calibrating pre-treatment covariates Z, S-admissibility goes further and permits generalizations in cases where Z consists of post-treatment covariates. A simple example is the bio-marker model shown in Fig. 4(c) (Example 3) of (Pearl and Bareinboim, 2014), which is also discussed in the Appendix.

Conclusions

1. Many opportunities for generalization are opened up through the use of post-treatment variables. These opportunities remain inaccessible to ignorability-based analysis, partly because S-ignorability does not always hold for such variables but, mainly, because ignorability analysis requires information in the form of z-specific counterfactuals, which is often not estimable from experimental studies.

2. Most of these opportunities have been charted through the completeness results for transportability (Bareinboim et al., 2014); others can be revealed by simple derivations in do-calculus, as shown in the Appendix.

3. There is still the issue of assisting researchers in judging whether S-ignorability (or S-admissibility) is plausible in any given application. Graphs excel in this dimension because graphs match the format in which people store scientific knowledge. Some researchers prefer to do it by direct appeal to intuition; they do so at their own peril.

For references and appendix, click here.

January 22, 2015

Flowers of the First Law of Causal Inference (2)

Flower 2 — Conditioning on post-treatment variables

In this second flower of the First Law, I share with readers interesting relationships among various ways of extracting information from post-treatment variables. These relationships came up in conversations with readers, students, and curious colleagues, so I will present them in a question-and-answer format.

Question-1
Rule 2 of do-calculus does not distinguish post-treatment from pre-treatment variables. Thus, regardless of the nature of Z, it permits us to replace P(y|do(x), z) with P(y|x, z) whenever Z separates X from Y in the mutilated graph GX (i.e., the causal graph from which arrows emanating from X are removed). How can this rule be correct, when we know that one should be careful about conditioning on a post-treatment variable Z?

Example 1 Consider the simple causal chain X → Y → Z. We know that if we condition on Z (as in case-control studies), selected units cease to be representative of the population, and we cannot identify the causal effect of X on Y even when X is randomized. Applying Rule 2, however, we get P(y|do(x), z) = P(y|x, z) (since X and Y are separated, given Z, in the mutilated graph in which the arrow X → Y is removed, leaving X isolated and Y → Z). This tells us that the causal effect of X on Y IS identifiable conditioned on Z. Something must be wrong here.
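One can check by exact computation that Rule 2 itself is not the culprit: with X randomized, the quantity P(y|do(x), z) really does equal P(y|x, z) in this chain. The conditional tables below are made up for illustration; the puzzle, as the Qc vs. Qdo posting discussed earlier suggests, lies in what this identified quantity does and does not mean.

```python
from itertools import product

# Exact check of Rule 2 on the chain X -> Y -> Z with X randomized.
# Conditional tables are hypothetical, for illustration only.
p_x = {0: 0.5, 1: 0.5}
p_y_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # P(Y=y | X=x)
p_z_given_y = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # P(Z=z | Y=y)

def joint(px):
    """Joint P(x, y, z); passing a degenerate px implements do(X = x)."""
    return {(x, y, z): px[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
            for x, y, z in product([0, 1], repeat=3)}

obs = joint(p_x)                    # observational joint
do1 = joint({0: 0.0, 1: 1.0})       # joint under do(X = 1)

def cond_y(j, x, y, z):
    """P(Y=y | X=x, Z=z) computed from a joint table j."""
    return j[(x, y, z)] / sum(j[(x, yy, z)] for yy in [0, 1])

lhs = cond_y(do1, 1, 1, 1)   # P(Y=1 | do(X=1), Z=1)
rhs = cond_y(obs, 1, 1, 1)   # P(Y=1 | X=1, Z=1)
print(abs(lhs - rhs) < 1e-12)  # True: Rule 2 holds here
```

The two conditionals coincide because X, stripped of its outgoing arrow, is separated from Y given Z; the apparent paradox must therefore come from interpreting P(y|do(x), z), not from deriving it.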

To read more, click here.

December 22, 2014

Flowers of the First Law of Causal Inference

Filed under: Counterfactual,Definition,General,structural equations — judea @ 5:22 am

Flower 1 — Seeing counterfactuals in graphs

Some critics of structural equations models and their associated graphs have complained that those graphs depict only observable variables but: “You can’t see the counterfactuals in the graph.” I will soon show that this is not the case; counterfactuals can in fact be seen in the graph, and I regard it as one of many flowers blooming out of the First Law of Causal Inference (see here). But, first, let us ask why anyone would be interested in locating counterfactuals in the graph.

This is not a rhetorical question. Those who deny the usefulness of graphs will surely not yearn to find counterfactuals there. For example, researchers in the Imbens-Rubin camp who, ostensibly, encode all scientific knowledge in the “Science” = Pr(W,X,Y(0),Y(1)), can, theoretically, answer all questions about counterfactuals straight from the “science”; they do not need graphs.

On the other extreme we have students of SEM, for whom counterfactuals are but byproducts of the structural model (as the First Law dictates); so, they too do not need to see counterfactuals explicitly in their graphs. For these researchers, policy intervention questions do not require counterfactuals, because those can be answered directly from the SEM-graph, in which the nodes are observed variables. The same applies to most counterfactual questions, for example, the effect of treatment on the treated (ETT) and mediation problems; graphical criteria have been developed to determine their identification conditions, as well as their resulting estimands (see here and here).

So, who needs to see counterfactual variables explicitly in the graph?

There are two camps of researchers who may benefit from such representation. First, researchers in the Morgan-Winship camp (link here) who are using, interchangeably, both graphs and potential outcomes. These researchers prefer to do the analysis using probability calculus, treating counterfactuals as ordinary random variables, and use graphs only when the algebra becomes helpless. Helplessness arises, for example, when one needs to verify whether causal assumptions that are required in the algebraic derivations (e.g., ignorability conditions) hold true in one's model of reality. These researchers understand that "one's model of reality" means one's graph, not the "Science" = Pr(W,X,Y(0),Y(1)), which is cognitively inaccessible. So, although most of the needed assumptions can be verified without counterfactuals from the SEM-graph itself (e.g., through the back-door condition), the fact that their algebraic expressions already carry counterfactual variables makes it more convenient to see those variables represented explicitly in the graph.

The second camp of researchers are those who do not believe that scientific knowledge is necessarily encoded in an SEM-graph. For them, the “Science” = Pr(W,X,Y(0),Y(1)), is the source of all knowledge and assumptions, and a graph may be constructed, if needed, as an auxiliary tool to represent sets of conditional independencies that hold in Pr(*). [I was surprised to discover sizable camps of such researchers in political science and biostatistics; possibly because they were exposed to potential outcomes prior to studying structural equation models.] These researchers may resort to other graphical representations of independencies, not necessarily SEM-graphs, but occasionally seek the comfort of the meaningful SEM-graph to facilitate counterfactual manipulations. Naturally, they would prefer to see counterfactual variables represented as nodes on the SEM-graph, and use d-separation to verify conditional independencies, when needed.

After this long introduction, let us see where the counterfactuals are in an SEM-graph. They can be located in two ways: first, by augmenting the graph with new nodes that represent the counterfactuals and, second, by mutilating the graph slightly and using existing nodes to represent the counterfactuals.

The first method is illustrated in chapter 11 of Causality (2nd Ed.) and can be accessed directly here. The idea is simple: According to the structural definition of counterfactuals, Y(0) (similarly Y(1)) represents the value of Y under a condition where X is held constant at X = 0. Statistical variations of Y(0) would therefore be governed by all exogenous variables capable of influencing Y when X is held constant, i.e., when the arrows entering X are removed. We are done, because connecting these variables to a new node labeled Y(0) (similarly Y(1)) creates the desired representation of the counterfactual. The book section linked above illustrates this construction in visual detail.
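A simplified code sketch of this augmentation may help. The graph encoding, function name, and toy model are my own illustrative choices, and the sketch assumes, for simplicity, that Y's other parents are not themselves affected by X (the full construction in chapter 11 handles the general case).

```python
# Simplified sketch of the first (augmentation) method: add a node for
# the counterfactual Y(x), whose parents are Y's parents other than X,
# since Y(x) varies with whatever influences Y once X is held constant.
# Names and encoding are illustrative, not from the book.

def add_counterfactual_node(parents, y, x):
    """Return a copy of the DAG with a new node for the counterfactual of y under x."""
    g = {n: set(ps) for n, ps in parents.items()}
    g[f"{y}({x})"] = g[y] - {x}   # same influences as Y, minus the held-fixed X
    return g

# Toy model: U -> X, U -> Y, X -> Y.
parents = {"X": {"U"}, "Y": {"X", "U"}, "U": set()}
aug = add_counterfactual_node(parents, "Y", "X")
print(sorted(aug["Y(X)"]))  # ['U']: Y(X) depends on U, but not on X
```

With the new node in place, ordinary d-separation on the augmented graph can answer questions such as whether Y(X) is independent of X given U.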

The second method mutilates the graph and uses the outcome node, Y, as a temporary surrogate for Y(x), with the understanding that the substitution is valid only under the mutilation. The mutilation required for this substitution is dictated by the First Law, and calls for removing all arrows entering the treatment variable X, as illustrated in the following graph (taken from here).
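In code, the mutilation that this method requires is essentially a one-liner; the sketch below uses the same illustrative dictionary encoding of a DAG (not the post's notation).

```python
# Sketch of the second method: represent Y(x) by the Y node of the
# mutilated model M_x, obtained by removing all arrows entering X.
# The {node: parent-set} encoding is an illustrative choice.

def mutilate(parents, x):
    """Return the graph of M_x: the same DAG with all arrows entering x removed."""
    g = {n: set(ps) for n, ps in parents.items()}
    g[x] = set()          # X keeps no parents under the intervention
    return g

# Toy model: confounder U -> X, U -> Y, and X -> Y.
parents = {"X": {"U"}, "Y": {"X", "U"}, "U": set()}
print(mutilate(parents, "X")["X"])  # set(): X has no parents in M_x
```

Within the mutilated graph, Y serves as a surrogate for Y(x), valid only so long as the mutilation is in force, which is exactly the caveat the text raises next.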

This method has some disadvantages compared with the first; the removal of X’s parents prevents us from seeing connections that might exist between Y_x and the pre-intervention treatment node X (as well as its descendants). To remedy this weakness, Shpitser and Pearl (2009) (link here) retained a copy of the pre-intervention X node, and kept it distinct from the manipulated X node.

Equivalently, Richardson and Robins (2013) spliced the X node into two parts, one to represent the pre-intervention variable X and the other to represent the constant X=x.

All in all, regardless of which variant you choose, the counterfactuals of interest can be represented as nodes in the structural graph, and inter-connections among these nodes can be used either to verify identification conditions or to facilitate algebraic operations in counterfactual logic.

Note, however, that all these variants stem from the First Law, Y(x) = Y[M_x], which DEFINES counterfactuals in terms of an operation on a structural equation model M.

Finally, to celebrate this “Flower of the First Law” and, thereby, the unification of the structural and potential outcome frameworks, I am posting a flowery photo of Don Rubin and myself, taken during Don’s recent visit to UCLA.

December 7, 2012

On Structural Equations versus Causal Bayes Networks

Filed under: Counterfactual,structural equations — eb @ 6:00 pm

We received the following query from Jim Grace, (USGS – National Wetlands Research Center) :
Hi Judea,

In your 2009 edition of Causality, on pages 26-27, you explain your reasoning for now preferring to express causal rules from a Laplacian quasi-deterministic perspective rather than stay with the stochastic conceptualization associated with Bayesian networks. It seems to me that a practical matter here is the reliance of traditional graph theory on discrete mathematics and the constraints this places on functional forms and, therefore, counterfactual arguments. Despite that clear logic, one sees the occasional discussion of "causal Bayes nets," and I wondered if you would dissuade people (if people can be dissuaded) from trying to evolve a causal modeling methodology with discrete Bayes nets as their starting point?

Judea Pearl answers:
Dear Jim,

I would not dissuade people from using either causal Bayes networks or structural equation models, because the difference between the two is so minute that it is not worth the dissuasion. The question is only what question you ask yourself when you construct the diagram. If you feel more comfortable asking: "What factors determine the value of this variable?" then you construct a structural equation model. If, on the other hand, you prefer to ask: "If I intervene and wiggle this variable, would the probability of the other variable change?" then the outcome would be a causal Bayes network. Rarely do they differ (but see the example on page 35 of Causality).
