Causal Analysis in Theory and Practice

January 24, 2018

Can DAGs Do the Un-doable?

Filed under: DAGs,Discussion — Judea Pearl @ 2:32 am

The following question was sent to us by Igor Mandel:

Separation of variables with zero causal coefficients from others
Here is a problem: imagine a researcher who has some understanding of a particular problem, and this understanding is partly or completely wrong. Can DAGs, or any other theory of causality, convincingly establish that she is wrong?

To be more specific, let’s consider a simple example with fairly indisputable causal variables (described in detail in https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2984045 ). One wants to estimate how different food ingredients affect the energy (in calories) contained in different types of food. She takes many samples and measures different things. But she does not know about the existence of fats and proteins; she knows only that there are carbohydrates, water and fiber. She builds a DAG as she feels it should be:

From our standpoint (i.e., that of educated people of the 21st century), the arrows from Fiber and Water to Calories have zero coefficients. But since the data bear significant correlations between Calories, Water and Fiber, any regression estimate will show non-zero values for these coefficients. Is there a way to say that these non-zero values are wrong, not just quantitatively but qualitatively?
An even more striking example is what is often called a “spurious correlation.” It was “statistically proven” almost 20 years ago that storks deliver babies ( http://robertmatthews.org/wp-content/uploads/2016/03/RM-storks-paper.pdf ), while many women still believe they do not. How can we convince those statistically ignorant women? Or, how can we strengthen their naïve but statistically unconfirmed beliefs, just by looking at the data and without asking them for baby-related details? What kind of DAG may help?
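Igor’s food example can be checked numerically. The following is a minimal sketch under purely hypothetical assumptions (an invented generating process in which Calories depend only on fat, protein and carbohydrates, while Water merely tracks fat and Fiber merely tracks protein): an ordinary regression that omits the true causes will confidently assign non-zero coefficients to Water and Fiber, exactly as the question anticipates.

```python
import random

random.seed(0)

# Hypothetical generating process (all numbers invented for illustration):
# Calories depend ONLY on fat, protein and carbs.  Water and fiber have
# zero causal effect, but water tracks fat and fiber tracks protein.
n = 2000
rows = []
for _ in range(n):
    fat = random.uniform(0, 20)
    protein = random.uniform(0, 30)
    carbs = random.uniform(0, 50)
    water = 80 - 2.0 * fat + random.gauss(0, 2)   # mere proxy for fat
    fiber = 0.3 * protein + random.gauss(0, 1)    # mere proxy for protein
    calories = 9 * fat + 4 * protein + 4 * carbs  # water, fiber: zero effect
    rows.append((carbs, water, fiber, calories))

def ols(rows):
    """Fit calories ~ 1 + carbs + water + fiber by solving the
    normal equations with Gauss-Jordan elimination (stdlib only)."""
    k = 4
    XtX = [[0.0] * k for _ in range(k)]
    Xty = [0.0] * k
    for carbs, water, fiber, y in rows:
        x = (1.0, carbs, water, fiber)
        for i in range(k):
            Xty[i] += x[i] * y
            for j in range(k):
                XtX[i][j] += x[i] * x[j]
    A = [XtX[i] + [Xty[i]] for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

_, b_carbs, b_water, b_fiber = ols(rows)
# The researcher's regression assigns clearly non-zero "effects" to water
# and fiber, although their true causal coefficients are zero by construction.
print(f"carbs: {b_carbs:+.2f}  water: {b_water:+.2f}  fiber: {b_fiber:+.2f}")
```

Nothing in the data flags the Water and Fiber coefficients as wrong; only knowledge of the generating process, which sits outside the data, does.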

My Response
This question, in a variety of settings, has been asked by readers of this blog since the beginning of the Causal Revolution. The idea that new tools are now available that can handle causal problems free of statistical dogmas has encouraged thousands of researchers to ask: Can you do this, or can you do that? The answers to such questions are often trivial and can be obtained directly from the logic of causal inference, without the details of the question. I am not surprised, however, that such questions surface again in 2018, since the foundations of causal inference are rarely emphasized in the technical literature, and so they tend to be forgotten.

I will answer Igor’s question as a student of the modern logic of causation.

1. Can a DAG distinguish variables with zero causal effects (on Y) from those having non-zero effects?

Of course not; no method in the world can do that without further assumptions. Here is why:
The question above concerns causal relations, and we know from first principles that no causal query can be answered from data alone, without causal information that lies outside the data.
QED
[It does not matter whether your query is quantitative or qualitative, or whether you address it to a story or to a graph. Every causal query needs causal assumptions. No causes in – no causes out (N. Cartwright).]

2. Can DAG-based methods do anything more than just quit with failure?

Of course they can.

2.1 First, notice that the distinction between having or not having a causal effect is a property of nature (or of the data-generating process), not of the model that you postulate. We can therefore ignore the diagram that Igor describes above. Now, in addition to quitting for lack of information, DAG-based methods would tell you: “If you can give me some causal information, however qualitative, I will tell you whether it is sufficient for answering your query.” I hope readers will agree with me that this kind of answer, though weaker than the one expected by the naïve inquirer, is much more informative than just quitting in despair.
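To make this kind of answer concrete, here is a minimal sketch (the three-node model and all variable names are hypothetical): given only qualitative causal assumptions, encoded as a DAG in which Z confounds the effect of X on Y, a brute-force check of the backdoor criterion reports whether a proposed set of measured covariates is sufficient for answering the query, before any data are consulted.

```python
# Hypothetical toy model (names and structure invented for illustration):
# Z is an observed confounder of the effect of X on Y.
edges = [("Z", "X"), ("Z", "Y"), ("X", "Y")]

def descendants(node, edges):
    """All nodes reachable from `node` along directed edges."""
    out, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for a, b in edges:
            if a == n and b not in out:
                out.add(b)
                frontier.append(b)
    return out

def all_paths(x, y, edges):
    """Simple paths from x to y in the undirected skeleton of the DAG."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    paths, stack = [], [[x]]
    while stack:
        p = stack.pop()
        for n in nbrs[p[-1]]:
            if n == y:
                paths.append(p + [n])
            elif n not in p:
                stack.append(p + [n])
    return paths

def blocked(path, s, edges):
    """d-separation: is this path blocked by conditioning on the set s?"""
    eset = set(edges)
    for i in range(1, len(path) - 1):
        m = path[i]
        collider = (path[i - 1], m) in eset and (path[i + 1], m) in eset
        if collider:
            if m not in s and not (descendants(m, edges) & set(s)):
                return True   # unactivated collider blocks the path
        elif m in s:
            return True       # conditioned-on non-collider blocks the path
    return False

def backdoor_ok(x, y, s, edges):
    """Does s satisfy the backdoor criterion for the effect of x on y?"""
    if descendants(x, edges) & set(s):
        return False          # s must contain no descendant of x
    eset = set(edges)
    return all(blocked(p, s, edges)
               for p in all_paths(x, y, edges)
               if (p[1], x) in eset)   # paths entering x: backdoor paths

print(backdoor_ok("X", "Y", {"Z"}, edges))   # → True: adjusting for Z suffices
print(backdoor_ok("X", "Y", set(), edges))   # → False: no adjustment fails
```

Note that the verdict is read off the structure of the postulated DAG alone; no data enter the computation, which is precisely the sense in which the method evaluates the sufficiency of the stated assumptions.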

2.2 Note also that postulating a whimsical model like the one Igor describes above has no bearing on the answer. To do anything useful in causal inference, we need to start with a model of reality, not with a model drawn by a confused researcher for whom an arrow means nothing more than “data bear significant correlation” or “regression estimates show non-zero values.”

2.3 Once you start with a postulated model of reality, DAG-based methods can be very helpful. For example, they can take your postulated model and determine which of its arrows should carry a zero coefficient, which should carry a non-zero coefficient, and which will remain undecided till the end of time.

2.4 Moreover, assume reality is governed by a model M1 and you postulate a model M2, different from M1. DAG-based methods can tell you which causal queries you will answer correctly and which you will answer incorrectly (see Section 4.3 of http://ftp.cs.ucla.edu/pub/stat_ser/r459-reprint-errata.pdf ). This is nice, because it offers us a kind of sensitivity analysis: how far can reality be from your assumed model before you start making mistakes?

2.5 Finally, DAG-based methods identify for us the testable implications of our model, so that we can test the model for compatibility with the data.
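A minimal sketch of such a test (the chain model X → Z → Y, its coefficients, and the linear-Gaussian setting are all hypothetical assumptions; the last is chosen only so that conditional independence reduces to a vanishing partial correlation): the postulated chain implies that X is independent of Y given Z, a claim the data could refute but, here, do not.

```python
import math
import random

random.seed(1)

# Hypothetical chain model X -> Z -> Y (linear-Gaussian, invented numbers).
# The DAG implies X _||_ Y | Z: a testable implication of the model.
n = 5000
xs, zs, ys = [], [], []
for _ in range(n):
    x = random.gauss(0, 1)
    z = 2.0 * x + random.gauss(0, 1)
    y = -1.5 * z + random.gauss(0, 1)
    xs.append(x); zs.append(z); ys.append(y)

def corr(a, b):
    """Pearson correlation, stdlib only."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

r_xy, r_xz, r_zy = corr(xs, ys), corr(xs, zs), corr(zs, ys)
# Partial correlation of X and Y given Z
r_xy_z = (r_xy - r_xz * r_zy) / math.sqrt((1 - r_xz**2) * (1 - r_zy**2))

print(f"corr(X,Y)     = {r_xy:.3f}")    # strongly non-zero
print(f"corr(X,Y | Z) = {r_xy_z:.3f}")  # near zero, as the chain implies
```

Had the partial correlation come out far from zero, the chain model would be refuted; a near-zero value leaves the model compatible with the data, though of course it does not prove the model true.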

I am glad Igor raised the question that he did. There is a tendency to forget fundamentals, and it is healthy to rehearse them periodically.

– Judea
