On the Classification and Subsumption of Causal Models
From Christos Dimitrakakis:
>> To be honest, there is such a plethora of causal models, that it is not entirely clear what subsumes what, and which one is equivalent to what. Is there a simple taxonomy somewhere? I thought that influence diagrams were sufficient for all causal questions, for example, but one of Pearl’s papers asserts that this is not the case.
Reply from J. Pearl:
Dear Christos,
From my perspective, I do not see a plethora of causal models at all, so it is hard for me to answer your question in specific terms. What I do see is a symbiosis of all causal models in one framework, called the Structural Causal Model (SCM), which unifies structural equations, potential outcomes, and graphical models. So, for me, the world appears simple, well organized, and smiling. Perhaps you can tell us what models lured your attention and caused you to see a plethora of models lacking a subsumption taxonomy.
The taxonomy that has helped me immensely is the three-level hierarchy described in Chapter 1 of my book Causality: 1. association, 2. intervention, and 3. counterfactuals. It is a useful hierarchy because it has an objective criterion for the classification: You cannot answer questions at level i unless you have assumptions from level i or higher.
As to influence diagrams, the relation between them and SCM is discussed in Section 11.6 of my book Causality (2009). Influence diagrams belong to the 2nd layer of the causal hierarchy, together with Causal Bayesian Networks. They lack, however, two facilities:
1. The ability to process counterfactuals.
2. The ability to handle novel actions.
To elaborate,
1. Counterfactual sentences (e.g., Given what I see, I should have acted differently) require functional models. Influence diagrams are built on conditional and interventional probabilities, that is, p(y|x) or p(y|do(x)). There is no interpretation of E(Y_x| x’) in this framework.
2. The probabilities that annotate links emanating from Action Nodes are of the interventional type, p(y|do(x)), and must be assessed judgmentally by the user. No facility is provided for deriving these probabilities from data together with the structure of the graph. Such a derivation is developed in Chapter 3 of Causality, in the context of Causal Bayesian Networks, where every node can turn into an action node.
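To make point 2 concrete, here is a minimal sketch of one such derivation: the back-door adjustment, p(y|do(x)) = sum over z of p(y|x,z) p(z). It assumes a hypothetical three-node graph Z -> X, Z -> Y, X -> Y with binary variables; the graph, the numbers, and all function names are illustrative assumptions, not taken from the book. The point is only that p(y|do(x)) is computed from observational quantities plus the graph, and that it differs from p(y|x).

```python
from itertools import product

# Hypothetical model (an assumption for illustration): Z -> X, Z -> Y, X -> Y.
P_Z = {0: 0.5, 1: 0.5}                      # P(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}             # P(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,  # P(Y = 1 | X = x, Z = z)
                 (0, 1): 0.5, (1, 1): 0.7}


def joint(x, y, z):
    """Observational joint P(x, y, z) implied by the tables above."""
    px = P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]
    py = P_Y1_GIVEN_XZ[(x, z)] if y == 1 else 1 - P_Y1_GIVEN_XZ[(x, z)]
    return P_Z[z] * px * py


def prob(event):
    """Probability of an event, summing the joint over all matching outcomes."""
    return sum(joint(x, y, z)
               for x, y, z in product((0, 1), repeat=3)
               if event(x, y, z))


def p_y1_given_x(x):
    """Level 1 (association): P(Y = 1 | X = x)."""
    return prob(lambda a, b, c: a == x and b == 1) / prob(lambda a, b, c: a == x)


def p_y1_do_x(x):
    """Level 2 (intervention): P(Y = 1 | do(X = x)) by adjusting for Z."""
    total = 0.0
    for z in (0, 1):
        p_z = prob(lambda a, b, c: c == z)
        p_y_given_xz = (prob(lambda a, b, c: a == x and b == 1 and c == z)
                        / prob(lambda a, b, c: a == x and c == z))
        total += p_y_given_xz * p_z
    return total


if __name__ == "__main__":
    print("P(Y=1 | X=1)     =", round(p_y1_given_x(1), 4))
    print("P(Y=1 | do(X=1)) =", round(p_y1_do_x(1), 4))
```

In this toy model the two quantities disagree because Z influences both X and Y; only the graph tells us that Z is the right variable to adjust for.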
Using the causal hierarchy, the 1st Law of Counterfactuals and the unification provided by SCM, the space of causal models should shine in clarity and simplicity. Try it, and let us know of any questions remaining.
Judea
I suppose that ‘counterfactuals’ are the main sticking point. It is not entirely clear what a counterfactual should really mean. One way of parsing your intuition that “Given what I see, I should have acted differently” is to start thinking about different utility-maximising actions depending on the information set, for example.
But I’ll get back to you when I’ve learned some more.
Comment by Christos Dimitrakakis — September 19, 2016 @ 7:11 pm
Dear Christos,
Counterfactuals are natural conclusions of physical laws.
If you have a law like F = m a, and you see a force F1 resulting in acceleration a1,
you can conclude: Had the force been 2F1, the acceleration would have been 2a1.
Nothing fancy about that. Just interpreting what the law says.
Judea
Comment by Judea Pearl — September 20, 2016 @ 1:35 pm
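The F = m a argument in the comment above can be sketched in a few lines, following the three-step recipe for counterfactuals in Causality (abduction, action, prediction). The function name and the numbers are illustrative assumptions; the only modelling assumption is that the mass is the same in the observed and the hypothetical world.

```python
def counterfactual_acceleration(observed_force, observed_acceleration,
                                hypothetical_force):
    """Answer: 'had the force been hypothetical_force, what would a have been?'

    Uses the law F = m * a and assumes the mass m is unchanged between the
    observed situation and the hypothetical one.
    """
    if observed_acceleration == 0:
        raise ValueError("mass cannot be recovered when acceleration is zero")

    # Abduction: use the observation (F1, a1) to recover the unobserved mass.
    mass = observed_force / observed_acceleration

    # Action: replace the observed force by the hypothetical one.
    force = hypothetical_force

    # Prediction: apply the same law with the recovered mass.
    return force / mass


if __name__ == "__main__":
    f1, a1 = 6.0, 3.0
    print(counterfactual_acceleration(f1, a1, 2 * f1))
```

Doubling the force doubles the acceleration whatever the unobserved mass was, which is why the conclusion needs nothing beyond the law and the single observation.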
Dear Judea,
Many thanks for your prompt responses. I appreciate the help. Just one conceptual question: In reinforcement learning, we might have data collected from some policy acting on a Markov decision process and would like to speculate on the effect that a different policy would have had [1]. Would you consider that a counterfactual?
As a further note: if I take the P(y_x | x’, y’) example in the linked notes, I would differentiate between the random variables of the counterfactual and the ones observed. I would probably define a sequence of random variables X_t, Y_t and say: OK, I’ve observed values of X_1, …, X_{t-1} and Y_1, …, Y_{t-1}. What’s P(y_t | x_t, x_{t-1}, …, x_1, y_{t-1}, …, y_1)? This seems cleaner and avoids the confusion of having two different values for the same random variable, but I am not sure if it’s conceptually the same.
[1] The setup is that the policy P and the process M jointly result in a trajectory T, and we’d like to guess a distribution over trajectories for a new policy P’.
Comment by Christos Dimitrakakis — October 25, 2016 @ 12:44 am
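For readers unfamiliar with the setup in footnote [1], one common way to reason about a new policy P’ from trajectories collected under P is importance weighting. This is a standard reinforcement-learning estimator, swapped in here purely for illustration; it is not proposed anywhere in the thread, and the sketch takes no position on whether the resulting quantity is a counterfactual. The function names and the dictionary layout for policies are assumptions. Because both policies act on the same process M, the transition probabilities of M cancel in the ratio and only the action probabilities remain.

```python
from math import prod


def importance_weight(trajectory, behavior_policy, target_policy):
    """Ratio P'(trajectory) / P(trajectory) for a list of (state, action) pairs.

    Policies are dicts: policy[state][action] -> probability of that action.
    The dynamics of the process M cancel, so only action probabilities appear.
    """
    return prod(target_policy[s][a] / behavior_policy[s][a]
                for s, a in trajectory)


def off_policy_value(trajectories, returns, behavior_policy, target_policy):
    """Importance-sampling estimate of the expected return under the target policy."""
    weights = [importance_weight(t, behavior_policy, target_policy)
               for t in trajectories]
    return sum(w * g for w, g in zip(weights, returns)) / len(trajectories)


if __name__ == "__main__":
    # Hypothetical one-state problem with two actions.
    behavior = {"s": {"a": 0.5, "b": 0.5}}   # policy P that generated the data
    target = {"s": {"a": 0.9, "b": 0.1}}     # new policy P'
    data = [[("s", "a")], [("s", "b")]]      # two observed one-step trajectories
    returns = [1.0, 0.0]                     # return observed on each trajectory
    print(off_policy_value(data, returns, behavior, target))
```

The estimator requires that P gives nonzero probability to every action that P’ might take; otherwise the ratio is undefined and the data carry no information about those actions.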