### On Theories and Approaches: Discussion with Andrew Gelman

**Judea Pearl writes to Andrew Gelman about differences between Donald Rubin's and Pearl's approaches.**

Dear Andrew,

I think our discussion could benefit from the distinction between "theories" and "approaches." A theory T is a set of mathematical constraints on what can and cannot be deduced from a set of premises. An approach is what you do with those constraints: how you apply them, in what sequence, and in what language.

In the context of this distinction, I say that Rubin's theory T is equivalent to Pearl's. While the approaches are different, equivalence of theories means that there cannot be a clash of claims, and this is a proven fact. In other words, if there is ever a clash about a given problem, it means one of two things: either the theory was not applied properly, or one investigator assumed additional information about the problem that the other did not.

Now to the "approach." Below is my analysis of the two approaches; please check whether it coincides with your understanding of Rubin's approach.

Pearl says: let us start with the science behind each problem, e.g., coins, bells, seat-belts, smoking, etc. Our theory tells us that no causal claim can ever be issued if we know nothing about the science, even if we take infinite samples. Therefore, let us articulate what we do know about the science, however meager, and see what we can get out of the theory. This calls for encoding the relationships among the relevant entities (coins, bells, and seat-belts) in some language, call it L, thus creating a "problem description" L(P). L(P) contains variables, observed and unobserved factors, equations, graphs, physical constraints, processes, influences, lack of influences, dependencies, etc.: whatever is needed to encode our understanding of the science behind the problem P.

Now we are ready to solve the problem. We take L(P) and appeal to our theory T: Theory, theory on the wall, how should we solve L(P)? The theory says: Sorry, I don't speak L, I speak T.

What do we do? Pearl's approach says: take the constraints from T and translate them into new constraints, formulated in language L, thus creating a set of constraints L(T) that echo T and tell us what can and what cannot be deduced from certain premises encoded in L(P). Next, we deduce a claim C in L(P) (if possible), or we proclaim C to be "non-deducible." Done.

Rubin's approach is a bit different. We again look at a problem P but, instead of encoding it in L, we skip that part and translate P directly into a language that the theory can recognize; call it T(P). (It looks like P(W|X, Y_{1}, Y_{2}), according to Rubin's SIM article (2007).) Now we ask: Theory, theory on the wall, how should we solve T(P)? The theory answers: Easy, man! I speak T. So the theory produces a claim C in T, and everyone is happy.

To summarize, Pearl brings the theory to the problem, Rubin takes the problem to the theory.

To an observer from the outside the two approaches would look identical, because the claims produced are identical and the estimation procedures they dictate are identical. So, one should naturally ask, how can there ever be a clash in claims like the one concerning covariate selection?

Differences will show up when researchers begin to deviate from the philosophies that govern either one of the two approaches. For example, researchers might find it too hard to go from P to T(P). So hard, in fact, that they give up on thinking about P and appeal directly to the theory: Theory, theory on the wall, we don't know anything about the problem. Actually, we do know, but we don't feel like thinking about it. Can you deduce claim C for us?

If asked, the theory would answer: "No, sorry, nothing can be deduced without some problem description." But some researchers may not wish to talk directly to the theory; it is too taxing to write a story about coins and bells in the language of P(W|X, Y_{1}, Y_{2}). So what do they do? They fall into a lazy mode, like: "Use whatever routines worked for you in the past. If propensity scores worked for you, use them, and take all available measurements as predictors. The more the better." Lazy thinking forms subcultures, and subcultures tend to isolate themselves from the rest of the scientific community, because nothing could be more enticing than methods and habits, especially when they are reinforced by respected leaders, and especially when habits are supported by convincing metaphors. For example, how can you go wrong by "balancing" treated and untreated units on more and more covariates? Balancing, we all know, is a good thing to have; it is even present in randomized trials. So, how can we go wrong?

An open-minded student of such a subculture should ask: "The more the better? Really? How come? Pearl says some covariates might increase bias. And there should be no clash in claims between the two approaches." An open-minded student would also be so bold as to take a pencil and paper and consult the theory T directly, asking: "Do I have to worry about increased bias in my specific problem?" And the theory would answer: You might have to worry, yes, but I can only tell you where the threats are if you tell me something about the problem, which you refuse to do.

Or the theory might answer: If you feel so shy about describing your problem, why don't you use the Bayesian method? This way, even if you end up with an unidentified situation, the method would not punish you for not thinking about the problem; it would just produce a very wide posterior. The more you think, the narrower the posterior. Isn't this fair play?

To summarize:

One theory has spawned two approaches. The two approaches have spawned two subcultures. Culture-1 solves problems in L(P) by the theoretical rules of L(T) that were translated from T into L. Culture-2 avoids describing P, or thinking about P, and relies primarily on metaphors, convenience of methods, and gurus' advice.

Once in a while, when problems are simple enough (like the binary Instrumental Variable problem), someone from Culture-2 would formulate a problem in T and derive useful results. But, normally, problem-description avoidance is the rule of the day. So much so that even two-coins-one-bell problems are not analyzed mathematically by rank-and-file researchers; they are sent to the gurus for an opinion.

I admit that I was not aware of the capability of Bayesian methods to combine two subpopulations in which a quantity is unidentified and extract a point estimate of the average, when such an average is identified. I am still waiting for the bell-coins example worked out by this method; it would enrich my arsenal of techniques. But this would still not alter my approach, namely, to formulate problems in a language close to their source: human experience.

In other words, even if the Bayesian method is shown capable of untangling the two subpopulations, thus giving researchers the assurance that they have not ignored any data, I would still prefer to encode a problem in L(P), then ask L(T): Theory, theory on the wall, look at my problem and tell me if perhaps there are measurements that are redundant. If the answer is yes, I would save the effort of measuring them, and the increased dimensionality of regressing on them, and just get the answer that I need from the essential measurements. Recall that, even if one insists on going the Bayesian route, the task of translating a problem into T remains the same. All we gain is the luxury of not thinking in advance about which measurements can be avoided; we let the theory do the filtering automatically. I am now eager to see how this is done: two coins and one bell. Everyone knows the answer: coin-1 has no causal effect on coin-2, no matter if we listen to the bell or not. Let's see Rev. Bayes advise us correctly: ignore the bell.
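
For readers who want to check the two-coins-one-bell intuition numerically, here is a minimal sketch in Python. It rests on assumptions the letter does not state: both coins are fair and independent, and the bell rings whenever at least one coin lands heads. The function names are hypothetical, chosen for illustration. The sketch is a plain enumeration of the joint distribution, not the Bayesian analysis requested above; it only shows that comparing coin-2 across values of coin-1 gives no association when the bell is ignored, and a spurious one when the comparison is restricted to cases where the bell rang.

```python
from fractions import Fraction
from itertools import product


def joint():
    """Enumerate the four equally likely outcomes of two fair, independent coins.

    Assumption (not from the letter): the bell rings when at least one coin is heads.
    Each entry is (coin1, coin2, bell, probability), with 1 meaning heads / ringing.
    """
    outcomes = []
    for c1, c2 in product([0, 1], repeat=2):
        bell = int(c1 == 1 or c2 == 1)
        outcomes.append((c1, c2, bell, Fraction(1, 4)))
    return outcomes


def prob_coin2_heads(coin1_value, bell_value=None):
    """P(coin2 = heads | coin1 = coin1_value), optionally also given the bell."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for c1, c2, bell, p in joint():
        if c1 != coin1_value:
            continue
        if bell_value is not None and bell != bell_value:
            continue
        denominator += p
        if c2 == 1:
            numerator += p
    if denominator == 0:
        raise ValueError("conditioning event has probability zero")
    return numerator / denominator


def association(bell_value=None):
    """Difference in P(coin2 = heads) between coin1 = heads and coin1 = tails."""
    return prob_coin2_heads(1, bell_value) - prob_coin2_heads(0, bell_value)


print(association())   # ignoring the bell: 0
print(association(1))  # among cases where the bell rang: -1/2
```

Under these assumptions, ignoring the bell recovers the known answer (no dependence of coin-2 on coin-1), while conditioning on the bell produces a nonzero difference even though neither coin affects the other. Conditioning on a silent bell is not compared here, because under the assumed mechanism a silent bell forces both coins to tails.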