An interesting math and causality-minded club
from Adam Kelleher:
The math and algorithm reading group (http://www.meetup.com/Math-and-Algorithm-Reading-Group/) is based in NYC, and was founded when I moved here three years ago. It’s a very casual group that grew out of a reading group I was in during graduate school. Some friends who were math graduate students were interested in learning more about general relativity, and I (a physicist) was interested in learning more math. Together, we read about differential geometry, with the goal of bringing our knowledge together. We reasoned that we could learn more as a group, by pooling our different perspectives and experience, than we could individually. That’s the core motivation of our reading group: not only are we there to help each other get through the material if anyone gets stuck, but we’re also there to add what else we know (in the format of a group discussion) to the content of the material.
We’re currently reading Causality cover to cover. We’ve paused to implement some of the algorithms, and plan on pausing again soon for a review session. We intend to do a “hacking session”, to try our hands at causal inference and analysis on some open data sets.
Inspired by reading Causality, and realizing that the best open implementations of causal inference were packaged in the (old, relatively inaccessible) Tetrad package, I’ve started a modern implementation of some tools for causal inference and analysis in the causality package in Python. It’s on PyPI (pip install causality, or check the tutorial on http://www.github.com/akelleh/causality), but it’s still a work in progress. The IC* algorithm is implemented, along with a small suite of conditional independence tests. I’m adding some classic methods for causal inference and causal effects estimation, aimed at making the package more general-purpose. I invite new contributions to help build out the package. Just open an issue, and label it an “enhancement” to kick off the discussion!
Finally, to make all of the work more accessible to people without a more advanced math background, I’ve been writing a series of blog posts aimed at introducing anyone with an intermediate background in probability and statistics to the material in Causality! It’s aimed especially at practitioners, like data scientists. The hope is that more people, managers included (the intended audience for the first 3 posts), will understand the issues that come up when you’re not thinking causally. I’d especially recommend the article about understanding bias https://medium.com/@akelleh/understanding-bias-a-pre-requisite-for-trustworthy-results-ee590b75b1be#.qw7n8qx8d, but the whole series (still in progress) is indexed here: https://medium.com/@akelleh/causal-data-science-721ed63a4027#.v7bqse9jh
TETRAD source code in Java is available on GitHub: https://github.com/cmu-phil/tetrad. Nothing inaccessible about it. Rather than reinventing the wheel, why not beat the hell out of the algorithms already implemented, or find better ones? There has been a ton of work on causal search. Google it.
Clark Glymour
Comment by c glymour — September 15, 2016 @ 12:39 pm
I understand your frustration: some exceptional work has been done in Tetrad, and another package coming in probably looks like a bad job at competition. That’s not my goal at all.
I started working with causal inference only 3 years ago, when I started working as a Data Scientist. I realized that causal inference leads to the best achievable answer in certain contexts, and so I’ve actually found it useful in practice. Unfortunately, I’ve also found it difficult to use in practice. I could make specific, debatable complaints (N=1) about my own difficulties working with the Tetrad documentation, coming at it as a practitioner with a level of math sophistication that’s far beyond the average in the field, but there are probably less debatable issues I could focus on. Before outlining my main criticisms, I want to say that Tetrad really is an outstanding package, and trying to fix some of the issues I’ve had working with it is really what inspired me to develop a package of my own. The work you and others have done is nothing short of monumental, and I’d love to work with you on a tool that everyone would find useful, if you’re open to it. With that said, here are the main issues I’ve run into when using the package in practice:
(1) There are foundational issues with inference on multiple data types that apparently aren’t being properly addressed in the package. I experienced this when I first used it, and am still experiencing it after pulling the latest version today. Presumably, when you’re using continuous and discrete data together, you can discretize the continuous data and run a discrete conditional independence test (CIT). Unfortunately, this doesn’t work in practice. Try a Markov chain with discrete -> continuous -> discrete variables. Discretize the middle variable: you lose information that the middle variable contains about the far left variable, and so the discretized continuous variable no longer blocks the chain. This is a serious problem! It’s clear that the package’s development focus is on new and interesting search/simulation algorithms, and less on the boring things, like data pre-processing and CITs. I’ve tried to address this in my package by providing a new critical value for a chi^2 test on a discretized variable (using _slow_ kernel density estimation to give a null model for what independence looks like after discretization), but it’s a work in progress. It’s too slow. The basic takeaway is: if Tetrad does address this problem, it’s far from clear from reading the documentation how to do it.
(2) The package is written as a standalone package in Java. While I could import it into a Java program, it’s not _easy_, and it’s certainly not easy to build a production app in Python that, e.g., lets me query a causal graph, or do causal effects estimation on the fly. There’s a thriving Python and R ecosystem, and Tetrad has some basic bugs that reflect a lack of an ecosystem around it. Try loading in a *.csv file with an indexed, unlabeled first column.
(3) The language issue might seem superficial: you could port it over to Python like in py-causal, and achieve a similar effect, right? I don’t believe it’s quite that simple, and this is something I debated with a friend early on when I decided to build this package. You really have to think about the workflow of a data scientist, and how the package would be used and contributed to in a real context. This package isn’t like numpy, implementing basic math functionality; it’s a cutting-edge research tool, and will need active development at all levels. As usual, there’ll be a long-tailed distribution of interest in development of the package. Most users will want to drop data in and get answers out. The easier it is to contribute (by writing in a fast-to-develop language like Python), the more people will contribute.
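To make point (1) concrete, here is a minimal simulation sketch of the discrete -> continuous -> discrete chain. Everything in it is an assumption chosen for illustration (the noise levels, the cut point at 0.5, the bin width of 0.25, and the helper names simulate_chain and stratified_chi2); it is not code from Tetrad or from the causality package. It compares a stratified chi^2 statistic for X vs Z when conditioning on a two-bin version of the middle variable against the same statistic with much finer bins.

```python
import math
import random
from collections import defaultdict


def stratified_chi2(xs, zs, strata):
    """Sum 2x2 chi-square statistics for X vs Z within each stratum.

    Returns (statistic, degrees_of_freedom). Strata with an empty row or
    column margin carry no information and are skipped.
    """
    tables = defaultdict(lambda: [[0, 0], [0, 0]])
    for x, z, s in zip(xs, zs, strata):
        tables[s][x][z] += 1

    stat, dof = 0.0, 0
    for t in tables.values():
        rows = [t[0][0] + t[0][1], t[1][0] + t[1][1]]
        cols = [t[0][0] + t[1][0], t[0][1] + t[1][1]]
        n = rows[0] + rows[1]
        if 0 in rows or 0 in cols:
            continue
        for i in range(2):
            for j in range(2):
                expected = rows[i] * cols[j] / n
                stat += (t[i][j] - expected) ** 2 / expected
        dof += 1  # each usable 2x2 table contributes one degree of freedom
    return stat, dof


def simulate_chain(n=20000, seed=0):
    """Draw X (binary) -> Y (continuous) -> Z (binary); parameters are illustrative."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        x = rng.randint(0, 1)
        y = x + rng.gauss(0, 1)                # Y depends only on X
        z = int(y + rng.gauss(0, 0.5) > 0.5)   # Z depends only on Y
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs


xs, ys, zs = simulate_chain()

# Coarse discretization: two bins. Fine discretization: bins of width 0.25.
coarse = [int(y > 0.5) for y in ys]
fine = [math.floor(y / 0.25) for y in ys]

coarse_stat, coarse_dof = stratified_chi2(xs, zs, coarse)
fine_stat, fine_dof = stratified_chi2(xs, zs, fine)

print("coarse bins: chi2 = %.1f on %d dof" % (coarse_stat, coarse_dof))
print("fine bins:   chi2 = %.1f on %d dof" % (fine_stat, fine_dof))
```

With two bins, X still tells you where Y sits inside its bin, so the statistic per degree of freedom should come out far larger than with fine bins, even though X and Z are independent given the true Y. Finer bins reduce the leak but thin out each stratum, which is the trade-off a corrected critical value is trying to handle.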
As for workflow, look at sklearn for what a good API might look like: they have a simple model.fit(X[,y]) method that takes data in a matrix (or dataframe!), X, and all of the work is done “under the hood”. Tetrad is far from this level of usability, and I think that’s one of the barriers to broader adoption and exploration.
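As a rough sketch of what that kind of interface could look like for causal search, here is a toy estimator with a single fit(X) call. The class name SkeletonSearch, its threshold parameter, and the edges_ attribute are all hypothetical; this is not the API of the causality package, Tetrad, or sklearn. Under the hood it only drops an edge when a marginal or first-order partial correlation falls below a fixed threshold, which is a crude stand-in for a real conditional independence test.

```python
import math
import random
from itertools import combinations


def _corr(a, b):
    """Pearson correlation of two equal-length lists of floats."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


class SkeletonSearch:
    """Hypothetical sklearn-style estimator for an undirected causal skeleton."""

    def __init__(self, threshold=0.1):
        # Correlations smaller than this in absolute value count as "independent".
        self.threshold = threshold

    def fit(self, X):
        """X maps column name -> list of floats. Returns self, like sklearn."""
        names = sorted(X)
        r = {frozenset(p): _corr(X[p[0]], X[p[1]])
             for p in combinations(names, 2)}

        edges = set()
        for a, b in combinations(names, 2):
            r_ab = r[frozenset((a, b))]
            independent = abs(r_ab) < self.threshold
            # Try conditioning on each other single variable in turn.
            for c in names:
                if independent:
                    break
                if c in (a, b):
                    continue
                r_ac = r[frozenset((a, c))]
                r_bc = r[frozenset((b, c))]
                partial = (r_ab - r_ac * r_bc) / math.sqrt(
                    (1 - r_ac ** 2) * (1 - r_bc ** 2))
                independent = abs(partial) < self.threshold
            if not independent:
                edges.add((a, b))

        self.edges_ = edges
        return self


# Usage: simulate a linear chain a -> b -> c and recover its skeleton.
rng = random.Random(0)
a = [rng.gauss(0, 1) for _ in range(5000)]
b = [x + rng.gauss(0, 1) for x in a]
c = [y + rng.gauss(0, 1) for y in b]

model = SkeletonSearch(threshold=0.1).fit({"a": a, "b": b, "c": c})
print(sorted(model.edges_))
```

The point of the sketch is the shape of the call, not the search itself: the user hands over named columns, calls fit once, and reads the result off an attribute. Any real test or search algorithm could sit behind the same method.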
Finally, production apps are very often written in Python, and the more cross-language dependencies there are, the more room there is for complications when writing production applications.
I have great hope for what observational causal inference can do once it’s released in a way that’s easy to use for even junior data scientists. Consider the sheer number of people who would be using it, and the number of data sets it would be operating on. The technology for doing this kind of inference could advance _rapidly_ if we just make it ridiculously easy to use. If we get people to think in terms of causal graphs, we’ll be doing an enormous amount of good by that alone. That summarizes the main goal of my package: making causal inference ridiculously easy for data scientists. Tetrad really is an amazing package, but its documentation could use some work, and it’s not really built for the data science community.
Finally, it could very well be the case that porting the code over from Java to Python makes the interface good enough to attract so many people that all you need is the really dedicated contributors among them to refine/support the package. It looks like py-causal is off to a great start! It began a month after I started my package; it may have influenced my starting it if it had been a month earlier!
Comment by Adam Kelleher — September 21, 2016 @ 1:50 pm