big data, statistics

Honey, I shrunk the statistician

In my Ph.D. program, I learned all about how to analyze small data. I learned rules of thumb for how much data I needed to run a particular analysis, and what to do if I didn’t have enough. I worked with what seemed now (and even seemed then) to be toy data sets, except they weren’t toys, because when you’re running a psychological experiment you might be lucky to have 30 participants and when you’re analyzing the whole world’s math performance (think TIMSS), you can do it with a data set less than a gigabyte in size.

image courtesy Victorian Web

I did some program evaluation while I finished my degree and sometimes colleagues would lament, “we don’t have enough data!” Usually, we did. We could thank Sir Ronald Aylmer Fisher for that. He worked with small data in agricultural settings and gave us the venerable ANOVA technique, which works fine with just a handful of cases per group (assuming balanced group sizes, normality, and homoscedasticity). Maybe we might give a nod to William Sealy Gosset, too, for introducing Student’s t distribution, which helps when the Central Limit Theorem hasn’t kicked in yet and brought us to normality.

But Sir Ronald and Student can’t help me now. I’m down a rabbit hole… in some sort of web-scale wonderland of big data. I feel like Alice after drinking the magic potion, too small to reach the key on the table. The data is so much broader and bigger than I am, so much broader and bigger than my puny methods and my puny desktop R environment that wants to suck everything into memory in order to analyze it.

I stay awake at night thinking how to analyze all this data and deliver on its promise, how to analyze across schools and courses and so many, many students, not to mention all their clickstreams. How can I get through the locked door and experience the rest of wonderland when I’m so small and the data’s so big? I could sample the data, I think, and then I’d be in the realm where I’m comfortable, dealing with sampling distributions and generalizing to a population and applying the small-data methods I know already. Perhaps I can extract it by subset — by subsets of like courses, perhaps, or by school (I’m doing that already–not scalable and doesn’t address some of the most interesting questions). What about trying out Revolutions’ big data support for R? Or maybe I can apply haute big-data techniques: Hadoop-ify it (Hive, Pig, HBase???) then use simplistic (embarrassingly parallel) algorithms with MapReduce. Problem is, none of the methods I like to use and seem appropriate for educational settings (multilevel modeling for example) are easily parallelized. I’m stumped.

It’s okay to be stumped, I think — part of creation is living with uncertainty:

Most people who are not consummate creators avoid tension. They want quick answers. They don’t like living in the realm of not knowing something they want to know. They have an intolerance for those moments in the creative process in which you have no idea how to get from where you are to where you want to be. Actually, this is one of the very best moments there are. This is when something completely original can be born, when you go beyond your usual ways of addressing similar situations, where you can drive the creative process into high gear. [Robert Fritz on supercharging the creative process]

Alice ate some cake that made her bigger. Is there some cake that will make me and my methods big enough to answer the questions I want answered? For now I’m in the realm of not knowing but I hope in 2012 I will have some answers: first, answers about how to make myself big again, and second, answers from the data.