Links for February 24, 2012

Cognitive inequality [The Economist Free Exchange].

is this an iron rule of innovation in information technology—that the cheaper information becomes and the easier it becomes to manipulate it the greater will be the gap, productive and otherwise, between the informationally capable and the rest? …

We might well be in an initial phase of the information age in which technology amplifies cognitive gaps which gives way to a period in which technology mutes those gaps.

Our greedy colleges 2.0 [Andrew Gillen/Inside Higher Ed]. The Bennett Hypothesis says that increases in federal financial aid subsidies enable colleges to raise their tuition without concern for what students can actually afford. Study described here found that aid directed to low-income students is less likely to lead to tuition increases compared to aid directed at relatively affluent students.

A modeled student [Cathy O'Neil/mathbabe]. Do systems that recommend courses and majors for students reinforce discrimination?

Economics of the cold start problem in talent discovery [John Horton/Online Labor]. Novices can’t get hired if their talent won’t be revealed until after they get hired. Some empirical evidence. One possible help: “talent revealing sites like StackOverflow and Github as replacements for traditional resumes.”

Links for February 22, 2012

Elizabeth Gilbert on What the Porcupine Dilemma Can Teach Us About the Secret of Happiness [Maria Popova/Brain Pickings]. Elizabeth Gilbert on Schopenhauer’s porcupines. Staying warm without impaling yourself on someone else’s spines.

Target, Pregnancy, and Predictive Analytics, Part II [Dean Abbott/Data Mining and Predictive Analytics]. The Target story was interesting for what it says about the possibilities and perils of analytics. This was my favorite writeup, for its overview of how to succeed with data analysis:

1) understand the data,
2) understand why the models are focusing on particular input patterns,
3) ask lots of questions (why does the model like these fields best? why not these other fields?)
4) be forensic (now that's interesting or that's odd...I wonder...),
5) be prepared to iterate, (how can we predict better for those customers we don't characterize well)
6) be prepared to learn during the modeling process

We have to "notice" patterns in the data and connect them to behavior. This is one reason I like to build multiple models: different algorithms can find different kinds of patterns. Regression is a global predictor (one continuous equation for all data), whereas decision trees and kNN are local estimators.
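Here is a minimal sketch of that global-versus-local distinction, using only the Python standard library. The step-shaped data and the helper names (fit_line, knn_predict) are made up for illustration and are not from Abbott's post. A single least-squares line has to compromise across the whole range, while kNN only looks at the neighbors of the point being predicted.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def knn_predict(xs, ys, x_new, k=3):
    """Average the y values of the k training points nearest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return mean(y for _, y in nearest)

# Toy data with a step (an assumption for illustration):
# y is 0 for x < 5 and 10 for x >= 5.
xs = list(range(10))
ys = [0.0 if x < 5 else 10.0 for x in xs]

# Global: one equation for all the data, so the step gets smeared into a slope.
intercept, slope = fit_line(xs, ys)
line_at_1 = intercept + slope * 1

# Local: predictions depend only on nearby points, so the step survives.
knn_at_1 = knn_predict(xs, ys, 1, k=3)
knn_at_8 = knn_predict(xs, ys, 8, k=3)
```

On this toy data the line predicts a slightly negative value at x = 1 even though every nearby observation is 0, while kNN returns the local level on each side of the step. On smooth, roughly linear data the comparison would favor the global model instead, which is the argument for trying both.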

You Are Responsible for Getting Your Ideas to Spread [Tim Kastelle/Innovation Leadership Network]. Don’t blame the customer if your idea isn’t compelling; that’s a failure of your idea or your communication of it.

Machine Learning for Hackers [Review from David Smith/Revolution Analytics blog]. Sounds like a book I need to order.

Rather than merely providing a “cookbook” approach to say, building a “who to follow” recommendation system for Twitter, it takes the time to explain the methodology behind the algorithms and give the reader a better basis for understanding why these methods work (and, equally importantly, how they can go wrong).

What’s new? Exuberance for novelty has benefits [John Tierney/The New York Times]. In a longitudinal study, people who combined novelty-seeking with persistence and “self-transcendence” showed the most success over the years (good health, lots of friends, few emotional problems, greatest satisfaction with life).

So you call yourself a data scientist?

Hilary Mason (in Glamour!)

I just watched this video of Hilary Mason* talking about data mining. Aside from the obvious thoughts of what I could have done with my life if (1) I had majored in computer science instead of philosophy/economics and (2) hadn’t spent all of the zeroes having babies, buying/selling houses, and living out an island retirement fantasy thirty years before my time, I found myself musing about her comments on the “data scientist” term. She said she’s gotten into arguments about it. I guess some people think it doesn’t really mean anything — it’s just hype — who needs it? Someone’s a computer scientist or a statistician or a business intelligence analyst, right? Why make up some new name?

I dunno, I rather like the term. My official title at work is “data scientist” — thank you to my management for that — and it seems more appropriate than statistician or business intelligence analyst or senior software developer or whatever else you might want to call me. The fact is, I do way more than statistical analysis. I know SQL all too well and (as my manager knows from my frequent complaints) spend 75% + of my time writing extract-transform-load code. I use traditional statistical methods like factor analysis and logistic regression (heavily) but if needed I use techniques from machine learning. I try to keep on top of the latest online learning research and I incorporate that into our analytics plans and models. Lately I’ve been spending time looking at what sort of big data architectures might support the scale of analytics we want to do. I don’t just need to know what statistical or ML methods to use — I need to figure out how to make them scalable and real-time and — this is critical — useful in the educational context. That doesn’t sound like pure statistics to me, so don’t just call me a statistician**.

I do way more than data analysis and I’m capable of way more, thanks to my meandering career path that’s taken me from risk assessment (heavy machinery accident analysis at Failure Analysis now Exponent) to database app development (ERP apps at Oracle) to education (AP calculus and remedial algebra teaching at the Denver School of Science and Technology) and now to Pearson (online learning analytics). I earned a couple of degrees in mathematical statistics and applied statistics/research design/psychometrics meanwhile. 

Drew Conway's Venn diagram of data science

None of what I did made sense at the time I was wandering the path — and yet it all adds up to something useful and rare in my current position. Data science requires an alchemistic mixture of domain knowledge, data analysis capability, and a hacker’s mindset (see Drew Conway’s Venn diagram of data science reproduced here). Any term that only incorporates one or two of these circles doesn’t really capture what we do. I’m an educational researcher, a statistician, a programmer, a business analyst. I’m all these things.

In the end, I don’t really care what you call me, so long as I get the chance to ask interesting questions, gather the data to answer them, and then give you an answer you can use — an answer that is grounded in quantitative rigor and human meaning.


*Yes, I do have a girl-crush on Hilary. I think she’s awesome.

** Also, my kids cannot seem to pronounce the word “statistician.” I need a job title they can tell people without stumbling over it. I hope to inspire them to pursue careers that are as rewarding and engaging, intellectually and socially, as my own has been.

Getting ready for connected learning

Here’s a cool idea: the web enables a connectivist learning style based on network navigation, where “learning is the process of creating connections and developing a network.” Seems to me before you can learn connectedly, though, you need to first learn in more socially and contextually constrained ways.

Background: Three generations of distance education pedagogies

In this week’s Learning Analytics 2012 (LAK12) web session, Dragan Gasevic pointed us at an interesting paper describing three generations of distance education: cognitive-behaviorist, social constructivist, and connectivist. From Anderson and Dron (2011):

Anderson and Dron did not claim that the connectivist model would replace the cognitive-behaviorist or social-constructivist models but said that “all three current and future generations of [distance education] pedagogy have an important place in a well-rounded educational experience.”

These three models co-exist online today

LAK12 is itself an example of a course built in the connectivist paradigm, but just because a course is massive, open, and online doesn’t mean that it’s connectivist. For example, the Stanford machine learning class offered last fall was a (very effective) example of a cognitive-behaviorist approach. Students watched videos on their own schedule. Regular quizzes and homework assignments checked understanding. Andrew Ng was content creator and sage on the stage. While there was a Q&A forum available, the course design did not rely on it. A student could use it or not.

Typical online college courses today are often built in the social-constructivist mode, with instructors seeking to design and run courses that encourage many-to-many engagement through discussion threads and group projects. Does the addition of social features drive learning? It seems to be an article of faith among instructional designers today that it does. I’m not up on the research so I can’t say — but I can say that in online courses I’ve reviewed and taken, I don’t see evidence that social features have been designed in such a way that they make a difference in learning.

When are the different approaches useful?

I am thinking that whether a cognitive-behaviorist or constructivist or connectivist approach is best depends upon the preparation and goals of the learner. Maybe something like this:

I suspect that a student needs to gain basic grounding and fluency in a subject before constructivist approaches will be useful. An elementary schooler needs to learn to read and write and do arithmetic before you can do a group science project, for example. And it seems like a connectivist approach will be most effective once you already have some intermediate and contextual knowledge of a subject before trying to navigate out from it.

What do you think? When are cognitive-behaviorist vs. social constructivist vs. connectivist approaches to learning most useful? Do you think you need to have achieved a certain level of contextual and subject knowledge before connected learning is effective?

Links for January 20, 2012

Big data market survey: Hadoop solutions [Edd Dumbill/O'Reilly Radar].

Apache Hadoop is unquestionably the center of the latest iteration of big data solutions. At its heart, Hadoop is a system for distributing computation among commodity servers. It is often used with the Hadoop Hive project, which layers data warehouse technology on top of Hadoop, enabling ad-hoc analytical queries.

I’m starting my first ever project with Hadoop this week–a prototype of an analytics warehouse using Amazon Elastic MapReduce. Colleagues have told me EMR is a great way to get your head around Hadoop-based data processing.

CBO Report: Medicare pilot programs don’t control health-care costs [Megan McArdle/The Atlantic blogs]. McArdle describes what happened with a housing-project demolition program whose pilot studies suggested much better effects than were actually seen at scale:

The initial study was small and involved highly screened people with a lot of support. And it seems to have suffered from publication bias–the most spectacular results got the most attention, even though these might just have been outliers.

This is distressingly common–not just in government or social-do-gooding research, but in organizations of all kinds–including corporations.

Programs at scale often don’t show results as good as pilot studies of those programs. More generally in program evaluation, it’s hard to find evidence of strong (or even weak) effects of interventions. Social systems are complex; factors other than those targeted by the intervention often determine outcomes. This is something I need to communicate regularly to my colleagues and our partners–student learning is largely determined by factors other than what we have control over. That’s not to say we shouldn’t improve our course design, teaching practices, and so forth but it is to say that there aren’t many easy pickings out there for improving student outcomes.

For-profits vs not-for-profits [Felix Salmon/Reuters blog].

I know full well that a lot of not-for-profit organizations are run in a dreadful fashion; I’m just not convinced that introducing a profit motive is always or even often the best way to fix that problem…. I very much doubt that for-profit education is ever a good idea. I just don’t see how the incentives there could possibly be aligned.

But the profit motive can’t provide optimal outcomes if there isn’t consumer discipline along with it. For-profit higher education is subsidized by the government in the form of grants and low-interest loans (and note that nonprofit education is subsidized in additional ways as well, in the case of public institutions). Would-be students do not have an incentive to seriously evaluate whether the education they are purchasing is worth what they pay, because there is a third-party payer involved. The situation is much like health care. Good discussion in post of the issues and controversy over for-profit higher education.

Links for January 15, 2012

The rise of the new group think [Susan Cain/New York Times].

Virtually all American workers now spend time on teams and some 70 percent inhabit open-plan offices, in which no one has “a room of one’s own.” During the last decades, the average amount of space allotted to each employee shrank 300 square feet, from 500 square feet in the 1970s to 200 square feet in 2010….

Privacy also makes us productive. In a fascinating study known as the Coding War Games, consultants Tom DeMarco and Timothy Lister compared the work of more than 600 computer programmers at 92 companies. They found that people from the same companies performed at roughly the same level — but that there was an enormous performance gap between organizations. What distinguished programmers at the top-performing companies wasn’t greater experience or better pay. It was how much privacy, personal workspace and freedom from interruption they enjoyed. Sixty-two percent of the best performers said their workspace was sufficiently private compared with only 19 percent of the worst performers. Seventy-six percent of the worst programmers but only 38 percent of the best said that they were often interrupted needlessly.

I work in an open-plan office and I rather like it, mainly because my coworkers are fun and because my clean, small, mostly quiet work area is such a nice change from my sprawling, messy, mostly noisy house. We work on a puzzle together when we’re taking a break from work and wear headphones when we want uninterrupted time. I wonder, though, if I’d be more productive with a private office or even a cubicle. I don’t achieve flow as much as I’d like at work. Not sure if that’s because the job is relatively new to me or because the work environment is an obstacle.

Hume, causation & science [Barry Ritholtz/The Big Picture]. “We humans love a grossly over-simplified narrative.” Determining when we can attribute causation to a correlation is one of the major challenges of research design and statistical analysis.

How to work from home like you mean it [Kevin Purdy/Fast Company]. I’m thinking of working one day a week at home to achieve some of that flow I’ve been missing. If I do, I’ll follow some of these tips so it doesn’t devolve into eight hours of Internet surfing.

Lack of interest and aptitude keeps students out of STEM majors [Olga Khazan/Washington Post On Small Business blog]. “A study released this week by Georgetown University’s Center on Education and the Workforce found that recent graduates in computer science, mathematics and engineering all had unemployment rates below 9 percent (with the rates dropping below 6 percent among those who had some experience.) Conversely, the rates for graduates in architecture and the arts were 13.9 and 11.1 percent, respectively.”

What is college for? (Part 2) [Gary Gutting/The New York Times].

Concretely, students graduating from high school should, to cite one plausible model, be able to read with understanding classic literature (from, say, Austen and Browning to Whitman and Hemingway) and write well-organized and grammatically sound essays; they should know the basic outlines of American and European history, have a good beginner’s grasp of at least two natural sciences as well as pre-calculus mathematics, along with a grounding in a foreign language.

Students with this sort of education would be excellent candidates for many satisfying and well-paying jobs in, for example, sales and service industries, except for those that require highly specialized skills. From the standpoint of employment, high school graduates would have no need of college unless they wanted to be accountants or engineers, pursue pre-professional programs leading to law or medical school or train for doctoral work in science or the humanities. Apart from this, the only good reason they would have for going to college would be for its intellectual culture.

Compelling idea, but seems unlikely to happen because (1) our high schools are mostly incapable of providing such an education and (2) our culture is overly invested in the idea of college as the basic ticket to success in today’s economy. E.g.: D.C. may require college application for all [Joanne Jacobs].

How data science is like magic

In The Magicians[1], Lev Grossman describes magic as it might exist, but he could as well be describing the real-world practice of statistical analysis or software development:

As much as it was like anything, magic was like a language. And like a language, textbooks and teachers treated it as an orderly system for the purposes of teaching it, but in reality it was complex and chaotic and organic. It obeyed rules only to the extent that it felt like it, and there were almost as many special cases and one-time variations as there were rules. These Exceptions were indicated by rows of asterisks and daggers and other more obscure typographical fauna which invited the reader to peruse the many footnotes that cluttered up the margins of magical reference books like Talmudic commentary.

It was Mayakovsky’s [the teacher's] intention to make them memorize all these minutiae, and not only to memorize them but to absorb and internalize them. The very best spellcasters had talent, he told his captive, silent audience, but they also had unusual under-the-hood mental machinery, the delicate but powerful correlating and cross-checking engines necessary to access and manipulate and manage this vast body of information. (p149)

To be a good data scientist, whether using traditional statistical techniques or machine learning algorithms (or both), you must know all the rules and approach it first as an orderly system. Then you begin to learn all the special cases and one-time variations and you study and study and practice and practice until you can almost unconsciously adjust to each unique situation that arises.

When I took ANOVA in my Ph.D. program, I could hardly believe there was an entire course devoted to it. But it was much like Grossman’s description above. Each week we learned new special cases and one-time variations. I did ANOVA in so many different Circumstances that now I have absorbed and internalized its application as well as the design of studies that would usefully be analyzed with it or with some more flexible variation of it (e.g., hierarchical linear modeling). It felt cookbook at the beginning, but at the end of the course, I felt like I’d begun to develop that “unusual under-the-hood mental machinery” that Grossman suggested an effective magician in his imagined world would need.

That’s not to say that there aren’t important universal principles and practices and foundational knowledge to understand if you are to be an effective statistician or data miner or machine learning programmer; it’s not to say that awareness of Circumstances and methodical practice are all you need. It is to say that data science is ultimately a practice not a philosophy and you reach expertise in it through doing things over and over again, each time in slightly different ways.

In The Magicians, protagonist Quentin practices Legrand’s Hammer Charm, under thousands of different Circumstances:

Page by page the Circumstances listed in the book became more and more esoteric and counterfactual. He cast Legrand’s Hammer Charm at noon and at midnight, in summer and winter, on mountaintops and a thousand yards beneath the earth’s surface. He cast the spell underwater and on the surface of the moon. He cast it in early evening during a blizzard on a beach on the island of Mangareva, which would almost certainly never happen since Mangareva is part of French Polynesia, in the South Pacific. He cast the spell as a man, as a woman, and once–was this really relevant?–as a hermaphrodite. He cast it in anger, with ambivalence, and with bitter regret. (pp150-151)

Sometimes I feel like I have fit logistic regression in all these situations (perhaps not as a hermaphrodite). The next logistic regression I fit, I will say to myself “Wax on, wax off” as Quentin did when faced with a new spell that he had to practice according to each set of Circumstances.


[1]Highly recommended, but with caveats. Read it last summer — loved it — sent it to my 15-year-old son at camp. He loved it too and bought me the sequel for Christmas. After reading the second one, I had to re-read the first. It’s a polarizing book. Don’t pick it up if you are offended by heavy drinking, gratuitous sex, and a wandering plot. Do pick it up if you felt like your young adulthood was marked by heavy drinking, gratuitous sex, a wandering plot, and not nearly enough magic. My son tends to read adult books so I didn’t hesitate to share it with him, but it probably would not be appropriate for most teenagers.

Links for January 7, 2012

Nutrition advice: The vitamin D-lemma [Amy Maxmen/Nature]. “The difficulty of distilling strong advice from weak evidence.” This is a key challenge for researchers/statisticians/data scientists in any domain, not just in health.

Will Amazon offer analytics as a service? [Quentin Hardy/Bits]. Interesting to get an idea what that might look like. I don’t think, though, this would compete with SAS and similar software as the post implies. Would someone looking to implement a product recommendation engine implement it in SAS? Probably not. For example, Google is said to use R for model exploration and prototyping, then put the models into production using Python or C++. I feel a “choosing your analytics tool” post coming on.

Community college budget cuts drive students to for-profit school [Chris Kirkham/Huffington Post]. Balanced coverage of why students turn to for-profit schools and the pros and cons of such choices. My observation: community college tuition is artificially low due to government subsidization while for-profit tuition is artificially high, again because of government interference (in the form of financial aid). No market forces to bring about a reasonable balance between supply and demand. The big losers are students (and taxpayers).

Benchprep is codecademy for any subject, high school to med school [Josh Constine/TechCrunch]. “Eventually, publishers might get a clue that interactive digital education is going to destroy their paper book business. If they’re smart they’ll start developing their own courses or raise licensing fees. Until then though, BenchPrep will be the savior of anyone frustrated by the static book-learning experience.” I’m pretty certain some big textbook publishers see that already.

Forget dieting, try intermittent fasting [Josh Ozersky/Time Ideas]. “And that’s why instead of eating healthier, I’m going for longer stretches without eating so I can actually enjoy a whole meal. I don’t starve myself; I drink a protein shake if I get hungry and consume endless glasses of diet iced tea. People tell me this is bad, that I will soon gain back all the weight I’ve lost – and these rejoinders are always given with a smug malice, as if the people uttering them actually despise me for trying to compensate for the pleasures of the plate.”

I fast most days at work until about 2 or 3 pm, then have a small snack. I eat whatever I want once I get home from work around 5 pm. I find this allows me to eat generally what I want while maintaining my weight at a level I’m happy with. I have found, like Josh, that people get really upset about this plan, almost offended that I would eat this way. Funny how everyone thinks they know what is healthy and what is not, despite the difficulties in determining that (see first link in this post).

Links for December 30, 2011

Yes, and… [W.P. McNeill/Corner Cases]. Living by the “yes, and” ethos of improvisational comedy. Always build on what the other person said–stay open to their insight and direction. Be a pliable weed not a concrete pylon. Don’t get mired in dogma. I’m thinking this would work equally well in interactions with coworkers as with kids.

College has been oversold [Alex Tabarrok/Marginal Revolution]. The total number of students graduating from college is way up, but the numbers graduating with STEM degrees haven’t increased. That’s bad for individuals and bad for the economy. “An argument can be made for subsidizing students in fields with potentially large spillovers, such as microbiology, chemical engineering, nuclear physics and computer science. There is little justification for subsidizing sociology, dance and English majors.”

You have to break connections to get your ideas to spread [Tim Kastelle/Innovation Leadership Network]. Innovation requires disruption. "When you come up with a great new idea, you need to think about this economic network in two ways. The first is: how can I connect to all of the complementary parts of the economy that are needed to get my idea to work? The second is: if I’m going to get my idea to spread, which of these existing connections need to be broken?"

The second economy [W. Brian Arthur/McKinsey Quarterly]. We are in the process of building out the economy’s neural system, what Arthur calls “the second economy” growing up alongside the first economy, the industrial economy. Downside: loss of jobs as computers take over.

Selecting amongst large classes of models [Brian D. Ripley] (pdf). We have the data and the computational resources to “trawl through literally thousands of models (and in some cases many more).” How to pick among them? A subject I intend to learn a lot more about in 2012.

Curing the big data storage fetish [Dan Woods/Forbes]. “One popular way to express lust for big data for its own sake is to create a gargantuan Hadoop cluster.” Not enough to just store the data, need to build a data-driven culture. “But how do you create a company culture like CapitalOne or Google or eBay or Zynga or LinkedIn, where data is essentially part of the management team? At all of these companies there are data scientists, the elite professionals, but there are also swarms of data enthusiasts, people who are eager to use data to help do their jobs better.”

Honey, I shrunk the statistician

In my Ph.D. program, I learned all about how to analyze small data. I learned rules of thumb for how much data I needed to run a particular analysis, and what to do if I didn’t have enough. I worked with what seem now (and even seemed then) to be toy data sets, except they weren’t toys, because when you’re running a psychological experiment you might be lucky to have 30 participants and when you’re analyzing the whole world’s math performance (think TIMSS), you can do it with a data set less than a gigabyte in size.

image courtesy Victorian Web

I did some program evaluation while I finished my degree and sometimes colleagues would lament, “we don’t have enough data!” Usually, we did. We could thank Sir Ronald Aylmer Fisher for that. He worked with small data in agricultural settings and gave us the venerable ANOVA technique, which works fine with just a handful of cases per group (assuming balanced group sizes, normality, and homoscedasticity). Maybe we might give a nod to William Sealy Gosset, too, for introducing Student’s t distribution, which helps when the Central Limit Theorem hasn’t kicked in yet and brought us to normality.
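To make "a handful of cases per group" concrete, here is a rough sketch of the one-way ANOVA F statistic computed by hand in Python. The three groups of three scores are invented, and one_way_anova_f is a hypothetical helper name, not a library function. It stops at the F ratio; turning that into a p-value needs the F distribution, which the standard library does not provide.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F ratio for a list of groups of scores."""
    grand_mean = mean(x for g in groups for x in g)
    # Between-group sum of squares: how far group means sit from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = sum(len(g) for g in groups) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up scores: three balanced groups with only three cases each.
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
f_stat = one_way_anova_f(groups)
```

With these numbers the between-group mean square is 19 and the within-group mean square is 1, so F comes out at 19 on 2 and 6 degrees of freedom. The assumptions named above (balance, normality, homoscedasticity) still have to hold for that ratio to mean anything.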

But Sir Ronald and Student can’t help me now. I’m down a rabbit hole… in some sort of web-scale wonderland of big data. I feel like Alice after drinking the magic potion, too small to reach the key on the table. The data is so much broader and bigger than I am, so much broader and bigger than my puny methods and my puny desktop R environment that wants to suck everything into memory in order to analyze it.

I stay awake at night thinking how to analyze all this data and deliver on its promise, how to analyze across schools and courses and so many, many students, not to mention all their clickstreams. How can I get through the locked door and experience the rest of wonderland when I’m so small and the data’s so big? I could sample the data, I think, and then I’d be in the realm where I’m comfortable, dealing with sampling distributions and generalizing to a population and applying the small-data methods I know already. Perhaps I can extract it by subset — by subsets of like courses, perhaps, or by school (I’m doing that already–not scalable and doesn’t address some of the most interesting questions). What about trying out Revolutions’ big data support for R? Or maybe I can apply haute big-data techniques: Hadoop-ify it (Hive, Pig, HBase???) then use simplistic (embarrassingly parallel) algorithms with MapReduce. Problem is, none of the methods I like to use and seem appropriate for educational settings (multilevel modeling for example) are easily parallelized. I’m stumped.
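A small sketch of what "embarrassingly parallel" buys you, in plain Python rather than actual Hadoop: each chunk of data is boiled down to a few sums independently (the map step), and the partial sums are then added together (the reduce step). The chunks, the numbers, and the function names summarize and combine are all assumptions for illustration.

```python
from functools import reduce

def summarize(chunk):
    """Map step: reduce one chunk of scores to (count, sum, sum of squares)."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))

def combine(a, b):
    """Reduce step: merge two partial summaries by adding them elementwise."""
    return tuple(x + y for x, y in zip(a, b))

# Hypothetical chunks, e.g. one per school or course; the numbers are made up.
chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Each chunk could be summarized on a different machine; only three numbers
# per chunk ever need to travel to the reducer.
n, total, total_sq = reduce(combine, map(summarize, chunks))
overall_mean = total / n
overall_var = total_sq / n - overall_mean ** 2  # population variance
```

Means and variances decompose this way because they depend only on additive summaries. A multilevel model does not decompose so neatly, since estimates for one group borrow strength from all the others, which is exactly the sticking point described above. The sum-of-squares shortcut is also numerically fragile for large values, so treat this as a sketch of the idea rather than production code.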

It’s okay to be stumped, I think — part of creation is living with uncertainty:

Most people who are not consummate creators avoid tension. They want quick answers. They don’t like living in the realm of not knowing something they want to know. They have an intolerance for those moments in the creative process in which you have no idea how to get from where you are to where you want to be. Actually, this is one of the very best moments there are. This is when something completely original can be born, when you go beyond your usual ways of addressing similar situations, where you can drive the creative process into high gear. [Robert Fritz on supercharging the creative process]

Alice ate some cake that made her bigger. Is there some cake that will make me and my methods big enough to answer the questions I want answered? For now I’m in the realm of not knowing but I hope in 2012 I will have some answers: first, answers about how to make myself big again, and second, answers from the data.