Tag Archives: data mining

Links for February 22, 2012

Elizabeth Gilbert on What the Porcupine Dilemma Can Teach Us About the Secret of Happiness [Maria Popova/Brain Pickings]. Elizabeth Gilbert on Schopenhauer’s porcupines. Staying warm without impaling yourself on someone else’s spines.

Target, Pregnancy, and Predictive Analytics, Part II [Dean Abbott/Data Mining and Predictive Analytics. The Target story was interesting for what it says about the possibilities and perils of analytics. This was my favorite writeup, for its overview of to succeed with data analysis:

1) understand the data,
2) understand why the models are focusing on particular input patterns,
3) ask lots of questions (why does the model like these fields best? why not these other fields?)
4) be forensic (now that’s interesting or that’s odd…I wonder…),
5) be prepared to iterate, (how can we predict better for those customers we don’t characterize well)
6) be prepared to learn during the modeling process

We have to “notice” patterns in the data and connect them to behavior. This is one reason I like to build multiple models: different algorithms can find different kinds of patterns. Regression is a global predictor (one continuous equation for all data), whereas decision trees and kNN are local estimators.

You Are Responsible for Getting Your Ideas to Spread [Tim Kastelle/Innovation Leadership Network]. Don’t blame the customer if your idea isn’t compelling; that’s a failure of your idea or your communication of it.

Machine Learning for Hackers [Review from David Smith/Revolution Analytics blog]. Sounds like a book I need to order.

Rather than merely providing a “cookbook” approach to say, building a “who to follow” recommendation system for Twitter, it takes the time to explain the methodology behing the algorithms and give the reader a better basis for understanding why these methods work (and, equally importantly, how they can go wrong).

What’s new? Exuberance for novelty has benefits [John Tierney/The New York Times]. In a longitudinal study, people who combined novelty-seeking with persistence and “self-transcendence” showed the most success over the years (good health, lots of friends, few emotional problems, greatest satisfaction with life).

How data science is like magic

In The Magicians[1], Lev Grossman describes magic as it might exist, but he could as well be describing the real-world practice of statistical analysis or software development:

As much as it was like anything, magic was like a language. And like a language, textbooks and teachers treated it as an orderly system for the purposes of teaching it, but in reality it was complex and chaotic and organic. It obeyed rules only to the extent that it felt like it, and there were almost as many special cases and one-time variations as there were rules. These Exceptions were indicated by rows of asterisks and daggers and other more obscure typographical fauna which invited the reader to peruse the many footnotes that cluttered up the margins of magical reference books like Talmudic commentary.

It was Mayakovsky’s [the teacher’s] intention to make them memorize all these minutiae, and not only to memorize them but to absorb and internalize them. The very best spellcasters had talent, he told his captive, silent audience, but they also had unusual under-the-hood mental machinery, the delicate but powerful correlating and cross-checking engines necessary to access and manipulate and manage this vast body of information. (p149)

To be a good data scientist, whether using traditional statistical techniques or machine learning algorithms (or both), you must know all the rules and approach it first as an orderly system. Then you begin to learn all the special cases and one-time variations and you study and study and practice and practice until you can almost unconsciously adjust to each unique situation that arises.

When I took ANOVA in my Ph.D. program, I could hardly believe there was entire course devoted to it. But it was much like Grossman’s description above. Each week we learned new special cases and one-time variations. I did ANOVA in so many different Circumstances that now I have absorbed and internalized its application as well as the design of studies that would usefully be analyzed with it or with some more flexible variation of it (e.g., hierarchical linear modeling). It felt cookbook at the beginning, but at the end of the course, I felt like I’d begun to develop that “unusual under-the-hood mental machinery” that Grossman suggested an effective magician in his imagined world would need.

That’s not to say that there aren’t important universal principles and practices and foundational knowledge to understand if you are to be an effective statistician or data miner or machine learner programmer; it’s not to say that awareness of Circumstances and methodical practice are all you need. It is to say that data science is ultimately a practice not a philosophy and you reach expertise in it through doing things over and over again, each time in slightly different ways.

In The Magicians, protagonist Quentin practices Legrand’s Hammer Charm, under thousands of different Circumstances:

Page by page the Circumstances listed in the book became more and more esoteric and counterfactual. He cast Legrand’s Hammer Charm at noon and at midnight, in summer and winter, on mountaintops and a thousand yards beneath the earth’s surface. He cast the spell underwater and on the surface of the moon. He cast it in early evening during a blizzard on a beach on the island of Mangareva, which would almost certainly never happen since Mangareva is part of French Polynesia, in the South Pacific. He cast the spell as a man, as a woman, and once–was this really relevant?–as a hermaphrodite. He cast it in anger, with ambivalence, and with bitter regret. (pp150-151)

Sometimes I feel like I have fit logistic regression in all these situations (perhaps not as a hermaphrodite). The next logistic regression I fit, I will say to myself “Wax on, wax off” as Quentin did when faced with a new spell that he had to practice according to each set of Circumstances.


[1]Highly recommended, but with caveats. Read it last summer — loved it — sent it to my 15-year-old son at camp. He loved it too and bought me the sequel for Christmas. After reading the second one, I had to re-read the first. It’s a polarizing book. Don’t pick it up if you are offended by heavy drinking, gratuitous sex, and a wandering plot. Do pick it up if you felt like your young adulthood was marked by heavy drinking, gratuitous sex, a wandering plot, and not nearly enough magic. My son tends to read adult books so I didn’t hesitate to share it with him, but it probably would not be appropriate for most teenagers.