books, links

Links for February 22, 2012

Elizabeth Gilbert on What the Porcupine Dilemma Can Teach Us About the Secret of Happiness [Maria Popova/Brain Pickings]. Elizabeth Gilbert on Schopenhauer’s porcupines. Staying warm without impaling yourself on someone else’s spines.

Target, Pregnancy, and Predictive Analytics, Part II [Dean Abbott/Data Mining and Predictive Analytics]. The Target story was interesting for what it says about the possibilities and perils of analytics. This was my favorite writeup, for its overview of how to succeed with data analysis:

1) understand the data,
2) understand why the models are focusing on particular input patterns,
3) ask lots of questions (why does the model like these fields best? why not these other fields?)
4) be forensic (now that’s interesting or that’s odd…I wonder…),
5) be prepared to iterate, (how can we predict better for those customers we don’t characterize well)
6) be prepared to learn during the modeling process

We have to “notice” patterns in the data and connect them to behavior. This is one reason I like to build multiple models: different algorithms can find different kinds of patterns. Regression is a global predictor (one continuous equation for all data), whereas decision trees and kNN are local estimators.
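The global-versus-local distinction in that last sentence can be made concrete with a small sketch. This is my own toy illustration, not Abbott's code: the data, the helper names `fit_line` and `knn_predict`, and the choice of k = 3 are all assumptions, and only kNN stands in for the local estimators (a decision tree would behave similarly on this data).

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x: one equation for all the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def knn_predict(xs, ys, x_new, k=3):
    """Average the y values of the k training points nearest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return mean(y for _, y in nearest)


# Hypothetical toy data with a step: y is 0 for small x and 10 for large x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 10, 10, 10, 10]

line = fit_line(xs, ys)

# The line is pulled by every point, so it is slightly off even at x = 2.
print("regression at x=2:", round(line(2), 3))
# kNN only looks at the neighbours of x = 2, all of which have y = 0.
print("kNN at x=2:", knn_predict(xs, ys, 2))
```

On data like this the single regression line smooths over the step, while kNN reproduces it away from the boundary. On data that really is linear, the comparison would favour the regression, which is the argument for building more than one model.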

You Are Responsible for Getting Your Ideas to Spread [Tim Kastelle/Innovation Leadership Network]. Don’t blame the customer if your idea isn’t compelling; that’s a failure of your idea or your communication of it.

Machine Learning for Hackers [Review from David Smith/Revolution Analytics blog]. Sounds like a book I need to order.

Rather than merely providing a “cookbook” approach to say, building a “who to follow” recommendation system for Twitter, it takes the time to explain the methodology behind the algorithms and give the reader a better basis for understanding why these methods work (and, equally importantly, how they can go wrong).

What’s new? Exuberance for novelty has benefits [John Tierney/The New York Times]. In a longitudinal study, people who combined novelty-seeking with persistence and “self-transcendence” showed the most success over the years (good health, lots of friends, few emotional problems, greatest satisfaction with life).

books, statistics

How data science is like magic

In The Magicians[1], Lev Grossman describes magic as it might exist, but he could as well be describing the real-world practice of statistical analysis or software development:

As much as it was like anything, magic was like a language. And like a language, textbooks and teachers treated it as an orderly system for the purposes of teaching it, but in reality it was complex and chaotic and organic. It obeyed rules only to the extent that it felt like it, and there were almost as many special cases and one-time variations as there were rules. These Exceptions were indicated by rows of asterisks and daggers and other more obscure typographical fauna which invited the reader to peruse the many footnotes that cluttered up the margins of magical reference books like Talmudic commentary.

It was Mayakovsky’s [the teacher’s] intention to make them memorize all these minutiae, and not only to memorize them but to absorb and internalize them. The very best spellcasters had talent, he told his captive, silent audience, but they also had unusual under-the-hood mental machinery, the delicate but powerful correlating and cross-checking engines necessary to access and manipulate and manage this vast body of information. (p149)

To be a good data scientist, whether using traditional statistical techniques or machine learning algorithms (or both), you must know all the rules and approach it first as an orderly system. Then you begin to learn all the special cases and one-time variations and you study and study and practice and practice until you can almost unconsciously adjust to each unique situation that arises.

When I took ANOVA in my Ph.D. program, I could hardly believe there was an entire course devoted to it. But it was much like Grossman’s description above. Each week we learned new special cases and one-time variations. I did ANOVA in so many different Circumstances that now I have absorbed and internalized its application as well as the design of studies that would usefully be analyzed with it or with some more flexible variation of it (e.g., hierarchical linear modeling). It felt cookbook at the beginning, but at the end of the course, I felt like I’d begun to develop that “unusual under-the-hood mental machinery” that Grossman suggested an effective magician in his imagined world would need.

That’s not to say that there aren’t important universal principles and practices and foundational knowledge to understand if you are to be an effective statistician or data miner or machine learning programmer; it’s not to say that awareness of Circumstances and methodical practice are all you need. It is to say that data science is ultimately a practice, not a philosophy, and you reach expertise in it through doing things over and over again, each time in slightly different ways.

In The Magicians, protagonist Quentin practices Legrand’s Hammer Charm under thousands of different Circumstances:

Page by page the Circumstances listed in the book became more and more esoteric and counterfactual. He cast Legrand’s Hammer Charm at noon and at midnight, in summer and winter, on mountaintops and a thousand yards beneath the earth’s surface. He cast the spell underwater and on the surface of the moon. He cast it in early evening during a blizzard on a beach on the island of Mangareva, which would almost certainly never happen since Mangareva is part of French Polynesia, in the South Pacific. He cast the spell as a man, as a woman, and once–was this really relevant?–as a hermaphrodite. He cast it in anger, with ambivalence, and with bitter regret. (pp150-151)

Sometimes I feel like I have fit logistic regression in all these situations (perhaps not as a hermaphrodite). The next logistic regression I fit, I will say to myself “Wax on, wax off” as Quentin did when faced with a new spell that he had to practice according to each set of Circumstances.

[1]Highly recommended, but with caveats. Read it last summer — loved it — sent it to my 15-year-old son at camp. He loved it too and bought me the sequel for Christmas. After reading the second one, I had to re-read the first. It’s a polarizing book. Don’t pick it up if you are offended by heavy drinking, gratuitous sex, and a wandering plot. Do pick it up if you felt like your young adulthood was marked by heavy drinking, gratuitous sex, a wandering plot, and not nearly enough magic. My son tends to read adult books so I didn’t hesitate to share it with him, but it probably would not be appropriate for most teenagers.


Campbell et al. on experimentation and quasi-experimentation

Ph.D. Topics : Research and Evaluation Methods

For my Ph.D. comprehensive exam, I not only have to respond thoroughly and knowledgeably to essay questions, I need to cite sources. This part of academic life feels odd to me, this reliance on citing someone else rather than making a good argument. I attended a dissertation defense spring quarter and found it strange that the defender spent a lot of time citing this or that book or article rather than actually intellectually arguing for particular positions. I guess when you’re talking about SEM fit index cutoffs that makes some sense, as one of the best intellectual arguments for them may be the results of a simulation study. But in many other cases, I think you’d want to back up your citation with some rhetoric.

I do agree you need both: you need expert works you can cite and you need to make good arguments. Anyway, if I want to pass my comps, I must learn and memorize the key authorities and works to cite. Ideally I would read and study all these works myself, but in the absence of the time to do that, at least I can learn more about them than just the authors and dates. It would feel intellectually dishonest to me to cite these works without having a really good idea what they are about and what is in them.

As far as experimental and quasi-experimental design goes, the key authority is clearly Donald Campbell and the three works I see cited over and over are:

  • Campbell, D. & Stanley, J. (1963). Experimental and quasi-experimental designs for research. Chicago, IL: Rand-McNally.
  • Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin Company.
  • Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

These are actually three versions of the same seminal work that began with a book chapter in 1963, was published as a small book in 1966, was “greatly expanded” in 1979, and was issued in a new edition in 2002 that is “encyclopedic in its coverage” (Rosenthal & Rosnow, 2008).

Campbell & Stanley (1963, 1966) introduced the terms internal validity and external validity while the later Cook & Campbell (1979) edition added statistical conclusion validity and construct validity (Rosenthal & Rosnow, 2008). The first version of this work also introduced the term quasi-experiment.

Here is a pdf of chapters 1 and 14 from the 2002 edition, covering general topics in causation and experimentation as well as a self-critique of their work. I think I’ll print it out and read it.


Rosenthal, R. & Rosnow, R.L. (2008). Essentials of Behavioral Research: Methods and Data Analysis. Boston: McGraw Hill.


The trick of being a man

What would I know about being a man? Not much, but I have watched the men around me, and this sounds right:

This is an essential element of the business of being a man: to flood everyone around you in a great radiant arc of bullshit, one whose source and object of greatest intensity is yourself. To behave as if you have everything firmly under control even when you have just sailed your boat over the falls. “To keep your head,” wrote Rudyard Kipling in his classic poem “If,” which articulated the code of high-Victorian masculinity in whose fragmentary shadow American men still come of age, “when all about you are losing theirs”; but in reality, the trick of being a man is to give the appearance of keeping your head when, deep inside, the truest part of you is crying out, Oh, shit!

(from Manhood for Amateurs by Michael Chabon)

I envy men their ability to do this, to flood everyone around them in a great radiant arc of bullshit. It’s something I’d like to learn to do myself.

books, statistics

A master’s degree in statistics, in book form

Peter Kennedy’s A Guide to Econometrics covers pretty much everything I’ve forgotten from the master’s degree in statistics I did a few (!!) years ago. And it does so with just the right combination of sophistication and simplicity.

This guide tackles topics ranging from criteria for estimators (least squares, unbiasedness, maximum likelihood, asymptotics, etc.) to what to do when standard assumptions of linear regression are violated to Bayesian approaches to robust estimation.

I have just one quibble. Kennedy says, “What distinguishes an econometrician from a statistician is the former’s preoccupation with problems caused by violations of statisticians’ standard assumptions; owing to the nature of economic relationships and the lack of controlled experimentation, these assumptions are seldom met.”

I don’t agree that this is what separates econometricians from statisticians; many, perhaps most, statisticians deal with observational studies, not experimental ones (especially in social science). All of us with our hands on data have to know the assumptions of our methods and know what to do when they are violated. That’s one reason this book is so useful: violation of each important assumption merits its own chapter instead of getting buried in a description of a particular method or in an afterthought section on testing assumptions.

Very cool; highly recommended.


Drive book review: Extrinsic motivation matters too

Dan Pink’s Drive rehashes what we’ve already heard from Alfie Kohn and others: intrinsic motivation rocks; extrinsic sucks.

But Pink left out a crucial piece of Deci and Ryan’s self-determination theory. Extrinsic motivation is not uni-dimensional. Deci and Ryan identified a number of different types of extrinsic motivation (Guay, Vallerand, & Blanchard, 2000):

  • External regulation — doing something because someone else told you to or motivated you with external rewards.
  • Introjected regulation — internal pressure such as shame or guilt or wanting self-approval leads you to do something.
  • Identified regulation — doing something because you see personal value in it. Note you still might not be intrinsically motivated by the activity; it might not be very intrinsically fun or otherwise rewarding.
  • Integrated regulation — the various regulations/motivations you feel are harmoniously integrated.

Many psychological experiments on motivation assume that external regulation exhausts the domain of extrinsic motivation. Experimenters will use external rewards to get people to do something then notice that the participants’ intrinsic motivation subsequently declines. The researchers conclude “extrinsic motivation bad, intrinsic good.” The most famous example of this is the drawing experiment with preschoolers.

Most of us adults do a bunch of things that we aren’t intrinsically motivated to do. For example, I do my survey research homework not because I enjoy it (I find the subject utterly tedious) but because I know that to achieve the expertise in social science research that I want, I need to know about survey research. Just because doing that homework is extrinsically motivated doesn’t mean my motivation to do it is on the decline.

As Ryan and Deci (2000) point out, “much of what people do is not, strictly speaking, intrinsically motivated, especially after early childhood when the freedom to be intrinsically motivated is increasingly curtailed by social pressures to do activities that are not interesting and to assume a variety of new responsibilities.”

Pink’s Drive largely ignores that adults get up in the morning and do what they do mostly out of extrinsic motivation — and that’s not a bad thing, especially if you can fix it up so your work involves identified regulation or integrated regulation.


Guay, F., Vallerand, R.J., & Blanchard, C. (2000). On the assessment of situational intrinsic and extrinsic motivation: The Situational Motivation Scale (SIMS). Motivation and Emotion 24(3).

Ryan, R.M. & Deci, E.L. (2000). Self-determination theory and the facilitation of intrinsic motivation, social development, and well-being. American Psychologist 55(1).