Daily Links 04/08/2014

Cheerleaders for big data have made four exciting claims, each one reflected in the success of Google Flu Trends: that data analysis produces uncannily accurate results; that every single data point can be captured, making old statistical sampling techniques obsolete; that it is passé to fret about what causes what, because statistical correlation tells us what we need to know; and that scientific or statistical models aren’t needed because, to quote “The End of Theory”, a provocative essay published in Wired in 2008, “with enough data, the numbers speak for themselves”.

Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce — Tech News and Analysis

While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.

Some diseases and conditions have a long lead time, developing so slowly, they may never actively threaten our health. When we’re diagnosed early with those, we’ve “had” the disease for longer, but we don’t live longer. All we’ve done is increase our “disease survival” by shortening the “disease-free” part of our lives.


Daily Links 03/23/2014

The gig economy (a phrase which encompasses both the related collaborative economy and sharing economy) represents a theory of the future of work that’s a viable alternative to laboring for corporate America. Instead of selling your soul to the Man, it goes, you are empowered to work for yourself on a project-by-project basis. One day it might be delivering milk, but the next it’s building Ikea furniture, driving someone to the airport, hosting a stranger from out of town in your spare bedroom, or teaching a class on a topic in which you’re an expert. The best part? The work will come to you, via apps on your smartphone, making the process of finding work as easy as checking your Twitter feed.

Here’s how it works: Say you are a startup founder who’s looking to meet with an Android app developer. You sign up for the service by logging in with your LinkedIn profile, selecting the topics that interest you (in this case probably startups, programming, etc.), and choosing your favorite coffee shops. And just like how you update your Facebook status, you can write, “I’m looking to meet with an Android app developer.” The message will go public, and anyone on the site can respond to you.

Divergent, Harry Potter, and YA fiction’s desire for self-categorization.

Of course, those dopamine-dazzled brains are still maturing, so a full command of abstract thought remains out of reach. Which means that teenagers can be the most literal-minded people you’ll ever meet in your life. Self-invention is hard, and it helps to have a set blueprint. Enter labels and stereotypes: the Gryffindors, Givers, and Geeks who turn the chaotic terrarium of high school into a taxonomist’s paradise.

Daily Links 03/20/2014

There is also considerable evidence that when we are wrestling with decisions that involve many factors, simply concentrating on the problem analytically in the foreground of our thinking — focusing — does not necessarily lead to the best outcomes.

One surprise from the panel is that many self-identified data professionals can’t actually do anything useful with the information they hold. According to Hilary Mason, a data scientist at Accel Partners, some companies are forgetting to employ the scientific method.

“Companies say they’re data driven, but they use it poorly…They lack an experimental process,” said Mason, adding that if a startup appears unable to extract value from its own data set, it’s unlikely anyone else is going to want to invest in that data.

Daily Links 03/18/2014

Paco Nathan’s answer to Data Analysis: What is the difference between Data Analytics, Data Analysis, Data Mining and Data Science? – Quora

The challenges circa 2013-fwd, given the rise of “Internet of Things”, real-world sensor grids, etc., are much different than the kind of math and systems used at Google, Twitter, Facebook, etc. Think about it: the global market segment for tractors and heavy equipment is valued at 1/4 Trillion US$, and when those kinds of businesses embrace data science, they will not be building social networks. They will be performing massive scale linear algebra for applying optimization theory to save billions on complex supply chain problems. So the people hired as “BI experts” at Yahoo! in 2000 will have almost no place, and moreover the people hired as “data scientists” at Facebook in 2010 will likely have a diminished role. The example of tractors is given as a “brick and mortar” contrast to Facebook, Google, etc., and yet it is quite real since tractors are becoming drones powered by satellite networks and remote clustered computing, well beyond petabyte scale.

Meet the Married Duo Behind Tech’s Biggest New Harassment Scandal



Currently evaluating Knime for its ability to do replicable analytical workflows.

Can you toolify away the data scientists?

From GigaOM:

Machine learning startup Wise.io, whose founders are University of California, Berkeley, astrophysicists, has raised a $2.5 million series A round of venture capital. The company claims its software, which was built to analyze telescope imagery, can simplify the process of predicting customer behavior.

And from Wise.io’s website:

Read how one fast growing Internet company turned to Wise to get a Lead Scoring Application, scrapped their plans to hire a Data Scientist and replaced their custom code with an end-to-end Leading scoring Application in a couple of weeks.

Uh oh. Pretty soon no one’s going to hire data scientists!*

[Aside: Editors will still be needed. How about "lead scoring application" instead of "Leading scoring Application?" Heh. Aside to the aside: predictive lead scoring is among the easiest of data science problems currently confronting humans.]

But it’s not that easy to “toolify” data analysis. We need human data scientists for now, because humans are able to:

  • Frame problems
  • Design features
  • Unfool themselves

What we’re talking about here is freestyle data science – humans supported with advanced data science tools. [See: Tyler Cowen on freestyle chess.]

Not just any old humans. Humans who know how to do data science.

Frame problems
Data science projects do not arrive on an analyst’s desk like graduate school data analysis projects, with the question you are supposed to answer given to you. Instead, data scientists work with business experts to identify areas potentially amenable to optimization with data-based approaches. Then the data scientist does the difficult work of turning an intuition about how data might help into a machine learning problem. She formulates decision problems in terms of expected value calculations. She selects machine learning algorithms that may be useful. She figures out what data might be relevant. She gathers it and preps it (see below) and explores it.
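To make "decision problems in terms of expected value" concrete, here is a minimal sketch for the lead scoring case. The probabilities, deal value and contact cost are made-up numbers, and the function names are mine, not anything from Wise.io or another tool. The point is only that a model's predicted probability becomes a decision once you attach costs and benefits to it.

```python
def expected_value_of_contact(p_convert, deal_value, contact_cost):
    """Expected net gain from contacting one lead."""
    return p_convert * deal_value - contact_cost


def should_contact(p_convert, deal_value, contact_cost):
    """Contact a lead only when the expected net gain is positive."""
    return expected_value_of_contact(p_convert, deal_value, contact_cost) > 0


# Hypothetical model output: predicted conversion probability per lead.
leads = {"lead_a": 0.02, "lead_b": 0.10, "lead_c": 0.30}
DEAL_VALUE = 500.0    # assumed revenue from a converted lead
CONTACT_COST = 25.0   # assumed cost of a sales call

to_contact = [
    name for name, p in leads.items()
    if should_contact(p, DEAL_VALUE, CONTACT_COST)
]
print(to_contact)
```

With these numbers the break-even probability is 25 / 500 = 0.05, so lead_a falls below the line and the other two are worth a call. Change the cost or the deal value and the answer changes with it, which is why framing comes before modeling.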

Design features
Just as real-world data science projects do not arrive with a neat question and a preselected algorithm, they also do not arrive with variables all prepped and ready. You start with a set of transaction records (or a file system full of logs… or a text corpus… or an image database) and then you think, “O human self, how can I make this mess of numbers and characters and yeses and nos representative of what’s going on in the real world?” You might log-transform or bin currency amounts. You might need to think about time and how to represent changing trends (with an exponentially weighted moving average, perhaps?). You might distill your data – mapping countries to regions, for example. You might enrich it – gathering firmographics from Dun & Bradstreet.
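A rough sketch of what a few of these transformations can look like, using only the Python standard library. The bin edges, the smoothing factor and the country-to-region table are arbitrary choices for illustration. Picking them well is exactly the part that takes a human who knows the domain.

```python
import math


def log_amount(amount):
    """Log-transform a non-negative currency amount; log1p maps 0 to 0."""
    return math.log1p(amount)


def bin_amount(amount, edges=(10, 100, 1000)):
    """Return the index of the first edge the amount falls below,
    or len(edges) if it is at or above every edge."""
    for i, edge in enumerate(edges):
        if amount < edge:
            return i
    return len(edges)


def ewma(values, alpha=0.5):
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = []
    for v in values:
        if not smoothed:
            smoothed.append(float(v))
        else:
            smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical lookup table for distilling countries into regions.
REGION_BY_COUNTRY = {"FR": "EMEA", "DE": "EMEA", "US": "AMER", "JP": "APAC"}


def region(country_code):
    """Map a country code to a region, with a catch-all for unknown codes."""
    return REGION_BY_COUNTRY.get(country_code, "OTHER")


print(bin_amount(250), ewma([10, 20, 30]), region("FR"))
```

A higher alpha makes the moving average follow recent values more closely, and a lower one smooths harder. Which is right depends on how quickly the thing you are measuring really changes.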

You also must think about missing values and outliers. Do you ignore them? Impute and replace them? Machines can be given instruction in how to handle these but they may do it crudely, without understanding the problem domain and without being able to investigate why certain values are missing or outlandish.
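Here is a deliberately crude sketch of the mechanical version: fill missing values with the median, keep a flag recording which ones were filled, and mark values that sit far from the median. The threshold of three median absolute deviations is an assumption, not a rule. And the code cannot tell you why a value was missing or outlandish, which is the part that matters.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed values.

    Also returns a parallel list of flags so the fact that a value
    was missing is not lost after imputation.
    """
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    filled = [med if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


def flag_outliers(values, k=3.0):
    """Flag values more than k median absolute deviations from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(v - med) > k * mad for v in values]


# Made-up order amounts with one gap and one suspicious entry.
amounts = [10, None, 12, 11, 500]
filled, was_missing = impute_median(amounts)
outlier = flag_outliers(filled)
print(filled, was_missing, outlier)
```

Whether the 500 is a data entry error or your best customer is not something the flag can decide.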

Unfool themselves
We humans have an uncanny ability to fool ourselves. And in data analysis, it seems easier than ever to do so. We fool ourselves with leakage. We fool ourselves by cross-validating with correlated data. We fool ourselves when we think a just-so story captures truth. We fool ourselves when we think that big data means we can analyze everything and we don’t need to sample.
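One common way to cross-validate with correlated data is to let rows from the same customer land on both sides of the train/test split. A minimal sketch of a guard against that follows, assuming each row carries a group label such as a customer ID. The function is hypothetical and written from scratch; libraries such as scikit-learn offer a GroupKFold splitter for the same purpose.

```python
def group_folds(groups, n_folds=3):
    """Yield (train_idx, test_idx) pairs in which no group appears on both sides."""
    unique = sorted(set(groups))
    for fold in range(n_folds):
        # Every n_folds-th group, starting at this fold's offset, is held out.
        test_groups = set(unique[fold::n_folds])
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


# Made-up data: one customer ID per transaction row.
customers = ["a", "a", "b", "b", "c", "c", "c", "d"]
folds = list(group_folds(customers, n_folds=2))

for train_idx, test_idx in folds:
    train_groups = {customers[i] for i in train_idx}
    test_groups = {customers[i] for i in test_idx}
    # The same customer never shows up in both training and test rows.
    assert train_groups.isdisjoint(test_groups)
```

This handles only one kind of correlation. Leakage through time, or through a feature that quietly encodes the answer, still needs someone to go looking for it.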

[Aside: We are always sampling. We sample in time. We sample from confluences of causes. Analyzing all the data that exists doesn't mean that we have analyzed all the data that a particular data generation mechanism could generate.]

What humans can do is unfool ourselves. Machines cannot do that because they don’t understand the world well enough. How do we do so? By careful thought about the problem domain and questioning of too-good predictive results. With carefully designed observational studies. With carefully designed and executed experiments.

Care: that’s what humans bring to machine learning and machines do not. We care about the problem domain and the question we want to answer. We care about developing meaningful measures of things going on in that domain. We care about ensuring we don’t fool ourselves.

* Even if we can toolify data science, and I have no doubt we will keep moving toward that, the tool vendors will still need data scientists. I predict continued full employment. But I may be fooling myself.

Daily Links 03/17/2014

These simple examples outline the heart of the problem with data:  interpretation.  Data by itself is of little value.  It is only when it is interpreted and understood that it begins to become information.  GovTech recently wrote an article outlining why search engines will not likely replace actual people in the near future.  If it were merely a question of pointing technology at the problem, we could all go home and wait for the Answer to Everything.  But, data doesn’t happen that way.  Data is very much like a computer:  it will do just as it’s told.  No more, no less.  A human is required to really understand what data makes sense and what doesn’t.  But, even then, there are many failed projects.

The Dunning–Kruger effect is a cognitive bias in which unskilled individuals suffer from illusory superiority, mistakenly rating their ability much higher than is accurate. This bias is attributed to a metacognitive inability of the unskilled to recognize their ineptitude.[1] Actual competence may weaken self-confidence, as competent individuals may falsely assume that others have an equivalent understanding.

After thinking, reading, discussing, and musing about personalization for about a year, I realized that there is a fine line between useful personalization and creepy personalization. It reminded me of the “uncanny valley” in human robotics. So I plotted the same kind of curves on two axes: Access to Data as the horizontal axis, and Perceived Helpfulness on the vertical axis.  For technology to get vast access to data AND make it past the invasive valley, it would have to be perceived as very high on the perceived helpfulness scale.

For a number of reasons, I don’t think that you can “toolify” data analysis that easily. I wished it would be, but from my hard-won experience with my own work and teaching people this stuff, I’d say it takes a lot of experience to be done properly and you need to know what you’re doing. Otherwise you will do stuff which breaks horribly once put into action on real data.

The opinions of other people matter, but they are the traps we set for ourselves. To get past our collective prison of self doubt – am I doing the right thing? Do I even know what the right thing is any more? – concentrate on the daily routine of doing what you enjoy, what you believe in, what you find intrinsically satisfying.

Daily Links 03/16/2014

Without a good router to provide reliable Wi-Fi, your Dropbox file-sharing application is not going to sync; without Nvidia’s graphics processing unit, your BuzzFeed GIF is not going to make anyone laugh. The talent — and there’s a ton of it — flowing into Silicon Valley cares little about improving these infrastructural elements. What they care about is coming up with more web apps.

Instead of drawing focus to how girls don’t fit the traditional male model of leadership, maybe we should look at how that model of leadership is changing, as is the world around it. It might even be time to encourage certain boys to be less bossy, that the way to increase your power is to assume the mental vantage point of the least powerful person in the room. And just as we shouldn’t criticize girls for acting quote-unquote ‘masculine’ (“that bitch”) we shouldn’t criticize boys for acting quote-unquote ‘feminine’ (“that pussy”). It might help to recognize that the most powerful and beloved leaders can tune into the crowd and move it from the inside out, combining traits that are associated with both genders.