Category Archives: data science

Putting the science in data science

Data science is not just overhyped marketing BS, at least not if you are doing it right.

Owning up to the title of data scientist [Sean McClure | Data Science Central]:

To own up to the title of data scientist means practitioners, vendors and organizations must be held accountable to using the term science, just as is expected from every other scientific discipline. What makes science such a powerful approach to discovery and prediction is the fact that its definition is fully independent of human concerns. Yes, we apply science to the areas we are interested in, and are not immune to bias and even falsification of results. But these deviations of the practice do not survive the scientific approach. They are weeded out by the self-consistent and testable mechanisms that underly the scientific method. There is a natural momentum to science that self-corrects and its ability to do this is fully understandable because what survives is the truth. The truth, whether inline with our wishes or not, is simply the way the world works.

Opinions, tools of the trade, programing languages and ‘best’ practices come and go, but what alway survives is the underlying truth that governs how complex systems operate. That ‘thing’ that does work in real world settings. That concept that does explain the behavior with enough predictive accuracy to solve challenges and help organizations compete. This requires discovery; not engineered systems, business acumen, or vendor software. Those toolsets and approaches are only as powerful as the science that drives their execution and provides them their modeled behavior. It is not a product that defines data science, but an intangible ability to conduct quality research that turns raw resources into usable technology.

Why are we doing this? To make our software better – to help it learn about the world and then, based on that learning, improve business outcomes:

The software of tomorrow isn’t programming ‘simple’ logic into machines to produce some automated output. It is using probabilistic approaches and numerical and statistical methods to ‘learn’ the behavior and act accordingly. The software of tomorrow is aware of the market in which it operates and takes actions that are inline with the models sitting under its hood; models that have been built from intense research on some underlying phenomenon that the software interacts with. Science is now being called upon to be a directly-involved piece of real-world products and for that reason, like never before in history, the demand for ushering in science to help enterprise compete is exploding.

Any time someone equates data science with storytelling I get worked up. Science is not storytelling and neither is data science. There is science to figuring out how the world works and how to make things better based on knowing how it works.

Can you toolify away the data scientists?

From GigaOM:

Machine learning startup Wise.io, whose founders are University of California, Berkeley, astrophysicists, has raised a $2.5 million series A round of of venture capital. The company claims its software, which was built to analyze telescope imagery, can simplify the process of predicting customer behavior.

And from Wise.io’s website:

Read how one fast growing Internet company turned to Wise to get a Lead Scoring Application, scrapped their plans to hire a Data Scientist and replaced their custom code with an end-to-end Leading scoring Application in a couple of weeks.

Uh oh. Pretty soon no one’s going to hire data scientists!*

[Aside: Editors will still be needed. How about “lead scoring application” instead of “Leading scoring Application?” Heh. Aside to the aside: predictive lead scoring is among the easiest of data science problems currently confronting humans.]

But it’s not that easy to “toolify” data analysis. We need human data scientists for now, because humans are able to:

  • Frame problems
  • Design features
  • Unfool themselves

What we’re talking about here is freestyle data science – humans supported with advanced data science tools. [See: Tyler Cowen on freestyle chess.]

Not just any old humans. Humans who know how to do data science.

Frame problems
Data science projects do not arrive on an analyst’s desk like graduate school data analysis projects, with the question you are supposed to answer given to you. Instead, data scientists work with business experts to identify areas potentially amenable to optimization with data-based approaches. Then the data scientist does the difficult work of turning an intuition about how data might help into a machine learning problem. She formulates decision problems in terms of expected value calculations. She selects machine learning algorithms that may be useful. She figures out what data might be relevant. She gathers it and preps it (see below) and explores it.

Design features
Just as real-world data science projects do not arrive with a neat question and preselected algorithm preformulated, they also do not arrive with variables all prepped and ready. You start with a set of transaction records (or file system full of logs… or text corpus… or image database) and then you think, “O human self, how can I make this mess of numbers and characters and yeses and nos representative of what’s going on in the real world?” You might log-transform or bin currency amounts. You might need to think about time and how to represent changing trends (exponentially weighted moving average?) You might distill your data – mapping countries to regions for example. You might enrich it – gathering firmographics from Dun & Bradstreet.

You also must think about missing values and outliers. Do you ignore them? Impute and replace them? Machines can be given instruction in how to handle these but they may do it crudely, without understanding the problem domain and without being able to investigate why certain values are missing or outlandish.

Unfool themselves
We humans have an uncanny ability to fool ourselves. And in data analysis, it seems easier than ever to do so. We fool ourselves with leakage. We fool ourselves by cross-validating with correlated data. We fool ourselves when we think a just-so story captures truth. We fool ourselves when we think that big data means we can analyze everything and we don’t need to sample.

[Aside: We are always sampling. We sample in time. We sample from confluences of causes. Analyzing all the data that exists doesn’t mean that we have analyzed all the data that a particular data generation mechanism could generate.]

What humans can do is unfool ourselves. Machines cannot do that because they don’t understand the world well enough. How do we do so? By careful thought about the problem domain and questioning of too-good predictive results. With carefully designed observational studies. With carefully designed and executed experiments.

Care: that’s what humans bring to machine learning and machines do not. We care about the problem domain and the question we want to answer. We care about developing meaningful measures of things going on in that domain. We care about ensuring we don’t fool ourselves.

* Even if we can toolify data science and I have no doubt we will move ever towards that, the tool vendors will still need data scientists. I predict continued full employment. But I may be fooling myself.