Category Archives: data science

Daily Links 04/11/2017

Demystifying data science

The key to a successful analytical model is having a robust set of variables against which to test for their predictive capabilities. And the key to having a robust set of variables from which to test is to get the business users engaged early in the process.

How machine learning is shaking up e-commerce and customer engagement

From a content perspective, [Sitecore] performs semantic analysis to:

  • Auto generate taxonomies and tagging
  • Help improve the tone of your content by analyzing for things like wordiness, slang, and other grammar-like faux pas

From a digital marketing perspective, ML can:

  • Help detect segments of your customers or audience
  • Improve the effectiveness of your testing and optimization processes
  • Provide content and product recommendations that increase the engagement time a customer spends on your website.

And from a backend perspective, it can help with fraud detection, something that every company with an e-commerce model needs to monitor actively.

Gartner 2017 magic quadrant for data science platforms: gainers and losers

Firms covered:

  • Leaders (4): IBM, SAS, RapidMiner, KNIME
  • Challengers (4): MathWorks (new), Quest (formerly Dell), Alteryx, Angoss
  • Visionaries (5): Microsoft, (new), Dataiku (new), Domino Data Lab (new), Alpine Data
  • Niche Players (3): FICO, SAP, Teradata (new)

Gartner notes that even the lowest-scoring vendors in MQ are still among the top 16 firms among over 100 vendors in the heated Data Science market.

Among those not on the quadrant, I’ve been impressed by DataRobot.

Daily Links 04/04/2017

Emotion Detection and Recognition from Text Using Deep Learning

The researchers used a data set of short English text messages labeled by Mechanical Turkers with five emotion classes: anger, sadness, fear, happiness, and excitement. A multi-layered neural network was trained to classify text messages by emotion. The model was able to classify anger, sadness, and excitement well but didn’t do well at recognizing fear.

Adapting ideas from neuroscience for AI

We don’t really know why neurons spike. One theory is that they want to be noisy so as to regularize, because we have many more parameters than we have data points. The idea of dropout [a technique developed to help prevent overfitting] is that if you have noisy activations, you can afford to use a much bigger model. That might be why they spike, but we don’t know. Another reason why they might spike is so they can use the analog dimension of time, to code a real value at the time of the spike. This theory has been around for 50 years, but no one knows if it’s right. In certain subsystems, neurons definitely do that, like in judging the relative time of arrival of a signal to two ears so you can get the direction.

Five AI Startup Predictions for 2017

My favorite: “Full stack AI startups actually work”

When you focus on a vertical, you can find high level customer needs that we can meet better with AI, or new needs that can’t be met without AI. These are terrific business opportunities, but they require much more business savvy and subject matter expertise. The generally more technical crowd starting AI startups tend to have neither, and tend to not realize the need for or have the humility to bring in the business and subject matter expertise required to ‘move up the stack’ or ‘go full stack’ as I like to call it.

The Silicon Gourmet: training a neural network to generate cooking recipes

Pears Or To Garnestmeam


¼ lb bones or fresh bread; optional
½ cup flour
1 teaspoon vinegar
¼ teaspoon lime juice
2  eggs

Brown salmon in oil. Add creamed meat and another deep mixture.

Discard filets. Discard head and turn into a nonstick spice. Pour 4 eggs onto clean a thin fat to sink halves.

Brush each with roast and refrigerate.  Lay tart in deep baking dish in chipec sweet body; cut oof with crosswise and onions.  Remove peas and place in a 4-dgg serving. Cover lightly with plastic wrap.  Chill in refrigerator until casseroles are tender and ridges done.  Serve immediately in sugar may be added 2 handles overginger or with boiling water until very cracker pudding is hot.

Yield: 4 servings

Also see In Which a Neural Network Learns to Tell Knock-Knock Jokes

The dialectic of analytics

From Gartner’s report The Life of a Chief Analytics Officer:

Analytics leaders today often serve two masters:

  • “Classic constituents,” with maintenance and development of traditional solutions for business performance measurement, reporting, BI, dashboard enhancements and basic analytics.
  • “Emerging constituents,” with new ideas, prototypes, exploratory programs, and advanced analytics opportunities.

I serve these two masters today in my job as VP, Data Science & Data Products at IQNavigator.

In my capacity as data science lead, we’re exploring innovative data-driven features built on data scientific techniques. In my capacity as data products lead, we are mostly still in the traditional business intelligence space, focusing on reporting and dashboards. Eventually the data products IQN offers will encompass both business intelligence (BI) and machine intelligence (MI) approaches but we have to start with what customers demand, and for now that is BI, not MI. I foresee that eventually MI will entirely eclipse BI but we’re not there yet, at least not in the non-employee labor management space.

I’ve come to believe in the importance of basic reporting and analytics capabilities, and that they should be distributed throughout the organization in self-service fashion. I see these capabilities as mainly playing a role in operational, day-to-day use, not in providing the aha! insights that people are so desperate to find and so sure exist if they only turn the right set of advanced analytic tools and personnel loose on their data.

I also foresee that the data science / machine intelligence space will mainly serve to optimize day-to-day operations, replacing business intelligence approaches, not surfacing wild organizationally transforming possibilities.

Gartner suggests developing a bimodal capability for managing analytics:

A bimodal capability is the marriage of two distinct, but coherent approaches to creating and delivering business change:

  • Mode 1 is a linear approach to change, emphasizing predictability, accuracy, reliability and stability.
  • Mode 2 is a nonlinear approach that involves learning through iteration, emphasizing agility and speed and, above all, the ability to manage uncertainty.

This applies to more than just analytics, of course. Gartner suggests it for a variety of IT management domains.

What would this look like? IQN already has an approach for product development that is bimodal in nature. We use agile development practices for product development. But we layer on top of it linear, time-based roadmapping as well as Balanced Scorecard departmental management. This is not as clumsy as you might imagine. It is more dialectic than synthetic in how it functions, with conflict occurring between the two approaches that is somehow resolved as we iteratively deliver features out into the marketplace, often on the schedule we promised (though not always).

In my own small world of data science and data products we do something similar, combining agile iterative processes with more linear and traditional project management. We use a Kanban-style process for data science projects but also layer on more waterfall-esque management for capabilities we need to deliver at a certain time to meet roadmap commitments.

I’m not sure I like the word “bimodal” to capture this approach. Maybe I will think of it as “dialectic.”



Paradigm shift: From BI to MI

I listened to a Gartner webinar Information 2020: Uncertainty Drives Opportunity given by Frank Buytendijk yesterday and it got me thinking about the evolution (/revolution?) from business intelligence (BI) to machine intelligence (MI). I see this happening but not as fast as I’d like, as jaded as I am about BI. Buytendijk gave me some ideas for understanding this transformation.

From his book Dealing with Dilemmas, here’s Buytendijk’s formulation of S curves that show the uptake of new technologies and approaches over time, and how they are then replaced by newer technologies and approaches.

[Figure: S curves showing the uptake of successive technologies over time]

From the book:

A trend starts hopefully; with a lot of passion, a small group of people pioneer a technology, test a new business model, or bring a new product to market. This is usually followed by a phase of disappointment. The new development turns out to be something less than a miracle. Reality kicks in. At some point, best practices emerge and a phase of evolution follows. Product functionality improves, market adoption grows, and the profitability increases. Then something else is introduced, usually by someone else. … This replacement then goes through the same steps.

This is where I think we are with machine intelligence for enterprise software. We’ve reached the end of the line for business intelligence, the prior generation of analytics. It has plateaued. There’s not much more it can do to impact business outcomes–a topic that deserves its own post.

What instead? What next? Machine intelligence. MI not BI. Let’s let computers do what they do well–dispassionately crunch numbers. And let humans do what they do well–add context and ongoing insight and the flexibility that enterprise reality demands. Then weave these together into enterprise software applications that feature embedded, pervasive advanced analytics that optimize business micro-decisions and micro-actions continuously.

We’re not quite ready for that yet. While B2C data science has advanced, B2B data science has hardly launched, outside of some predictive modeling of leads in CRM and a bit of HR analytics. BI for B2B doesn’t give us the value we need. But MI for B2B has barely reached toddlerhood.

We are, in Buytendijk’s terms, in the “eye of ambiguity,” that space where one paradigm is plateauing but another has not yet proved itself. It’s very difficult at this point to jump from one S curve to the next–see how far apart they are?–because the new paradigm has not proven itself yet.

It’s almost Kuhnian, isn’t it?

Recently one of the newish data scientists in my group said, “it seems like a lot of people don’t believe in this.” This, meaning data science. I agreed with him that it had yet to prove its worth in enterprise software and that many people did not believe it ever would. But it seems clear to me that sometime–in five years? ten years?–machines will help humans run enterprise processes much more efficiently and effectively than we are running them now.

My colleague’s comment reminded me of some points Peter Sheahan of ChangeLabs made at the Colorado Technology Association’s APEX conference last November. He proposed that we don’t have to predict the future in order to capitalize on future trends because people are already talking about what’s coming. Instead, we need to release ourselves from legacy biases and practices. This was echoed by Buytendijk in his webinar: “best practices are the solutions for yesterday’s problems.”

It’s exciting to be in on the acceleration at the front of the S curve but frustrating sometimes too. It’s hard to communicate that data science and the machine intelligence it can generate are not the same as business intelligence and data storytelling. People don’t get it. Then a few do. And a few more.

I look forward to being around when it really catches on.

Putting the science in data science

Data science is not just overhyped marketing BS, at least not if you are doing it right.

Owning up to the title of data scientist [Sean McClure | Data Science Central]:

To own up to the title of data scientist means practitioners, vendors and organizations must be held accountable to using the term science, just as is expected from every other scientific discipline. What makes science such a powerful approach to discovery and prediction is the fact that its definition is fully independent of human concerns. Yes, we apply science to the areas we are interested in, and are not immune to bias and even falsification of results. But these deviations of the practice do not survive the scientific approach. They are weeded out by the self-consistent and testable mechanisms that underlie the scientific method. There is a natural momentum to science that self-corrects and its ability to do this is fully understandable because what survives is the truth. The truth, whether in line with our wishes or not, is simply the way the world works.

Opinions, tools of the trade, programming languages and ‘best’ practices come and go, but what always survives is the underlying truth that governs how complex systems operate. That ‘thing’ that does work in real world settings. That concept that does explain the behavior with enough predictive accuracy to solve challenges and help organizations compete. This requires discovery; not engineered systems, business acumen, or vendor software. Those toolsets and approaches are only as powerful as the science that drives their execution and provides them their modeled behavior. It is not a product that defines data science, but an intangible ability to conduct quality research that turns raw resources into usable technology.

Why are we doing this? To make our software better – to help it learn about the world and then, based on that learning, improve business outcomes:

The software of tomorrow isn’t programming ‘simple’ logic into machines to produce some automated output. It is using probabilistic approaches and numerical and statistical methods to ‘learn’ the behavior and act accordingly. The software of tomorrow is aware of the market in which it operates and takes actions that are in line with the models sitting under its hood; models that have been built from intense research on some underlying phenomenon that the software interacts with. Science is now being called upon to be a directly-involved piece of real-world products and for that reason, like never before in history, the demand for ushering in science to help enterprise compete is exploding.

Any time someone equates data science with storytelling I get worked up. Science is not storytelling and neither is data science. There is science to figuring out how the world works and how to make things better based on knowing how it works.

Can you toolify away the data scientists?

From GigaOM:

Machine learning startup, whose founders are University of California, Berkeley, astrophysicists, has raised a $2.5 million series A round of venture capital. The company claims its software, which was built to analyze telescope imagery, can simplify the process of predicting customer behavior.

And from the company’s website:

Read how one fast growing Internet company turned to Wise to get a Lead Scoring Application, scrapped their plans to hire a Data Scientist and replaced their custom code with an end-to-end Leading scoring Application in a couple of weeks.

Uh oh. Pretty soon no one’s going to hire data scientists!*

[Aside: Editors will still be needed. How about “lead scoring application” instead of “Leading scoring Application?” Heh. Aside to the aside: predictive lead scoring is among the easiest of data science problems currently confronting humans.]

But it’s not that easy to “toolify” data analysis. We need human data scientists for now, because humans are able to:

  • Frame problems
  • Design features
  • Unfool themselves

What we’re talking about here is freestyle data science – humans supported with advanced data science tools. [See: Tyler Cowen on freestyle chess.]

Not just any old humans. Humans who know how to do data science.

Frame problems
Data science projects do not arrive on an analyst’s desk like graduate school data analysis projects, with the question you are supposed to answer given to you. Instead, data scientists work with business experts to identify areas potentially amenable to optimization with data-based approaches. Then the data scientist does the difficult work of turning an intuition about how data might help into a machine learning problem. She formulates decision problems in terms of expected value calculations. She selects machine learning algorithms that may be useful. She figures out what data might be relevant. She gathers it and preps it (see below) and explores it.
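
To make the expected value framing a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the conversion probabilities, the gain of 100, the cost of 5, and the helper names expected_value and should_act are all made up, and in practice the probabilities would come from a model you trust.

```python
def expected_value(p_success, gain, cost):
    """Expected net value of taking an action: p * gain - cost."""
    return p_success * gain - cost


def should_act(p_success, gain, cost):
    """Act only when the expected net value is positive."""
    return expected_value(p_success, gain, cost) > 0


# Hypothetical lead-scoring decision: calling a lead costs 5,
# and a converted lead is worth 100. Probabilities are invented.
leads = {"lead_a": 0.02, "lead_b": 0.10, "lead_c": 0.40}
to_call = [name for name, p in leads.items()
           if should_act(p, gain=100, cost=5)]
```

The point of the framing is that the machine learning model only supplies the probability. Deciding what the gain and the cost are, and whether they are even the right quantities, is the human part.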

Design features
Just as real-world data science projects do not arrive with a neat question and a preselected algorithm, they also do not arrive with variables all prepped and ready. You start with a set of transaction records (or file system full of logs… or text corpus… or image database) and then you think, “O human self, how can I make this mess of numbers and characters and yeses and nos representative of what’s going on in the real world?” You might log-transform or bin currency amounts. You might need to think about time and how to represent changing trends (an exponentially weighted moving average?). You might distill your data – mapping countries to regions for example. You might enrich it – gathering firmographics from Dun & Bradstreet.
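
A few of these transformations can be sketched in plain Python. Treat this as a rough illustration, not a recipe: the bin edges, the smoothing factor alpha, the tiny REGION lookup, and all the function names are assumptions I made up for the example.

```python
import math


def log_transform(amount):
    """Compress a skewed, non-negative currency amount with log(1 + x)."""
    return math.log1p(amount)


def bin_amount(amount, edges=(100, 1_000, 10_000)):
    """Return the index of the first bin whose upper edge exceeds amount."""
    for i, edge in enumerate(edges):
        if amount < edge:
            return i
    return len(edges)  # larger than every edge: last bin


def ewma(values, alpha=0.5):
    """Exponentially weighted moving average; recent values weigh more.
    The first smoothed value is simply the first observation."""
    smoothed = []
    for v in values:
        prev = smoothed[-1] if smoothed else v
        smoothed.append(alpha * v + (1 - alpha) * prev)
    return smoothed


# Hypothetical, deliberately tiny country-to-region lookup.
REGION = {"US": "Americas", "BR": "Americas", "DE": "EMEA", "JP": "APAC"}


def region_of(country_code):
    """Distill a country code to a region, with a catch-all for unknowns."""
    return REGION.get(country_code, "Other")
```

None of these choices is obvious from the data alone. Where the bin edges go, how fast the moving average should forget, and which regions matter all depend on knowing the problem domain.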

You also must think about missing values and outliers. Do you ignore them? Impute and replace them? Machines can be given instruction in how to handle these but they may do it crudely, without understanding the problem domain and without being able to investigate why certain values are missing or outlandish.
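
Here is a sketch of the kind of crude, mechanical handling I mean, using only the Python standard library. The median fill, the missing-value flag, and the outlier threshold of 5 median absolute deviations are all assumptions chosen for illustration; none of them is the right answer in general.

```python
from statistics import median


def impute_median(values):
    """Replace None with the median of the observed values. Also return
    a flag per entry recording whether it was imputed, so the fact that
    a value was missing is not thrown away."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


def flag_outliers(values, threshold=5.0):
    """Flag values whose distance from the median exceeds `threshold`
    times the median absolute deviation (MAD)."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # No spread to measure against; flag nothing rather than guess.
        return [False] * len(values)
    return [abs(v - center) / mad > threshold for v in values]
```

The code can tell you that a value is missing or far from the rest. It cannot tell you why, and the why is usually what determines whether you should ignore it, replace it, or go fix something upstream.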

Unfool themselves
We humans have an uncanny ability to fool ourselves. And in data analysis, it seems easier than ever to do so. We fool ourselves with leakage. We fool ourselves by cross-validating with correlated data. We fool ourselves when we think a just-so story captures truth. We fool ourselves when we think that big data means we can analyze everything and we don’t need to sample.
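
One way to guard against cross-validating with correlated data is to keep related rows together when you split. Below is a minimal sketch, assuming the correlation comes from several rows belonging to the same customer. The function name group_k_fold and the round-robin assignment are my own simplification for illustration; scikit-learn's GroupKFold does a more careful job of the same idea.

```python
from collections import defaultdict


def group_k_fold(groups, k=3):
    """Yield (train_idx, test_idx) pairs such that every row sharing a
    group label lands on the same side of the split."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    # Assign whole groups to folds round-robin, in order of first appearance.
    folds = [[] for _ in range(k)]
    for i, members in enumerate(by_group.values()):
        folds[i % k].extend(members)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for j in range(k) if j != f for i in folds[j])
        yield train, test


# Hypothetical example: several transactions per customer.
customers = ["a", "a", "b", "b", "c", "c", "d"]
for train, test in group_k_fold(customers, k=2):
    train_groups = {customers[i] for i in train}
    test_groups = {customers[i] for i in test}
    assert not (train_groups & test_groups)  # no customer on both sides
```

The split is the easy part. Knowing that customer is the grouping that matters, rather than account, region, or week, is the part that still takes a human who understands where the data came from.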

[Aside: We are always sampling. We sample in time. We sample from confluences of causes. Analyzing all the data that exists doesn’t mean that we have analyzed all the data that a particular data generation mechanism could generate.]

What humans can do is unfool ourselves. Machines cannot do that because they don’t understand the world well enough. How do we do so? By careful thought about the problem domain and questioning of too-good predictive results. With carefully designed observational studies. With carefully designed and executed experiments.

Care: that’s what humans bring to machine learning and machines do not. We care about the problem domain and the question we want to answer. We care about developing meaningful measures of things going on in that domain. We care about ensuring we don’t fool ourselves.

* Even if we can toolify data science, and I have no doubt we will move ever towards that, the tool vendors will still need data scientists. I predict continued full employment. But I may be fooling myself.