Category Archives: links

Daily Links 07/18/2014

RelateIQ and Salesforce: It’s not just about data science | VentureBeat | Big Data | by Andy Byrne, Clari

5 things I wish I knew about Tableau when I started – The Information Lab

One staffing buyer comes to mind when I think about harnessing the competitive spirit of the supplier community. This buyer discloses the ranking and number of placements for each of the top 15 vendors in the program each month on an all-supplier call, and those stats are also provided via email following the call to the vendors and internal stakeholders. This not only brings transparency, builds credibility and creates trust in the program, but it also generates a level of focus and priority to that client because of the open competition it creates.

I’ve just scratched the surface of this, but I hope you got the idea that scalability can mean quite different things. In Big Data (meaning the infrastructure side of it) what you want to compute is pretty well defined, for example some kind of aggregate over your data set, so you’re left with the question of how to parallelize that computation well. In machine learning, you have much more freedom because data is noisy and there’s always some freedom in how you model your data, so you can often get away with computing some variation of what you originally wanted to do and still perform well. Often, this allows you to speed up your computations significantly by decoupling computations. Parallelization is important, too, but alone it won’t get you very far.
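As a rough illustration of "computing some variation of what you originally wanted," the sketch below fits a one-parameter linear model twice: once on the full data set and once on a small random subsample. The synthetic data, the true slope of 2.0, the sample sizes, and the helper name `fit_slope` are all assumptions made up for this example; none of them come from the quoted post.

```python
import random

def fit_slope(points):
    """Least-squares slope for y = w * x (a line through the origin)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / sxx

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic noisy data: y = 2.0 * x + Gaussian noise (assumed for illustration).
data = []
for _ in range(20000):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

full_slope = fit_slope(data)                      # the "exact" computation
sample_slope = fit_slope(rng.sample(data, 1000))  # a cheaper variation on 5% of the data

print(f"full data: {full_slope:.3f}  subsample: {sample_slope:.3f}")
```

Because the noise already limits how precisely the slope can be known, the subsample estimate usually lands close to the full one at a fraction of the cost.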

Why does data need to have sex? – High Scalability

Sex is nature’s way of bringing different data sets together, that is our genome, and creating something new that has a chance to survive.

Daily Links 07/01/2014

I’m going to argue here that a business model that could make money for software companies, while benefiting users, is creating an open market for data. Yes, your data. For sale. On an open market. For anyone to buy. Privacy is dead. Isn’t it time we leverage the death of privacy for our own gain?

The idea is to create an ecosystem around the production, consumption, and exploitation of data so that all the players can get the energy they need to live and prosper.

You need a custom MapReduce programmer every time you want to get something out of Hadoop, but that’s not the case for Spark, said Mathew. Alteryx is working toward a standardized Spark interface for asking questions directly against data sets, which broadens Spark’s accessibility from hundreds of thousands of data scientists to millions of data analysts: folks who know how to write SQL queries and model data effectively, but aren’t experts in writing MapReduce programming jobs in Java.

The Spark framework is well equipped to handle those queries, as it exploits the memory spread across all of the servers in a cluster. That means it can run analytics models at blazing-fast speeds compared to MapReduce: Programs can go as much as 100 times faster in memory or 10 times faster on disk. Those performance enhancements, and the subsequent customer demand, have prompted Hadoop distribution vendors like Cloudera and MapR to support Spark.
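The contrast drawn in this excerpt, between writing a custom job and simply asking a question in SQL, can be sketched in plain Python. This is only an analogy: it uses the standard library's `sqlite3` module as a stand-in for a SQL interface and a hand-rolled map/reduce in place of Hadoop, and the `sales` table and its rows are made up for the example. It does not use Spark or its API.

```python
import sqlite3
from collections import defaultdict

# Hypothetical records: (region, amount).
rows = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 7.5)]

# Hand-written map/reduce style: emit key/value pairs, then fold them per key.
def map_phase(records):
    for region, amount in records:
        yield region, amount

def reduce_phase(pairs):
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

mapreduce_totals = reduce_phase(map_phase(rows))

# Declarative style: the same question as a single SQL query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
conn.close()

print(mapreduce_totals == sql_totals)
```

Both approaches answer the same question; the second one needs only the query, which is the kind of skill the excerpt attributes to data analysts.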

Namely, as enterprise applications become more data-centric, the roles of data scientist and application developer are merging. In the short term, this means the two roles must learn to collaborate more effectively and both must assume new ways of thinking. For data scientists, this means starting to think more about how the insights they uncover can be translated into repeatable form factors consumable by end-users. And application developers need to gain a better understanding of data flows and how analytic requirements impact application performance.

Daily Links 06/26/2014

Why your kids will want to be data scientists

According to Burtch Works’ 2014 study of salaries for data scientists – typically those with university degrees in a quantitative field of study who are comfortable with programming languages and statistical methods – the median salary for employees not working as part of a team was $80,000 for those with 0-3 years’ experience and $150,000 for those with 9 or more years’ experience.

At the managerial level the median salaries were higher, with those responsible for a team of 1-3 earning $140,000 and those responsible for a team of 10 or more earning $232,500.

By contrast, the mean average annual income for a lawyer in America was $131,990 in 2013, while doctors earned $183,940, according to data from the U.S. Bureau of Labor Statistics.

Daily Links 06/09/2014

The reason I’m skeptical is that I believe in the science portion of our field’s name. One of the primary things that separates a data scientist from someone just building models is the ability to think carefully about things like endogeneity, causal inference, and experimental and quasi-experimental design. Data scientists must understand and think about things like data generating processes and reason through how misspecifying them could influence or undermine the inferences they draw from their analyses.

But what data can do is disprove things, often quite easily. Scott Winship will argue to death that Piketty’s market-income data is not the best kind of data to understand changes in income inequality, but what you can’t do is proclaim or expound a theory explaining a decrease in market income inequality.

Daily Links 06/08/2014

The basic idea here was that if we exposed students more directly to the educational market that Bennett had identified—making them borrow the money to attend—we could then count on those self-interested economic actors to behave as consumers are supposed to and do something about the problem. The policy was “to re-emphasize self-help,” per the New York Times. But this particular market has never worked that way, and the only effect, of course, was to raise up the Himalayas of student debt that are such a familiar part of the landscape today.

If we can’t or won’t speak in our authentic voices, if we disconnect from our own inner authority, if we refuse to ask for what we need (or don’t know what we need), how can men and women reach across the divide that separates them and recognize each other for who we truly are? Maybe we shouldn’t cluck our tongues over the rising divorce rate; maybe we should just be awed and amazed that men and women stay together for any length of time at all.

Nilofer Merchant suggests a small idea that just might have a big impact on your life and health: Next time you have a one-on-one meeting, make it into a “walking meeting” — and let ideas flow while you walk and talk.

Wonder if any of my colleagues would be up for that…

Big Data teaches you to build these systems using an architecture that takes advantage of clustered hardware along with new tools designed specifically to capture and analyze web-scale data. It describes a scalable, easy-to-understand approach to big data systems that can be built and run by a small team. Following a realistic example, this book guides readers through the theory of big data systems, how to implement them in practice, and how to deploy and operate them once they’re built.

Chapter one “a new paradigm for big data” is free – need to check it out.

Cloud security trends: The significant six-pack | Smarter Computing Blog

Instead of protecting data portals or pipelines, data-centric security focuses on data’s three states:

at rest,
in motion,
and in use.

Daily Links 06/06/2014

Weiner, who in a past life worked for Warner Brothers and Yahoo, said his company wants to build the “economy graph” which would allow LinkedIn to map jobs to skills, talent, companies and geographies.

Ever since the iPad landed in the market as a clear luxury item without a specifically defined use case, market researchers have been tracking it, alongside other tablets, to figure out what we actually use them for and where. In short, we use tablets almost everywhere we don’t use laptops, or where we would use laptops in an absolute pinch but would prefer not to: bedrooms, living rooms, bathrooms, and kitchens. Tablets have filled in a particular kind of pared-down computer experience everywhere the laptop wasn’t, or everywhere that it might have been incidentally but not quite suited to the job—sitting open and waiting to be spilled on, or nestled in some blankets on a couch or in a bed, the less-than-capacious battery trickling away.

Daily Links 06/05/2014

School funding declined in 2012 for the first time in 35 years, reports the Census Bureau. New York was the top spender, at $19,552 per pupil, while Utah spent only $6,206.

The ultimate goal is to reduce friction and error, and to deliver the value of predictive analytics accessibly and quickly. “It’s all about not treating advanced analytics as this big scary thing requiring PhDs and a big cumbersome architecture,” Hillion says. Instead, the goal is to allow enterprises to “look at an existing business problem and get results this week.”

The two classes of women also defined “sluttiness” differently, but neither definition had much to do with sexual behavior. The rich ones saw it as “trashiness,” or anything that implied an inability to dress and behave like an upper-middle-class person….

The poorer women, meanwhile, would regard the richer ones as “slutty” for their seeming rudeness and proclivity for traveling in tight-knit herds. As one woman said, “Sorority girls are kind of whorish and unfriendly and very cliquey.”

Daily Links 05/20/2014

One key issue is information leakage, and we discuss its definition, influence, detection and avoidance. We consider leakage to be the silent killer of many predictive modeling projects, and we demonstrate its impact on competitions and discuss the challenges in addressing it in real-life projects. Other challenges include framing real-life modeling objectives as predictive modeling problems, and usefully applying relational learning concepts when modeling “real-life” complex, relational datasets.
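A minimal sketch of how leakage can inflate results, assuming a made-up churn data set in which a hypothetical `closed_date` field is only filled in after a customer has already churned. The records, the field names, and the trivial rule-based "model" are all illustrative assumptions, not drawn from the quoted abstract.

```python
# Historical records: closed_date was written *after* the outcome was known,
# so it leaks the label into the features.
history = [
    {"tenure": 3,  "closed_date": "2014-01-10", "churned": True},
    {"tenure": 24, "closed_date": None,         "churned": False},
    {"tenure": 5,  "closed_date": "2014-02-02", "churned": True},
    {"tenure": 18, "closed_date": None,         "churned": False},
]

def leaky_model(record):
    """Predict churn purely from the leaked field."""
    return record["closed_date"] is not None

def accuracy(model, records):
    """Fraction of records where the prediction matches the label."""
    hits = sum(model(r) == r["churned"] for r in records)
    return hits / len(records)

# Looks perfect on historical data...
offline_accuracy = accuracy(leaky_model, history)

# ...but at prediction time no customer has a closed_date yet,
# so the model can never predict churn.
live = [dict(r, closed_date=None) for r in history]
live_accuracy = accuracy(leaky_model, live)

print(offline_accuracy, live_accuracy)
```

The gap between the two numbers is the "silent" part: nothing in the offline evaluation signals a problem until the model meets data without the leaked field.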

Daily Links 05/14/2014

We need to shift our focus to building companies around data first and software second.

A trend in big data these days is combining different types of data to get whole new ideas about your product, your company, and the world at large. Companies can put lots of data in repositories like data warehouses or even open-source Hadoop software, but then it isn’t always easy to access.

The measurement of quality and effectiveness is a top priority for enterprises with regard to contingent workforce management, on par with “cost” and “visibility” as areas of focus. However, most enterprises lack the capabilities to accurately measure the quality of their contract talent. One starting point is to assign clearly defined objectives, goals and milestones for all contingent workers, regardless of scope or talent level. This will allow any CWM program to gauge the effectiveness of its contingent workforce and utilize that information to make educated decisions in the future.

Daily Links 05/11/2014

T-Mobile asks job applicants to take this test before inviting them for an interview because the company has found powerful correlations between the online assessments and success on the job. High scorers tend to resolve customer calls about 25 seconds faster than those who receive low scores. That means they can handle one more call a day and about 250 more a year. [Via Tyler Cowen]

Most of a data scientist’s time is spent creating predictive models: finding the variables that matter to make predictions, the right type of model, the best set of parameters, etc. Work is being done to automate all of this, and so far it has resulted in solutions such as Emerald Logic’s FACET and in the creation of prediction APIs such as Google’s and Ersatz Labs’. These APIs abstract away the complexities of learning models from data. You can just focus on preparing the data (collecting/enriching/cleaning it), you then send that data to the API, it automatically creates a model, and it uses that model when you ask for predictions.

No. Most of a data scientist’s time is spent understanding business problems to solve, collecting and preparing data that can solve the problems, and later communicating results. Predictive modeling takes only a small portion of the time and it is not the only kind of analysis work that is done.

The Forgotten Job of a Data Scientist: Editing – John Foreman, Data Scientist

If your job is building models, all you do is try to build models. A data scientist’s job should be to assist the business using data regardless of whether that’s through predictive modeling, simulation, optimization modeling, data mining, visualization, or summary reporting.