It wasn’t long ago that analytics capabilities were viewed as a source of competitive advantage. These days, however, the idea that businesses should be data-driven is accepted as conventional wisdom, and analytics is considered a required capability to compete effectively in most industries. This can be seen in the increasing prevalence of data science teams and jobs. In fact, according to LinkedIn’s 2017 U.S. Emerging Jobs Report, the number of data scientist roles in the US has grown over 650% since 2012.
With this rapid growth, data science teams have emerged in organizations in a variety of forms, and there is no accepted best practice for structuring or managing a data science team. While there is no one right approach, there are several common pitfalls teams fall into when starting out or scaling. This article explores four common yet avoidable mistakes organizations make when building, operating, and growing data science or business analytics teams. By staying vigilant about these traps, analytics teams can set out on the right track to unlocking their full potential impact on organizational decision-making and performance.
1 – Isolating Data Science/Analytics Capabilities
Every organization with a data science team will inevitably face the question of how to position that team within the larger organization, most likely at multiple points in the team’s lifetime. The question can first come up when a team is starting out or, as is often the case for teams that crop up organically, later in the team’s life when it reaches a certain size or influence. Teams will generally need to revisit it, because the most effective organizational structure is likely to change as the team grows or the business needs evolve.
Across organizations, data science teams run the gamut in terms of their degree of centralization. On one side of the spectrum are centralized teams which report to a single head of data science, often called a Chief Data Officer (CDO), and work with business units to address their needs. Centralized teams are beneficial from a resourcing perspective, particularly for smaller companies or organizations with limited resources. Additionally, collaboration, information sharing, methodological consistency, and mentorship opportunities arise more naturally in this setup where data science teams sit and work together on a daily basis.
These benefits, however, are of little value if the organizational structure prevents the data science team from having its maximum potential impact on decision-making, as can be the case for centralized data teams. Separation from the business units can result in data scientists being viewed as outsiders by business stakeholders, a perception reinforced by the fact that the data scientists often do not have the full context for the problems they’re asked to address. Additionally, because individual business units lack full visibility into the competing priorities of the data science team, they may become frustrated by a perceived lack of attention or output. These dynamics can lead to business units under-utilizing data science teams or treating them as a support function rather than as partners.
At the other end of the spectrum, in a decentralized (or diffused) model, data scientists report to individual business units throughout a company. As full members of the business teams, data scientists in decentralized organizations are generally better equipped with the context and buy-in to be effective problem-solvers and thought partners. Not surprisingly, this comes at a cost. When data scientists are spread throughout the organization, sharing of knowledge, best practices, and insights across teams becomes more difficult and less likely to happen organically. In a field that is constantly evolving like data science, this collaboration and sharing of ideas isn’t just a “nice-to-have.” It’s a necessity.
While a purely centralized or a purely decentralized model may work well for a period of time under a particular set of circumstances, the drawbacks of both are simply too great to provide a solid foundation for lasting success. To maximize effectiveness in the long-term, data science teams must find a way to be simultaneously connected closely with the business units and with each other. Some organizations are able to achieve this while maintaining either a centralized or decentralized team by putting mechanisms in place that foster coordination and alignment both within data science teams and between data science and business teams. Many organizations, though, are moving towards hybrid structures that combine elements of both centralized and decentralized teams.
The right degree of centralization and the best mechanisms to simultaneously achieve intra- and inter-team connectedness depend on factors such as the overall organizational structure and culture and the maturity of the data science team. Leaders should be mindful of the warning signs that a data science team is shifting too far in the direction of either centralization or decentralization and ready to intervene and correct course when necessary.
2 – Optimizing Solely for Model Accuracy
Just as a data science team cannot fulfill its purpose if isolated from the larger organization, a data science model developed without full understanding and consideration of the business context will have little or no value to the business.
Data scientists are problem solvers by nature, trained to test their models rigorously in a quest for the model that best fits the data. Yet the most accurate model may not be the best model for the business case, for any number of reasons. For instance, a model that is highly predictive on historical data may be more harmful than helpful if the population whose behavior the business is trying to predict is not sufficiently similar to the population used to train and test the model. Beyond this, even a model with high accuracy scores that succeeds at addressing the business question cannot achieve its purpose of impacting the business if it is not deployable. This could be due to a variety of reasons, including technical complexity, cost to implement, or even legal or ethical considerations.
By way of example, consider the “Netflix Prize,” announced by Netflix in October of 2006. This open competition offered a $1M payout for any person or team who could come up with an algorithm that delivered a 10% improvement in the accuracy of Netflix’s recommender engine, which was then based on straightforward linear models. The competition went on for nearly three years until one team finally achieved the 10% improvement in September of 2009. The winning solution, which was actually a combination of hundreds of different algorithms, was the culmination of years of work by developers collaborating across the globe. However, Netflix never implemented the solution after determining that the accuracy gains did not justify the engineering work required to implement it.
Netflix did deploy a new model based on other work that came out of the competition. The leading team at the end of the first year of the competition delivered an 8.4% accuracy improvement using an ensemble of 107 algorithms. Because this ensemble was still too complex to put into production, Netflix implemented a linear blend of its two best-performing underlying algorithms, which delivered a 7.6% improvement in accuracy.
In this example, the best business solution was the one that balanced model fit with the technical complexity and cost of implementation. This solution delivered roughly three-quarters of the accuracy improvement of the winning model (7.6% versus 10%) but had far more business value because it could actually be deployed. Further, it was developed in a fraction of the time of the winning model.
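For readers unfamiliar with the term, a linear blend simply combines the predictions of a few models using fixed weights. The sketch below is purely illustrative and is not Netflix's implementation: the ratings are synthetic, the two "models" are just the true rating plus random noise, and the function names are hypothetical. It fits the two blend weights by ordinary least squares and compares the error of each model with the error of the blend.

```python
import math
import random


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def fit_two_model_blend(pred_a, pred_b, actual):
    """Least-squares weights (w_a, w_b) minimizing sum((w_a*a + w_b*b - y)^2).

    Solves the 2x2 normal equations directly with Cramer's rule.
    """
    s_aa = sum(a * a for a in pred_a)
    s_bb = sum(b * b for b in pred_b)
    s_ab = sum(a * b for a, b in zip(pred_a, pred_b))
    s_ay = sum(a * y for a, y in zip(pred_a, actual))
    s_by = sum(b * y for b, y in zip(pred_b, actual))
    det = s_aa * s_bb - s_ab ** 2
    w_a = (s_ay * s_bb - s_ab * s_by) / det
    w_b = (s_aa * s_by - s_ab * s_ay) / det
    return w_a, w_b


# Synthetic data (an assumption for illustration): "true" ratings on a 1-5
# scale and two hypothetical models whose errors are independent of each other.
rng = random.Random(0)
actual = [rng.uniform(1.0, 5.0) for _ in range(1000)]
model_a = [y + rng.gauss(0.0, 0.9) for y in actual]
model_b = [y + rng.gauss(0.0, 1.0) for y in actual]

w_a, w_b = fit_two_model_blend(model_a, model_b, actual)
blended = [w_a * a + w_b * b for a, b in zip(model_a, model_b)]

rmse_a = rmse(model_a, actual)
rmse_b = rmse(model_b, actual)
rmse_blend = rmse(blended, actual)
print(f"model A: {rmse_a:.3f}  model B: {rmse_b:.3f}  blend: {rmse_blend:.3f}")
```

Here the weights are fit and evaluated on the same data to keep the sketch short; in practice they would be fit on one set of ratings and checked on a held-out set. The point is that a two-model blend is a few lines of arithmetic, which is a very different deployment burden from an ensemble of over a hundred algorithms.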
While methodologies for measuring model accuracy are taught in any data science course, how to balance model performance with practical considerations such as development time and deployment cost is more nebulous and less frequently taught. Management practices, such as aligning performance evaluation criteria for data scientists with business outcomes and focusing training on business context and deployability considerations rather than just technical skills, can play an important role. Ensuring the connectedness between the business and data science teams discussed in the previous section is also critical. It is much easier for data scientists working in isolation from the business to fall back on inclinations to milk every bit of predictive power from their models without full consideration for the other factors discussed here.
3 – Restricting Data Access
For data science initiatives to succeed, an organization must prioritize using data and objective analysis over intuition to make decisions. That is to say, it must have a data-driven culture. Yet many organizations that invest in building data science teams and claim to have data-driven cultures restrict access to data and data tools to only a few employees, or require business stakeholders to submit a request each time they have a data-related question, regardless of how simple that question may be. By doing so, they unintentionally undermine their ability to get the full value out of the investments they’ve made in data science capabilities.
To quote Jonathan Cornelissen, founder and CEO of DataCamp, “Very few companies expect only professional writers to know how to write. So why ask only professional data scientists to understand and analyze data, at least at a basic level?” When companies treat data access on a need-to-know basis, relegating it to a small number of employees, it causes issues both for the data scientists and for the business stakeholders. Business teams are less equipped to ask well-informed questions, while data scientists can end up spending a disproportionate amount of their time answering simple questions the business would prefer to answer itself if equipped to do so. Further, it can be more difficult to get business buy-in on data science initiatives when business stakeholders don’t have some degree of analytics literacy.
For analytics to be part of an organization’s culture, employees across the organization must be empowered to access relevant data and use analytics tools on their own terms. For this to be the case, not only must employees have access to data and tools, they must also be equipped with at least a baseline level of data and analytics knowledge.
Airbnb is a vocal proponent of data democratization. Despite having over 100 data scientists, their data science team determined a few years ago that to scale its influence, it needed to enable employees across the organization to access and interact with data. They open-sourced numerous data tools and took steps to make data more accessible to all employees. However, adoption of these tools remained relatively low because employees outside of the data science team weren’t educated on how to use the tools or interpret the data. To address this issue, Airbnb built its own Data University with courses designed to be accessible and relevant to all employees. The Data University succeeded in increasing utilization of the available data tools. In the first half year after launching, Airbnb saw a 50% increase in the number of employees who use the data platform on a weekly basis.
Of course, most companies aren’t in the position to launch their own data universities. The good news is that going to these lengths isn’t necessary. The important things are to foster an environment of openness and knowledge-sharing and to equip all employees with the data, tools, and understanding they need to make better decisions in their functions.
4 – Not Establishing a Singular Version of the Truth
Democratization of data can be more harmful than helpful if there is not organizational alignment with respect to how to use and interpret the data.
Imagine that a CEO sits down for a strategic planning meeting with five senior executives and asks the group for last year’s sales as a starting point for the discussion. All eager to impress, they answer simultaneously. However, each gives a different answer to the same, seemingly simple question. Unable to move on to a strategic discussion about how to grow the business without agreeing on a baseline, they spend the remainder of the meeting reconciling their numbers. As it turns out, none of the numbers are wrong. Rather, the discrepancies stem from the different systems from which each executive sourced the data and from definitional choices each made: whether “sales” includes discounts and rebates; whether “last year” means the prior calendar year, the prior fiscal year, or the trailing twelve months; and whether sales are accounted for on a cash or an accrual basis.
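To make the scenario concrete, the following sketch computes "last year's sales" from one small set of transactions under several of the definitions mentioned above. Everything in it is hypothetical: the four transactions, their dates and amounts, and the `Sale` and `total_sales` names are invented for illustration and do not come from any real system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Sale:
    invoiced: date   # date the sale was booked (accrual basis)
    paid: date       # date the cash was received (cash basis)
    gross: float     # list-price amount
    discount: float  # discounts and rebates granted


def total_sales(sales, start, end, basis="accrual", net_of_discounts=False):
    """Sum sales whose recognition date falls within [start, end]."""
    if basis not in ("accrual", "cash"):
        raise ValueError("basis must be 'accrual' or 'cash'")
    total = 0.0
    for s in sales:
        recognized = s.invoiced if basis == "accrual" else s.paid
        if start <= recognized <= end:
            total += s.gross - (s.discount if net_of_discounts else 0.0)
    return total


# Hypothetical transactions straddling the 2023 calendar year.
SALES = [
    Sale(date(2022, 12, 15), date(2023, 1, 5), 50.0, 5.0),
    Sale(date(2023, 2, 10), date(2023, 3, 1), 100.0, 10.0),
    Sale(date(2023, 11, 20), date(2024, 1, 15), 200.0, 0.0),
    Sale(date(2024, 3, 5), date(2024, 3, 20), 80.0, 8.0),
]

calendar_2023 = (date(2023, 1, 1), date(2023, 12, 31))
trailing_12m = (date(2023, 4, 1), date(2024, 3, 31))  # as of 31 March 2024

answers = {
    "calendar year, accrual, gross": total_sales(SALES, *calendar_2023),
    "calendar year, accrual, net": total_sales(SALES, *calendar_2023, net_of_discounts=True),
    "calendar year, cash, gross": total_sales(SALES, *calendar_2023, basis="cash"),
    "calendar year, cash, net": total_sales(SALES, *calendar_2023, basis="cash", net_of_discounts=True),
    "trailing 12 months, accrual, gross": total_sales(SALES, *trailing_12m),
}
for label, value in answers.items():
    print(f"{label}: {value:.2f}")
```

Each of the five figures is a defensible answer to "what were last year's sales?", and no two of them agree, even though they all come from the same four records.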
While this example may sound trite, it is more common than not to have differing views of performance metrics within a company. This not only results in wasted time reconciling numbers, as in the example above, but also undermines the ability of the organization to make data-informed decisions by eroding confidence in the data itself. If business stakeholders can’t align on current performance, it’s unlikely they’ll be able to agree on the best strategies to drive the business forward.
The idea of getting to a singular version of the truth can sound overwhelmingly complex, and many assume that doing so will require an enormous commitment of time and resources to cleanse data and unify disparate systems. This isn’t necessarily the case. In fact, getting to a singular version of the truth does not necessarily require a single data warehouse, though that can certainly make things easier. It does mean establishing guidelines that govern which performance metrics should be used, how each metric is defined, and which source is authoritative for each metric.
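Such guidelines can start out as something very lightweight. As a minimal sketch, assuming nothing about any particular tool, the three elements named above (which metrics to use, how each is defined, and where each comes from) can be written down as a small shared registry. The metric names, definitions, source names, and owners below are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    definition: str
    authoritative_source: str
    owner: str


# Hypothetical registry: each declared metric has exactly one agreed
# definition and exactly one authoritative source.
METRICS = {
    m.name: m
    for m in [
        MetricDefinition(
            name="net_sales",
            definition="Invoiced revenue net of discounts and rebates, "
                       "accrual basis, prior fiscal year",
            authoritative_source="finance_ledger",
            owner="Finance",
        ),
        MetricDefinition(
            name="weekly_active_users",
            definition="Distinct users with at least one session in the "
                       "trailing seven days",
            authoritative_source="product_events",
            owner="Product Analytics",
        ),
    ]
}


def lookup_metric(name):
    """Return the declared definition, failing loudly for undeclared metrics."""
    try:
        return METRICS[name]
    except KeyError:
        raise KeyError(
            f"'{name}' is not a declared metric; agree on a definition first"
        ) from None


metric = lookup_metric("net_sales")
print(f"{metric.name}: {metric.definition} (source: {metric.authoritative_source})")
```

Whether the registry lives in code, a wiki page, or a spreadsheet matters far less than the fact that it exists and that everyone quotes numbers from the declared source.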
The process of determining and defining which metrics to use often involves heated debate and tough decisions weighing many competing perspectives. The goal, however, is not to find a perfect metric. Rather, it’s about alignment. As Jeanne Ross, Director of MIT Sloan School of Management’s Center for Information Systems Research, put it plainly: “Getting to one version of the truth doesn’t have anything to do with accuracy, it has everything to do with declaring it.” As Ross further points out, even when data accuracy issues exist initially, declaring a single version of the truth gives business stakeholders an incentive to improve data quality over time. “Once you tell everyone ‘This is our single source,’ they work pretty hard to make it more accurate.”
Building or scaling a data science team can be a daunting proposition. The success of even the most brilliant team of data scientists trained in the latest and most advanced techniques will be jeopardized if it falls into any of the pitfalls outlined in this article. The good news, though, is that avoiding these mistakes does not require tremendous investments in infrastructure or resources. Rather, the key tools to protect against all of these are hypervigilance along with communication, collaboration, and alignment.