Data Governance for Advanced Analytics, Machine Learning & Automation

Data Governance seeks to bring order to the chaos that emerges as organisations turn to data to make decisions. To many, the term “data” is synonymous with “fact”, but data requires context to be properly interpreted. Context varies by department, by business unit, across teams and even within teams, leading to misinterpretation of data and increasing the risk that data is misrepresented. Data Governance defines the rules, processes and policies for how data is collected and managed, so that it can be consistently interpreted and applied to advanced analytics, machine learning and automation.

Analytics platforms like PowerBI, Tableau and Looker put data in the hands of employees and promise better decisions based on real data. But what happens when a colleague’s number is different from yours? Or when people don’t know which version of a dashboard to view? Or when you discover the “Sales metric” you have been making decisions on doesn’t actually include all the things you thought it did?

Machine Learning has the capability to transform all aspects of business, from recommending products to your customers and optimising call routing in your call centre through to segmenting marketing messages. As organisations become more reliant on these technologies, they are more likely to incorporate them deeper into their processes, leaving them vulnerable to prickly edge cases. Machine learning fails are already well-publicised, from the sexist recruitment AI at Amazon to Microsoft’s AI Twitter Bot “Tay”, which lasted less than 24 hours before going full racist.

Automation, especially when combined with Machine Learning, makes organisations more efficient and more scalable. Marketing Automation provides marketers with super-powers to deliver the right message to the right customer at the right time. However, when automation fails (often due to errors in data), the problems are amplified: one mistake can set off a chain of events that can destroy an organisation.

Introduction

To help illustrate the importance of Data Governance, we will use a fictional company, which has run into an equally fictional set of challenges. Forrest Corp is an online book retailer with approximately 100 staff, based in Melbourne, Australia. As with many organisations of this size, Forrest Corp has functional teams responsible for Marketing, Finance, Technology, Customer Service and Operations. Forrest Corp has been wildly successful, with sales exploding throughout COVID-19 lockdowns.

The problem

The executive team is very displeased that gross margins and profitability have deteriorated sharply, and that their personal (managed) Instagram feeds are blowing up with negative messages. They want answers! In speaking with leaders from different teams, they receive mixed messages:

  • Marketing is excited to show a 10% increase in sales due to a Social Media Influencer campaign and finds it difficult to believe things are as bad as they are made out to be
  • Finance says sales are flat, with rising inventories, shipping costs and customer retention costs hurting margins and profitability
  • Customer service is inundated with complaints, but they feel they have it under control using their existing policies and procedures
  • Operations confirm the increase in sales, but their team is overworked in dealing with the influx of sales and a large number of returns.

An analysis by an external analytics consultancy begins to unravel the story:

  • The Marketing team recently established a new Social Media Influencer program, enabling influencers to sign up to earn referral commissions on books they recommend.
  • A very popular Social Media Influencer has recommended a particular book to their large follower base
  • The rapid increase in sales of the book has impacted the company’s recommendation model, leading to it being recommended to a much wider audience of customers
  • It turns out that the Social Media Influencer has misrepresented the book, leading to a high rate of return from dissatisfied customers
  • In order to placate angry customers, Customer Service follows its procedure of sending gifts to aggrieved customers, as recommended by their recommendation engine.
  • The gifts only inflame the situation, because infuriated customers are often sent the same book they initially returned.

How could this all happen?

A failure of Data Governance

The Data Governance Institute defines data governance as “a system of decision rights and accountabilities for information-related processes, executed according to agreed-upon models which describe who can take what actions with what information, and when, under what circumstances, using what methods.”

Data governance is often formally practised in large organisations, but many of the concepts are applicable to all organisations seeking to utilise data in decision making.

In Forrest Corp’s case, a series of failures in the way it collects, manages and utilises data led to the cascading problems illustrated above.

Data Quality

Data quality is the foundation for all analytics and machine learning. As organisations seek to make more decisions on the basis of data, there is a real risk that decisions can be made incorrectly if the data underpinning them is incorrect, incomplete or inconsistent.

A key failure for Forrest Corp’s marketing team was its lack of screening of influencers’ past histories. Social Media Influencers could sign up for the affiliate program easily, with critical pieces of information collected without any verification.

In the Forrest Corp example, when the Marketing team’s analytics began showing the success of one affiliate code, they immediately looked up the influencer behind it. Unfortunately, the URLs behind the profile were unusable. No data was supplied for their Twitter and Facebook handles, and their supplied Instagram handle didn’t actually exist. While contact information was provided, emails and phone calls went unanswered. So the Marketing Team was faced with a challenging scenario: their Social Media Influencer program was clearly generating sales, but they were blind as to the content triggering it or the person behind it.

If data is captured incorrectly, it is often difficult or impossible to correct it. Worst of all, it is often only when an organisation really needs high-quality data that it realises the data was never collected properly. In the Forrest Corp example, they were faced with a massive spike in sales attributable to a single affiliate, without any way to tie it back to a real person.

Data Quality Governance Recommendations

Incentivise data quality at the point of collection

Digging deeper into the sign-up process for affiliates, it was discovered that the affiliate systems were built quickly following a mandate from the Executive Team to get the system up and running prior to a major holiday. Additionally, the Marketing Manager’s key KPIs for the quarter were the number of onboarded affiliates and the sales volume generated by affiliates. The problems with the affiliate signup process were a product of these incentives.

The time pressure to build the onboarding system meant that the initial design could not be implemented in its entirety. As Product Owner, the Marketing Manager had to prioritise which features would make the initial version, and which would be delayed to a future version. Automated verification of affiliates was seen as a large piece of work and thought to add friction to the signup process, potentially hurting the Marketing Manager’s KPIs. It was therefore easy for the Marketing Manager to de-prioritise this feature of the onboarding system.

Guiding principles for incentivizing data quality
  • Key Performance Indicators that create disincentives for data quality should be avoided.
  • Analyse and understand the unintended consequences of performance metrics on data quality.
  • Incorporate data quality metrics into KPIs for Product Owners whose products involve data capture.

Projects that involve data collection require oversight and guidance, and periodic review

Without a process of oversight in how data was captured, the Marketing Manager was free to make decisions that ultimately allowed a misleading Influencer to speak on behalf of the organisation. However, the Marketing Manager is rightly focussed on Marketing, not issues of Data Governance. An organisation can’t and shouldn’t expect all its managers to be Data Governance experts.

Guiding principles for data quality oversight
  • A Data Governance Council should provide advice on issues relating to how projects are implemented and the potential costs and risks of different implementations.
  • Data quality metrics should be established with any project that involves data capture
  • A Data Governance Council should periodically review or audit key systems to see how they are actually being used and assess the quality of the data they capture.
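
As a minimal sketch of what a data quality metric for a data capture project might look like, the Python below computes the share of affiliate records with a missing value for each field. The field names and sample records are hypothetical and are not taken from Forrest Corp’s systems.

```python
from typing import Dict, Iterable, Mapping, Optional

# Hypothetical field names for an affiliate sign-up record.
REQUIRED_FIELDS = ("email", "phone", "instagram_handle")


def missing_rate(records: Iterable[Mapping[str, Optional[str]]], field: str) -> float:
    """Return the share of records where `field` is absent, None or blank."""
    records = list(records)
    if not records:
        return 0.0
    missing = sum(1 for r in records if not (r.get(field) or "").strip())
    return missing / len(records)


def completeness_report(records: Iterable[Mapping[str, Optional[str]]]) -> Dict[str, float]:
    """Return the missing rate for every required field."""
    records = list(records)
    return {field: missing_rate(records, field) for field in REQUIRED_FIELDS}


# Illustrative sample data only.
affiliates = [
    {"email": "a@example.com", "phone": "0400 000 000", "instagram_handle": ""},
    {"email": "b@example.com", "phone": None, "instagram_handle": "@reader_b"},
    {"email": "", "phone": "", "instagram_handle": None},
    {"email": "d@example.com", "phone": "0400 000 001", "instagram_handle": "@reader_d"},
]
report = completeness_report(affiliates)
```

A report like this could be reviewed by the Data Governance Council alongside the project’s other metrics, so that fields which are routinely left blank are noticed before they are needed.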

Understand the pressures on your users when entering data

Equally, in designing the affiliate sign-up form, the Marketing Team failed to understand the pressures on Social Media Influencers. One of these pressures is the need to sign up to a lot of affiliate programs, and the fact that time spent signing up to affiliate programs is time not spent building their brand and following. The Forrest Corp Affiliate Signup process, while failing to verify key information, also asked for a lot of information, much of which seemed fairly unimportant to them getting paid. As a result, a lot of fields were left blank, even critical ones.

Guiding principles for designing data capture systems

Data entry is often the least enjoyable aspect of a process, but it is of critical importance. Systems that collect data need to be user friendly and easy to use. Here are a few principles:

  • Make use of “pick lists” or “autocomplete” to make it easy to link data
  • Make it easy to identify existing entities to reduce the risk of creating duplicate entries
  • Avoid requiring too many fields, or displaying fields that are rarely used. As employees get conditioned to leave fields blank, it will start happening to important fields
  • Avoid overly complicated validation rules, especially those which reject sensible inputs (e.g. mandating dashes in phone numbers)
  • Avoid requiring employees to enter data in multiple systems as a part of a single task
  • Ensure that common tasks can be completed within one screen
  • Consistently name fields between applications to make it clear what is being collected
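
The point about validation rules can be illustrated with a small sketch: rather than rejecting phone numbers that contain dashes, spaces or brackets, the form accepts them and normalises the value before storing it. The function name and the 8 to 15 digit length check are assumptions made for illustration only.

```python
import re
from typing import Optional


def normalise_phone(raw: str) -> Optional[str]:
    """Strip common separators; return the digits (with an optional leading +),
    or None if the result does not look like a phone number."""
    cleaned = re.sub(r"[\s\-().]", "", raw)
    # Assumed rule: 8 to 15 digits, optionally prefixed with "+".
    if re.fullmatch(r"\+?\d{8,15}", cleaned):
        return cleaned
    return None
```

With this approach, “0400-000-000”, “0400 000 000” and “0400000000” are all stored as the same value, while input that contains no usable number is still flagged.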

Data quality summary

It turns out that the affiliate in question had a history of making misleading claims about products they recommended, which while effective at driving sales left customers with unrealistic and incorrect expectations. A cursory investigation of the influencer’s history would have revealed this. However, a lack of data quality about Influencers (due to a poorly implemented sign-up process) made this difficult and it was not done.

Business Intelligence Governance Recommendations

Forrest Corp considers itself to be at the forefront of data-driven decision making. Having recently replaced batch, overnight generation of reports using Crystal Reports with a modern dashboard-based solution using PowerBI, they are rightly proud of their reporting infrastructure. The project was implemented by the existing data team, combined with embedded analytics consultants. The embedded analytics consultants upskilled the existing data team, following a “train the trainer” methodology, enabling the upskilled data team to train power-users across the organisation. Before long, the majority of employees were capable of creating their own dashboards. A great outcome… or not?

Avoid inconsistent metric definitions across functions

Inconsistent rules used to calculate “Sales” were a core driver of the challenges facing Forrest Corp above. Marketing’s Sales metric counted all sales, including those that were later returned (so a $100 sale that was returned counted as a $100 sale even after its return). Finance’s Sales metric, on the other hand, incorporated returned goods correctly (so a $100 sale that was returned counted as a $100 sale, but also a -$100 return).
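
The difference between the two definitions is easy to see in a short Python sketch. The `Transaction` type and function names are hypothetical; the figures mirror the $100 example above, plus one additional illustrative sale.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Transaction:
    amount: float  # positive for a sale, negative for a return
    kind: str      # "sale" or "return"


def gross_sales(transactions: Iterable[Transaction]) -> float:
    """Marketing's definition in the example: sales only, returns ignored."""
    return sum(t.amount for t in transactions if t.kind == "sale")


def net_sales(transactions: Iterable[Transaction]) -> float:
    """Finance's definition in the example: sales plus (negative) returns."""
    return sum(t.amount for t in transactions)


ledger = [
    Transaction(100.0, "sale"),
    Transaction(-100.0, "return"),  # the $100 sale above, returned
    Transaction(40.0, "sale"),
]
marketing_view = gross_sales(ledger)  # 140.0
finance_view = net_sales(ledger)      # 40.0
```

Both numbers are labelled “Sales” on their respective dashboards, yet they differ by the full value of the returned order.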

Throughout the period, Marketing was blind to the fact that 80% of the sales affiliated with the influencer were returned. This meant it was unable to react quickly to the emerging problem because it didn’t know there was a problem. The dashboards that the team was so proud of gave them false confidence that things were going well, when in fact they weren’t. This was obviously incredibly frustrating to the Executive Team and Board of Directors, as time was wasted convincing Marketing that there was a problem.

Guiding principles to avoid inconsistent metric definitions across functions
  • Metrics need to be defined and documented consistently across the organisation, with agreement from all organisational units.
  • Any metric appearing on multiple dashboards should have the same value for the same period
  • New and changed dashboards should be reviewed by the analytics team for opportunities to consolidate reports and to check their consistency
  • Any changes to calculation methodology of metrics should be widely communicated
  • Metric definitions should be easily viewable within any platform that displays them
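
One way the “same value for the same period” principle could be checked automatically is sketched below. It assumes metric values can be exported from each dashboard as (dashboard, metric, period, value) rows; that export format, the function name and the sample figures are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Reading = Tuple[str, str, str, float]  # (dashboard, metric, period, value)


def find_metric_conflicts(
    readings: Iterable[Reading], tolerance: float = 0.01
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Return {(metric, period): {dashboard: value}} for every metric whose
    reported values differ by more than `tolerance` across dashboards."""
    grouped: Dict[Tuple[str, str], Dict[str, float]] = defaultdict(dict)
    for dashboard, metric, period, value in readings:
        grouped[(metric, period)][dashboard] = value

    conflicts = {}
    for key, by_dashboard in grouped.items():
        values = list(by_dashboard.values())
        if max(values) - min(values) > tolerance:
            conflicts[key] = dict(by_dashboard)
    return conflicts


# Illustrative figures only.
readings = [
    ("Marketing", "sales", "2020-06", 110000.0),
    ("Finance", "sales", "2020-06", 100000.0),
    ("Marketing", "orders", "2020-06", 500.0),
    ("Operations", "orders", "2020-06", 500.0),
]
conflicts = find_metric_conflicts(readings)
```

Here only the “sales” metric is flagged, which would prompt the analytics team to reconcile the two definitions.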

Decentralised dashboard creation

A key benefit of systems like PowerBI is that they enable end-users to create their own dashboards and metrics. Within each organisational unit, the power users went about doing just this, creating custom dashboards for their teams. Even within each team, different users wanted subtly different versions of the same thing, tailored to their own use-cases and ways of working. This constant customisation (often completed by end-users) resulted in tens of dashboards purporting to communicate the same message with subtle differences, each with multiple versions.

While the carefully modelled data warehouse reflected Forrest Corp’s “single version of the truth”, the constant customisation of the presentation layer muddied the water, making it difficult to know which dashboard was right.

Guiding principles for decentralised dashboard creation
  • Generally, empowering users to create and modify their own dashboards is a good idea as it improves data literacy & ownership of data
  • Limit the ability for end-users to customize how metrics are calculated
  • Labels matter! Periodically review dashboards to ensure metrics and charts are consistently labelled and have consistent units applied
  • Consolidate dashboards wherever possible

Machine Learning Governance Recommendations

Forrest Corp makes extensive use of machine learning in its recommendation systems. A recommendation system is a powerful way to increase sales by suggesting products and services that a user is likely to respond favourably to based on their previous purchasing behaviour, and the behaviour of others. At Forrest Corp, they were especially proud of their recommendation engine, crediting it with a 23% increase in returning customer purchases. However, clearly, the recommendation engine failed in this scenario, with the constant re-recommendation of the one book.

Constantly review machine learning training data for inconsistencies & skew

Machine learning algorithms seek to optimise predictions based on previously seen data. If that data is skewed, then the model will be skewed. In this case, a key input into the model was the popularity of the book, calculated without subtracting returns, so a book purchased 10 times and returned 2 times counted as 10 sales. However, when preparing a customer’s previous order history, a purchase that was later returned was treated as if the purchase had never happened. This had the peculiar effect of enabling a book that a customer had returned to be recommended to them again.
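
A minimal sketch of how both inconsistencies could be avoided is shown below. It is not Forrest Corp’s recommendation engine; the function names and the (book_id, was_returned) order history format are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def net_popularity(purchases: int, returns: int) -> int:
    """Popularity that subtracts returns: 10 purchases and 2 returns count as 8."""
    return max(purchases - returns, 0)


def eligible_recommendations(
    candidates: Iterable[str], order_history: Iterable[Tuple[str, bool]]
) -> List[str]:
    """Drop every book the customer has ever ordered, whether or not it was
    later returned. `order_history` holds (book_id, was_returned) pairs."""
    already_ordered = {book_id for book_id, _was_returned in order_history}
    return [book for book in candidates if book not in already_ordered]
```

For example, a customer who bought and returned book “A” and kept book “B” would only be offered “C” from the candidates “A”, “B” and “C”.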

The massive increase in recommendations created a feedback loop within the recommendation engine: its recommendations increased sales, which increased the popularity metrics feeding it, which in turn increased recommendations, and so on.

Guiding principles for reviewing machine learning training data for inconsistencies & skew
  • Track and store the inputs and outputs of machine learning models
  • Compare the statistical distribution of features in training data with those seen in real-world predictions. Changes here may indicate that the model’s assumptions are no longer valid, and it needs to be re-trained
  • Alert engineers if current distributions differ from those the model was trained on
  • Be aware of feedback loops, and the potential for a model’s outputs (e.g. new sales) to affect its inputs (e.g. popularity).
  • Track external metrics on the performance of the model (e.g. are people still purchasing recommendations?)
  • Routinely audit the model
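
As a rough sketch of the distribution check described above, the Python below measures how far the mean of a live feature has moved from its training mean, in units of the training standard deviation. This is one simple heuristic among many; the function names, the threshold of 3 and the sample return rates are assumptions, apart from the roughly 80% return rate mentioned earlier.

```python
from statistics import mean, stdev
from typing import Sequence


def drift_score(training: Sequence[float], live: Sequence[float]) -> float:
    """Distance between live and training means, in training standard deviations."""
    spread = stdev(training)
    gap = abs(mean(live) - mean(training))
    if spread == 0:
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


def has_drifted(training: Sequence[float], live: Sequence[float], threshold: float = 3.0) -> bool:
    """True when the live data has moved further than `threshold` deviations."""
    return drift_score(training, live) > threshold


# Hypothetical per-book return rates seen in training versus this week.
training_return_rates = [0.02, 0.03, 0.04, 0.03, 0.02, 0.04]
live_return_rates = [0.75, 0.80, 0.85]
alert = has_drifted(training_return_rates, live_return_rates)
```

In practice the alert would be routed to the engineers responsible for the model, who can decide whether it needs to be re-trained or paused.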

Understand the limitations of machine learning models

Machine learning models tend to optimise for the general case. This means that they can behave in strange ways when given extreme data, which is unlikely to have been in their training dataset. In the Influencer example, return rates were much higher than had ever been seen before, and so the model began acting in unforeseen ways.

Guiding principles for understanding the limitations of the model
  • Understand the range and shape of the data the model was trained on
  • Validate prediction inputs prior to making a prediction, and if the inputs are outside of ranges seen in training, decline to make a prediction
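
The second principle can be sketched as a thin wrapper around the model. The wrapper below is illustrative only: `guarded_predict`, the feature names and the ranges are hypothetical, and the model is represented by any callable that takes a dictionary of features.

```python
from typing import Any, Callable, List, Mapping, Optional, Tuple

FeatureRanges = Mapping[str, Tuple[float, float]]  # feature -> (min, max) seen in training


def out_of_range_features(features: Mapping[str, float], ranges: FeatureRanges) -> List[str]:
    """Names of features whose values fall outside the range seen in training."""
    return [
        name
        for name, value in features.items()
        if name in ranges and not (ranges[name][0] <= value <= ranges[name][1])
    ]


def guarded_predict(
    model: Callable[[Mapping[str, float]], Any],
    features: Mapping[str, float],
    ranges: FeatureRanges,
) -> Optional[Any]:
    """Return the model's prediction, or None when any input is out of range."""
    if out_of_range_features(features, ranges):
        return None  # decline to predict; the caller falls back to a safe default
    return model(features)


# Hypothetical training range and stand-in model.
training_ranges = {"return_rate": (0.0, 0.10)}
stand_in_model = lambda features: "book-123"
```

Declining to predict forces the calling system to handle the unusual case explicitly, rather than acting on an output the model was never trained to produce.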

Recommendations for Automation Governance

Beware of feedback loops between Machine Learning & Automation

Automation, when paired with Machine Learning, can transform businesses. Decisions can be made and actioned in the blink of an eye, enabling new business models and revenue opportunities. However, it can also dramatically amplify problems.

To reduce strain on the customer service organisation, Forrest Corp made it simple to send gifts to upset customers. The gift book was chosen by the recommendation engine and automatically routed for dispatch. This had a transformative impact on the customer service team, reducing the workload of customer service representatives and improving customer satisfaction. However, when the model went wrong, the automation continued like normal, resulting in reputational damage and significant costs in shipping and handling.

Ensure circuit breakers exist between machine learning models and automation

It would have been trivial for customer service representatives to recognize that the book their customers were complaining about shouldn’t be sent to them as a “free gift”. A simple UI tweak to show the title of the free gift with an option to show the “Next recommendation” would not only have prevented the issue but also made the deficiencies in the model obvious from day one.

Guiding principles for ensuring circuit breakers exist between machine learning models and automation
  • Ensure that automation systems validate the output of machine learning models independently before progressing with an action (e.g. checking whether the customer has purchased the book before)
  • Consider adding a human into the loop to “Approve” the execution of automation – which also provides an opportunity to collect data on the quality of predictions
  • Ensure that validation rules for automation are reviewed on a regular basis
  • Set policies for thresholds on the value of decisions that can be made by automation, and those that require human approval.
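
A circuit breaker of this kind can be very small. The sketch below checks a recommended gift against the customer’s order history and a value threshold before anything is dispatched. The `GiftDecision` type, the function name and the $30 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class GiftDecision:
    action: str  # "dispatch", "needs_approval" or "reject"
    reason: str


def review_gift(
    book_id: str,
    gift_value: float,
    customer_book_ids: Set[str],
    approval_threshold: float = 30.0,  # assumed policy threshold
) -> GiftDecision:
    """Validate a model-recommended gift independently of the model."""
    if book_id in customer_book_ids:
        # Covers books the customer bought, including ones they returned.
        return GiftDecision("reject", "customer has already ordered this book")
    if gift_value > approval_threshold:
        return GiftDecision("needs_approval", "gift value exceeds automation threshold")
    return GiftDecision("dispatch", "passed automated checks")
```

A “reject” result would ask the recommendation engine for its next suggestion, while “needs_approval” would route the gift to a customer service representative, which also generates data on the quality of the model’s predictions.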

Implementing Data Governance

Following the detailed Root Cause Analysis (RCA) outlined above, the organisation needed to decide how it would implement the recommendations. A big-bang Data Governance initiative was unpalatable to the organisation, as it had serious concerns about creating a bureaucracy which would serve only to slow down innovation. These concerns are valid, in that any time additional process is added, you add friction to getting work done. This additional friction needs to produce significant value in minimising risks to be of net benefit to the organisation. Consequently, a pragmatic, staged Data Governance implementation was decided upon.

Data Governance Council & Data Stewards

A Data Governance Council was established, comprising Data Stewards from different organisational areas, including Marketing, Finance, Operations, Technology and Customer Support. The goals of the Data Governance Council were defined as:

  1. Improving data quality
  2. Improving the use of data within the organisation
  3. Minimising and controlling data related risks
  4. Minimising and controlling risks related to the operation of automation and machine learning
  5. Minimising the friction caused by Data Governance policies and processes

The Data Governance Council reported directly to the board of directors and was responsible for:

  • Ownership of definitions of business metrics
  • Responsibility for a quarterly data quality report for the board
  • Responsibility for a quarterly dashboard usage review for the board
  • Responsibility for a quarterly machine learning effectiveness report for the board
  • Responsibility for an annual review of management KPI’s to ensure that data quality is prioritised
  • Responsibility for reviewing new technology projects in terms of their impact on data quality, data integration and reporting

To manage the scope of the Data Governance council (and therefore its workload), its remit was limited to core data sets and processes, including:

  • Reporting metrics
  • Customer data
  • Partner data (including Influencers!)
  • Product data
  • All customer-facing machine learning

The remit was set to be reviewed annually by the board to ensure that the Data Governance processes were creating value for the organisation.

Determining the value of Data Governance through metrics

For each of the goals of the Data Governance initiatives, Key Performance Indicators were established to track the progress over time to ensure that progress was being made in the right areas.

Area and KPIs:

Improving data quality
  • The percent of missing data (Automated tooling)
  • Percent of incorrect data (Automated tooling)
  • Employee surveys estimating: Weekly time spent on tasks caused by poor data quality

Improving the use of data within the organisation
  • Monthly Active Users on PowerBI
  • Employee surveys estimating the weekly time spent analysing data in Excel because a relevant dashboard doesn’t exist

Minimising and controlling data related risks
  • The number of unresolved data related issues on the corporate risk register
  • The number of resolved data related issues on the corporate risk register
  • The number of Data related security incidents

Minimising and controlling risks related to the operation of automation and machine learning
  • Machine learning performance metrics
  • Automation related incidents

Minimising the friction caused by Data Governance policies and processes
  • Employee surveys estimating: Weekly time spent on tasks related to data governance processes

While these metrics are far from perfect, when combined they provide a benchmark for establishing a Data Governance Practice where none had existed before.

Data Stewards

Initially, Forrest Corp had wanted the Data Governance function to reside within the Technology division, delivered by the Data & Analytics team. Ultimately this position was rejected because:

  • It would create a perception that data was owned by Technology, rather than the whole organisation, which went against their cultural aspirations to be data-driven
  • Having Technology administer and define process would likely make the rest of the organisation feel that Technology was a source of friction rather than innovation

Forrest Corp decided to incorporate Data Stewardship into the position descriptions for the heads of each business function. Therefore each Data Steward had accountability for progressing all of the Data Governance Goals in their respective domains. The accountability was then delegated to team members within the function, with the role of Data Steward essentially being one of oversight. Each Data Steward had a specific budget to spend on initiatives that were often delivered by the Technology Function and external data consultants.

The Data Governance Council met monthly, providing an opportunity for the different functions to come together to talk data, catching many issues in their infancy before they had time to develop into serious problems.

Conclusion

The formation of the Data Governance Council was initially painful for executives as they grappled with accountability for systems they didn’t really understand. However, this created strong incentives for them to improve their knowledge. The Data and Analytics team worked hard to explain core concepts in plain English to executives, and enjoyed robust debates around controls, process and policies. The engagement between executives and the Data & Analytics team had the benefit of improving confidence in advanced analytics, enabling significant new innovation.

While Forrest Corp is a fictional organisation, the challenges and their solutions are real. Data quality is a key challenge for all organisations and is often a key impediment to advanced analytics and an outright roadblock for many machine learning projects. Modern Business Intelligence (BI) platforms have the power to democratise data and improve decision making, but they can lead to a quagmire of inconsistency if left to grow unconstrained. Just as Machine Learning and Automation can drive innovation and operational efficiencies, they can also amplify problems.

The investment in Data Quality extended beyond the Social Media Influencer program into new areas such as their web analytics. As data quality improved, it became feasible to invest in new Machine Learning approaches to customer segmentation which dramatically improved marketing effectiveness.

As organisations seek to leverage the power of Advanced Analytics, Machine Learning & Automation, at some point they will realize the benefits of data governance. The only question is whether it will take a serious incident to initiate the process, or if they do it in a controlled manner that protects them from incidents in the first place.

About Growing Data

Growing Data is an Australian consultancy based in Melbourne, that helps organisations unlock the full potential of their data. We partner with you on your journey, providing a considered pathway to the future that illuminates the road ahead, reduces uncertainty, and keeps you adaptable to change.
