The Data Balancing Act

It is the much sought-after, but frustratingly elusive, holy grail of business intelligence: the ‘single version of the truth’. Sourcing data, addressing quality issues, integrating disparate sources, enforcing key business rules, curating master data, and making sure it all happens consistently within a batch update window so the business has the information it needs to operate: together, these tasks devour many hours.

While the tools and practices have improved and adapted to become more agile over the years, adding a new data set to the warehouse doesn’t happen fast if you put it through even the basics of these practices. Never mind the fact that the request is likely to be queued up behind other requests the data warehouse team is handling.

So how do you tackle this? If you’re a business analyst, you just bypass the data warehouse team and bring your own data (BYOD). Sometimes you pull from the warehouse, sometimes from the source systems, sometimes from a spreadsheet. You fix data issues yourself. Need data enrichment? You enrich the data yourself. What’s the worry?

There’s validity to allowing analysts to BYOD. Some business decisions can’t wait. For many questions, data that is ‘close enough’ is good enough. But the downside here is twofold. First, analysts often don’t have the tools they need to deal with the data efficiently. Second, you very often get duplicated effort among analysts and inconsistencies between data sets, so you can never be sure whether one analysis is really comparable to another. Did the number really change? Or just the way the data was prepared?

So how do you balance the need for dependable data against the need for decision speed?

You need to:

  • Define what correct means
  • Manage the Data Maturity Lifecycle
  • Enable analysts within the process

Understanding What ‘Correct’ Means

Getting the data ‘correct’ isn’t a one-size-fits-all affair. The meaning of ‘correct’ depends on the requirements. Different aspects of ‘correct’ are trade-offs against each other, and very few data sets require hitting all of these aspects at once. Getting the data correct can imply:

  • Accuracy: The number reported accurately reflects what happened. Accuracy tends to be a major concern for regulated reporting.
  • Timeliness: The amount of time that can elapse between an event and the associated data being reported. Timeliness tends to be a major concern when a lag in decision making has a significant effect.
  • Consistency: The number reported is consistent with source systems, related business systems and with numbers previously reported. This is both in terms of reporting a consistent value and having a consistent definition. Any comparative analysis (year-over-year) demands consistency.
  • Quality: Quality can be thought of as a subset of accuracy, but it involves some specialised concepts. Quality encapsulates things like getting addresses correct and ensuring that fields follow business rules, data is not duplicated and records aren’t orphaned. If actions are directly affected by data quality, this becomes a higher concern.
  • Performance: Can I get a response to my question before I forget what my question was? As more people use the system, will performance scale? This is one of the largest efforts in warehousing: modelling and tuning the data to ensure performance.
  • Security: After working so hard to make data available, you also have to make sure it isn’t ‘too available.’ This ranges from helping users cut through the noise to preventing insider trading and other legal issues.

For any data set, everyone needs to be clear on what ‘correct’ means. And if the business doesn’t require one of these attributes, don’t impose it unnecessarily. It will slow you down in meeting requirements.
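As a rough illustration of the ‘Quality’ item above, and of not imposing an attribute the business hasn’t asked for, here is a minimal Python sketch. The function names, the `orders` and `customers` records and the field names are all hypothetical, invented for this example. It checks for duplicated and orphaned records, but only when ‘quality’ is among the attributes required for the data set.

```python
def find_duplicates(rows, key):
    """Return the key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


def find_orphans(child_rows, parent_rows, foreign_key, parent_key):
    """Return child rows whose foreign key matches no parent row."""
    parent_ids = {parent[parent_key] for parent in parent_rows}
    return [child for child in child_rows if child[foreign_key] not in parent_ids]


def run_checks(required, orders, customers):
    """Run quality checks only if 'quality' is a stated requirement."""
    if "quality" not in required:
        return {}  # not required for this data set, so spend no effort on it
    return {
        "duplicate_order_ids": find_duplicates(orders, "order_id"),
        "orphaned_orders": find_orphans(
            orders, customers, "customer_id", "customer_id"
        ),
    }


# Hypothetical sample data: one duplicated order and one orphaned order.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]

report = run_checks({"accuracy", "quality"}, orders, customers)
print(report)
```

A data set whose only stated requirement is timeliness would pass `{"timeliness"}` and skip these checks entirely, which is the point: the checks follow the requirement, not the other way round.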

Less Is More

The first mistake is trying to apply ‘too much correctness’ to a data set when it is not required. The second mistake is trying to get there all at once. One of the key lessons I’ve taken from my agile programmer colleagues is the idea of a ‘minimum viable product.’ Basically, the idea is that you don’t have to get to the final state all at once. You just need to deliver enough value now to meet the immediate requirement, and then mature the implementation in response to user feedback. What does this mean with data?

When working with new data sets, requirements are volatile. The analysts are still learning what they can do with the data. I take the requirements they have, and I try to implement them with the least amount of overall ‘processing’ in the initial pass. I will do as little transformation, quality, modelling, tuning, metadata building and other data work as possible. I draw the shortest path from data source to initial view of the data for my end users and deliver it. Direct access to source system data? If it works for the initial requirement, yes.

It will take some sophistication to help users understand which data sets are ‘fully fledged’ and which ones are ‘still baking,’ but the agility it brings is worth it. As requirements mature, I mature the implementation of the data set. In such an approach, you initially sacrifice aspects like scalability or fully conformed dimensions, but gain agility in meeting emerging business needs. Yes, you’ll need to go back in future iterations to address scalability, conforming, etc., but you’ll be more informed by actual usage when you do. You’ll also feel better when the analysts see something for the first time and say “Yeah … that wasn’t it.”
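One lightweight way to help users tell the two kinds of data set apart is to carry a maturity label alongside each data set name. The sketch below is only an assumption about how that might look: the `Maturity` enum, the `catalog` dictionary and the data set names are hypothetical, and the two stages simply reuse the ‘fully fledged’ and ‘still baking’ wording from the text.

```python
from enum import Enum


class Maturity(Enum):
    STILL_BAKING = "still baking"
    FULLY_FLEDGED = "fully fledged"


# Hypothetical catalogue mapping data set names to their current stage.
catalog = {
    "sales_conformed": Maturity.FULLY_FLEDGED,
    "web_clicks_raw": Maturity.STILL_BAKING,
}


def label(name):
    """Return the display name users see, including a maturity warning."""
    stage = catalog[name]
    if stage is Maturity.FULLY_FLEDGED:
        return f"{name} [{stage.value}]"
    return f"{name} [{stage.value}: definitions and performance may change]"


def promote(name):
    """Mark a data set as fully fledged once its requirements have matured."""
    catalog[name] = Maturity.FULLY_FLEDGED


for dataset in catalog:
    print(label(dataset))
```

Calling `promote("web_clicks_raw")` in a later iteration changes only the label, so reports built on the data set keep working while users see that its status has changed.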

Know Your Data Sources

Business analysts become more data savvy every day. The reality is that the warehouse team should focus on delivering the high-value, hard-to-deliver data sets to support analysts, then allow them to use the tools and techniques they prefer to conduct their analyses. And, when needed, they should be able to bring their own data. In fact, these analysts serve as a kind of R&D department for the warehouse team, identifying emerging requirements.

So how do you avoid ‘bad data’ from ‘untrusted sources’? With BYOD, the focus should shift away from the data to who is bringing the data. Business leaders, with some ‘advice and consent’ from the BI team, should identify those analysts who have the skills, business knowledge and past performance that indicate they can be trusted to combine standard warehouse data with new data sources, advise business leaders on how the data can be used given concerns like accuracy, and generally be trusted to deal with data at a sophisticated level. They become ‘certified’ and are given the authority (and responsibility) for authoring new data sets. And as their data sets mature, we bring them into the warehouse.

Analysts who haven’t yet attained this level of trust should also be allowed to create data, but the business must recognise that any data set they present should be treated with scepticism. It could still be valuable and lead to some great insight, but one of the certified analysts may need to follow up before a big decision is made based on that information. Many organisations get bogged down in ‘single version of the truth’ fervour, which ultimately does not serve the needs of the business. The trick is to balance high-quality data services with highly agile BYOD practices.

Charles Caldwell

Charles Caldwell is the Director of Solutions Engineering and Principal Solutions Architect for Logi Analytics.