Will Your Database Scale For Big Data?

BI Database

BI tools are playing an increasing role in the data estates of many big businesses. They promise end users self-service reporting and analysis of data, which can make the business more responsive and also free up expensive database administrator (DBA) time.

However, this freedom comes at a cost. BI tools are by their nature unconstrained, allowing the user to ask almost any question of the data. This places increased strain on the underlying database, because this kind of behaviour is very difficult to optimise for. The result, as many businesses are finding, is a very large investment in hardware. The reason for this expense is simple: relational database technology was never designed to operate at this scale of data.

Linear Scalability Is No Longer Good Enough

When relational database technology began to emerge forty years ago, it was based on set algebra. This was fine for small tables, but it was soon swamped by data growth. New relational database technologies sprang up, using indices and aggregates to accelerate queries, so that these databases would scale linearly as data grew.

However, there was a cost. You cannot compute or store every possible index or aggregate, so they had to be chosen in advance of delivery, based on a study of how the database was going to be used. If that study was wrong, or new requirements emerged, the whole design could be compromised and performance would soon become unacceptable.

Of course, if the requirement is to connect to a BI tool, we may be beaten before we even start, because users' behaviour with these tools is so difficult to predict. All the major vendors now offer in-memory versions of their products. However, these are still in-memory relational databases whose performance scales linearly with data size. Put enough data against them and they will either run out of time or run out of memory.

Try A Living Pattern Database Instead

Recently, a new type of database has been developed: the living pattern database. This type of database does not store data in rows or columns; instead, it considers a row to be a point in a living space of patterns. A pattern can be a simple set of possible values for a field, a more complex combination of other patterns, or a group of patterns linked by a common property.
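
Purely as an illustration (the product's internal representation is not public, and the names here are invented for the sketch), a "pattern" in this sense could be pictured as a set of allowed values for a field, with compound patterns built by combining simpler ones:

```python
from dataclasses import dataclass

# Hypothetical sketch only: a pattern as a named set of allowed values for
# one field, plus a compound pattern that combines simpler patterns.
# This is not the living pattern database's actual data structure.

@dataclass(frozen=True)
class ValuePattern:
    field_name: str
    allowed: frozenset  # the possible values this field may take

    def matches(self, row: dict) -> bool:
        return row.get(self.field_name) in self.allowed

@dataclass(frozen=True)
class AndPattern:
    """A compound pattern: a combination of other patterns."""
    parts: tuple

    def matches(self, row: dict) -> bool:
        return all(p.matches(row) for p in self.parts)

# A row is then simply a point that lies inside or outside each pattern.
weekend = ValuePattern("day", frozenset({"Sat", "Sun"}))
london = ValuePattern("store_region", frozenset({"London"}))
weekend_london_sales = AndPattern((weekend, london))

row = {"day": "Sat", "store_region": "London", "value": 42.0}
print(weekend_london_sales.matches(row))  # True
```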

All of the data exists in the pattern space, but it is no longer held in a fixed structure. When it is time to query the data, the living pattern database does not run the query it was given directly; instead, it transforms the query into an equivalent property of the pattern space and then tests whether that property holds. Further, the pattern space responds to the requests it receives by evolving, adding new patterns that predict requirements based on past history. These new patterns accelerate future queries by reducing the amount of computation they require.
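
Again as a rough sketch rather than the vendor's actual design, the idea of answering a query as a property of the pattern space, and of keeping new patterns as the workload reveals them, might look something like this, where a repeated filter query becomes a cheap lookup instead of a scan:

```python
# Hypothetical sketch: answer "how many rows match this filter?" by storing
# the set of matching row ids as a pattern, so repeat queries become a
# property test on the pattern space rather than a scan of the rows.

class PatternSpace:
    def __init__(self, rows):
        self.rows = rows        # the underlying data
        self.patterns = {}      # filter -> set of matching row ids

    def count_where(self, field_name, allowed):
        key = (field_name, frozenset(allowed))
        if key not in self.patterns:
            # First request: compute the pattern once (the expensive step)
            # and keep it, so the space "evolves" with the workload.
            self.patterns[key] = {
                i for i, row in enumerate(self.rows)
                if row.get(field_name) in allowed
            }
        # Subsequent requests are answered from the pattern itself.
        return len(self.patterns[key])

rows = [
    {"day": "Sat", "region": "London"},
    {"day": "Mon", "region": "Leeds"},
    {"day": "Sun", "region": "London"},
]
space = PatternSpace(rows)
print(space.count_where("day", {"Sat", "Sun"}))  # scans once, stores pattern
print(space.count_where("day", {"Sat", "Sun"}))  # answered from the pattern
```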

So what is the catch? Just as with aggregates and indices, there is an infinite number of possible pattern sets, so which ones should be held in memory? At this point one of the beauties of the living pattern database emerges: its pattern space is not static, but live and evolving.

Every pattern set that is put into the pattern space is tagged with two values: its age and the cost to compute it, both measured in milliseconds. The age records how long it has been since the set was last accessed, so it is reset to zero each time a user employs it to answer a query. When the pattern space begins to fill up, a cull takes place, removing the oldest and cheapest pattern sets first; cheap sets can be recomputed quickly if they are ever needed again, so this minimises the impact of the cull on user response times.
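
A minimal sketch of such a cull policy is shown below; the capacity limit, the scoring rule, and all the names are assumptions made for illustration, not the product's actual mechanism:

```python
import time

# Hypothetical sketch of the age/cost cull described above: each pattern set
# carries a compute cost and a last-access time, and when space runs low the
# sets that are both oldest and cheapest to rebuild are evicted first.

class PatternCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # key -> {"data": ..., "cost_ms": float, "last_used": float}

    def put(self, key, data, cost_ms):
        self.entries[key] = {"data": data, "cost_ms": cost_ms,
                             "last_used": time.monotonic()}
        if len(self.entries) > self.capacity:
            self._cull()

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None:
            entry["last_used"] = time.monotonic()  # age resets on every access
            return entry["data"]
        return None

    def _cull(self):
        now = time.monotonic()

        def score(item):
            # High age and low recompute cost both make a set a good
            # eviction candidate, so it gets a high score.
            entry = item[1]
            age_ms = (now - entry["last_used"]) * 1000.0
            return age_ms - entry["cost_ms"]

        while len(self.entries) > self.capacity:
            victim_key, _ = max(self.entries.items(), key=score)
            del self.entries[victim_key]
```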

This works because the time to solve a query when the required pattern sets are present is so short (fractions of a millisecond) that there are plenty of spare cores and cycles for maintaining the pattern space. For example, at a major retailer this type of database supports 4,000 users querying 8 billion rows of sales transaction data with sub-second response times, yet the whole solution runs on hardware costing less than £30,000.

Conclusion

Big data presents an unprecedented challenge to the world of unconstrained Business Intelligence queries. Not only is linear query scaling no longer sufficient, but the triggering of these queries is also outside the control of the IT team that maintains the system.

The performance of BI tools is constrained by the performance of the underlying database, and in particular by its ability to optimise these queries. Although many BI tools come with their own data stores optimised for the tool, these stores do not scale beyond tens of millions of rows.

Linear scalability is no longer good enough in the face of predicted data growth. To use these tools efficiently you need a database whose response times scale better than linearly. Such performance is not possible if the database itself is static, because it is impossible to determine what values will be required before the user interacts with the BI tool. A living pattern database that is agile and continually reorganises its pattern space, much as your brain manages its memory, therefore offers a means to deploy the best of modern BI tools against the huge data sets we are going to encounter in the future.

Matthew Napleton

Since graduating in 2002 with a BA Hons in Drama with a specialism in marketing, Matthew Napleton has worked across many sales disciplines within IT. Starting in internal sales (working with vendors such as Citrix and IBM), he moved to an Account Director role at Dataplex, selling enterprise solutions. Matthew joined Zizo in 2008 as Business Development Manager, before becoming Director of Marketing in 2012. He is responsible for driving the go-to-market strategy for Zizo.