Why Is Everyone So Hyped Over Big Data?

The security industry has realized that log data is an incredibly rich source of information for detecting security intrusions, and has since developed a taste for more and more logs.

Log correlation followed as IT professionals realised that individual log entries meant very little by themselves, but when placed in context against one another illustrated more than just system-level events. They illustrated behavioral context: clusters of individual log lines that could be translated into records of human-readable actions.

Security is still in the early days of this science and practice of event correlation: Methods and results are rarely shared with the community, the target for what is effective keeps moving, and yet we’re already talking about Big Data.

Terror and Possibility

This is, of course, the intersection of terror and possibility, as we transition from our first fumbling attempts to boil the ocean into a land populated by people who have been doing this stuff since long before us. Vast databases of information being mined for emergent patterns and used to run simulations over and over are hardly new to the world: the finance, medical and aerospace industries have spent years in this realm.

How is it, then, that the security world has not tapped into this pool of expertise before now to help us glean the knowledge lying dormant within our vast supplies of data? Quite simply, it’s because we still don’t know what questions to ask in the first place.

What’s Out There?

It’s worth a short recap of the emerging Big Data technologies and why they differ from being just “large databases.” Although there are many implementations of these technologies, they all derive from two core concepts: NoSQL and MapReduce.

NoSQL is a difficult beast to define even among the experts in that field. What you need to know up front as a security practitioner, however, is that NoSQL can be defined by:

  • Lack of strongly structured schemas. Unlike an RDBMS, where the schema must be well defined before data is stored and changes to that schema become increasingly infeasible once live data is present, NoSQL data stores may freely adapt the nature of the records they store over time.
  • As the name implies, SQL is generally not the language used to retrieve information from these systems; many instead accept queries written in JavaScript or as JSON-style documents (with records stored as JSON or BSON).
  • They are optimized for rapid retrieval of information at the possible expense of consistency of data (they typically do not provide full ACID guarantees). In other words, they are excellent systems for analytical work but have inherent issues if treated as the authoritative repository.
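
As a rough illustration of the first two points, the sketch below keeps records of differing shapes in one collection and filters them with a query that is itself a JSON-style document. The field names and the `matches` and `find` helpers are hypothetical; this is plain Python standing in for a document store, not the API of any particular product.

```python
# Hypothetical log records: no shared schema, fields vary per source.
records = [
    {"source": "firewall", "src_ip": "10.0.0.5", "action": "deny"},
    {"source": "proxy", "src_ip": "10.0.0.5", "uri": "/login", "status": 200},
    {"source": "auth", "user": "alice", "result": "failure", "src_ip": "10.0.0.9"},
]


def matches(record, query):
    """Return True if every key in the query document equals the record's value."""
    return all(record.get(key) == value for key, value in query.items())


def find(collection, query):
    """Return the records that satisfy a JSON-style query document."""
    return [r for r in collection if matches(r, query)]


# The query is just another document, not a SQL statement.
hits = find(records, {"src_ip": "10.0.0.5"})
sources = [r["source"] for r in hits]
```

Note that the proxy record carries `uri` and `status` fields the firewall record lacks, and nothing had to be migrated to allow that.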

Likewise, for the same audience, MapReduce’s key features are:

  • The ability to perform information retrieval and calculation over widely distributed data storage. As a practical example, if individual devices had their log storage implemented in a MapReduce-capable manner, a centralized log storage mechanism might no longer be required: a single query could be performed across all logs on all devices simultaneously.
  • Conversely, centralized storage may still exist but be spread out over a computing grid of commodity hardware (indeed, this was the reason for Google’s (Nasdaq: GOOG) creation of MapReduce).
  • Generally speaking, there is comparatively little need for the end-user to optimize their query sets to take advantage of MapReduce’s distributed nature.
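
To make the map and reduce steps concrete, here is a minimal single-process sketch, assuming each device holds its own shard of (client IP, URI) log entries. The shard names, the data and the helper functions are all hypothetical, and a thread pool merely stands in for separate nodes; a real MapReduce framework would handle distribution, shuffling and fault tolerance itself.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-device log shards; in a real deployment each shard
# would live on a different machine.
shards = {
    "web01": [("10.0.0.5", "/login"), ("10.0.0.7", "/admin")],
    "web02": [("10.0.0.5", "/admin"), ("10.0.0.9", "/login")],
    "proxy01": [("10.0.0.7", "/admin")],
}


def map_shard(entries):
    """Map step: emit (uri, client_ip) pairs from one shard."""
    return [(uri, client_ip) for client_ip, uri in entries]


def reduce_pairs(mapped_outputs):
    """Reduce step: group the emitted pairs into uri -> set of client IPs."""
    clients_by_uri = defaultdict(set)
    for pairs in mapped_outputs:
        for uri, client_ip in pairs:
            clients_by_uri[uri].add(client_ip)
    return dict(clients_by_uri)


def run_query(all_shards):
    """Map every shard independently, then reduce the combined output."""
    # Threads stand in for separate nodes here.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_shard, all_shards.values()))
    return reduce_pairs(mapped)


clients_by_uri = run_query(shards)
```

The caller writes one query and never says which shard to visit, which is the point of the last bullet above.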

So, we can immediately see some of the reasons these two technologies have raised excitement and shown promise in the information security world:

  • Increased speed on complex queries across large quantities of data is a vital force-multiplier for security analysts; the ability to query every machine that has accessed a particular URI in the last 90 days in minutes (not hours or even days) cannot be overlooked.
  • The flexibility to bring additional data to supplement existing records works in lockstep with the inherent nature of security information: that it is comparatively a domain of unstructured data. Freedom from data schemas that fail to take into account the information that is vital to the organization we are trying to defend will allow us to make better correlations and ask better questions of our data.
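
Both points can be sketched together in a few lines. The example below assumes a handful of proxy events and a separate asset inventory (both invented for illustration), merges whatever context is known into each event without any schema change, and then asks which hosts touched a given URI in the last 90 days. The helper names and the fixed reference date are assumptions made so the example is repeatable.

```python
from datetime import datetime, timedelta

# Arbitrary fixed reference time so the example is repeatable.
now = datetime(2024, 1, 1)

# Hypothetical proxy log events and a hypothetical asset inventory.
events = [
    {"host": "ws-014", "uri": "/payload.bin", "time": now - timedelta(days=3)},
    {"host": "ws-022", "uri": "/payload.bin", "time": now - timedelta(days=120)},
    {"host": "ws-031", "uri": "/index.html", "time": now - timedelta(days=1)},
]
asset_inventory = {
    "ws-014": {"owner": "finance", "criticality": "high"},
    "ws-031": {"owner": "engineering"},
}


def enrich(event, inventory):
    """Return a copy of the event with any known asset context merged in."""
    return {**event, **inventory.get(event["host"], {})}


def hosts_for_uri(records, uri, since):
    """Return the records for one URI seen at or after the cutoff time."""
    return [r for r in records if r["uri"] == uri and r["time"] >= since]


enriched = [enrich(e, asset_inventory) for e in events]
recent = hosts_for_uri(enriched, "/payload.bin", now - timedelta(days=90))
```

Hosts missing from the inventory simply keep their original fields, so partial context never blocks the query.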

Between these two factors, we can see where the excitement comes from, and yet we still have to return to the same issues we struggled with before the advent of Big Data.

What Do You Want to Know?

We still aren’t very good at asking the right questions of our data. In security analytics, it’s often the relations between the data (not the data itself) that are important. Just as detective work is a matter of “connecting the dots,” the true information lies in the relations between our data points (log correlation itself is about looking for and exposing those relations).
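
One simple way to picture “connecting the dots” is to treat every entity named in an event as a node and every co-occurrence as a link, then walk the links. The sketch below does that over a few invented events; the field names, the `kind:value` node labels and both helper functions are hypothetical, and real correlation engines are considerably more involved.

```python
from collections import defaultdict, deque

# Hypothetical events; each one names the entities it involves.
events = [
    {"user": "alice", "host": "ws-014"},
    {"host": "ws-014", "dest_ip": "203.0.113.7"},
    {"user": "bob", "host": "ws-022"},
    {"host": "srv-db1", "dest_ip": "203.0.113.7"},
]


def build_graph(records):
    """Link every pair of entities that appear together in one event."""
    graph = defaultdict(set)
    for record in records:
        nodes = [f"{kind}:{value}" for kind, value in record.items()]
        for a in nodes:
            for b in nodes:
                if a != b:
                    graph[a].add(b)
    return graph


def related(graph, start):
    """Breadth-first search: everything reachable from the start entity."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            queue.append(neighbour)
    return seen - {start}


graph = build_graph(events)
linked_to_alice = related(graph, "user:alice")
```

No single event mentions both alice and srv-db1; the link only appears once the relations between events are followed.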

As IT professionals, we share a particular reluctance to trust anything we didn’t do hands-on ourselves; as security professionals, this trait becomes magnified. Perhaps it is because the concepts we are looking for (exposures, risks, threat surfaces) are so difficult to define that we are still stuck in the stone age of bar charts and keyword searches when it comes to data analytics.

No amount of Big Data is going to save us until we can learn to formulate better questions for that data. Perhaps it’s time that we accepted that the problems we’re approaching now (trying to boil an ocean of data points into digestible information) are not unique to us. Information security as a discipline may have much to learn from other technology fields. It’s a tough pill to swallow when you think of how much we collectively berate the rest of IT as being the source of all our issues in the first place.

Biology Lesson

I’ll cut to the chase here: bioinformatics. Bioinformatics places emphasis on discovering the nature of interactions and relations between its points of data, since this is intrinsic to how biology operates too. It won’t take long before you find a plethora of advanced (and aesthetically pleasing) visualization techniques being used to present and explore data relations, like the Circos system.

Bioinformatics has made great strides in distilling complex data relationships into advanced visualizations that make the most of human pattern recognition to draw inferences that are difficult to make programmatically.

Ask better questions, discover relationships, create hypotheses and test them against more data; rinse, repeat — the scientific method. Big Data will not magically enable us to discern better answers until we come up with better questions to explore the relationships between our data more thoroughly.

The field of log correlation could make great strides if we were to establish an open format for exchanging ideas for correlations in a vendor-neutral manner and collectively discuss what is effective within the field, rather than operating as we do today.

Information security is evolving into areas well explored within other fields. Our issues with discovering relations and implications from our oceans of unstructured data are at the heart of the field of complex event processing.

We’re moving into territory where we are not as alone as we think; if we are going to reap the benefits that Big Data promises and not let this become another failed fad, then we have to start overcoming our isolationist attitude and start inviting experts from other disciplines to join us and teach us how to use this new toolset.

Conrad Constantine is Research Team Engineer at AlienVault. Mention “Commodore Amiga” around Conrad and watch him get misty-eyed for the ‘good old days’ of the computer underground and the Demo-writing scene. With an early background in searching for forbidden knowledge, pushing computing hardware to its limits and a nose for the truth, Conrad was born for a career in Incident Response. Over the last decade and a half, he has been on the front lines of defence work in telecom, medical and media corporations, not least at ground zero for the 2011 RSA Breach. Conrad’s a firm believer that incident response must become an accessible and effective discipline, available to all. He’s striving to bring the mysteries of open source intelligence generation, and defensive agility, to those willing to take the leap from fear to action - mostly via the medium of code (with Visio diagrams thrown in for good measure).