So much talk – but so far not much action. When it comes to big data, everybody agrees on its business potential. However, many projects have yet to get past the pilot stage and still look set to remain in the ‘blue-sky thinking’ slot at the end of the CIO’s annual presentation. Why? Because, despite all the discussions, round tables and seminars, managing and analysing large volumes of different forms of data from multiple sources remains a complex issue.
Implementing and configuring a big data environment can take months. Yet the window of opportunity won’t stay open forever and now’s a good time to learn from the early adopters to avoid some of the pitfalls. Here are 5 tips to get started and do it effectively:
1. Big Data Can Be Small
We’ve all got hung up on data volumes, when often it’s the variety of data that is the real challenge. You may have smaller data sets to exploit, but a broader range of sources and formats to deal with. Be sure to identify all relevant sources, however small, and don’t assume that you have to scale your data computing cluster to hundreds of nodes straight away.
2. All Data Is Valuable
Transactional data used or generated by business applications is the obvious type to use. However, don’t forget the data hidden on servers, desktops or manufacturing systems, or buried in log files, often referred to as ‘dark data’. There is also an even more obscure type: data generated by sensors and logs that is usually purged after a set period, the ‘exhaust fumes’ of your operations. Deploy collection mechanisms for both of these data types so that they also contribute value.
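To make the idea concrete, here is a minimal sketch of such a collection mechanism: it parses raw log lines into structured records before they would be purged. The log format, field names and threshold metric are all hypothetical, chosen only to illustrate the pattern of turning ‘exhaust fumes’ into usable data.

```python
import re
from datetime import datetime

# Hypothetical log format: "2024-03-01T12:00:00 WARN disk_usage=91"
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<metric>\w+)=(?P<value>\d+)"
)

def parse_log_line(line):
    """Turn one raw log line into a structured record, or None if it doesn't match."""
    match = LOG_PATTERN.match(line.strip())
    if not match:
        return None
    record = match.groupdict()
    record["ts"] = datetime.fromisoformat(record["ts"])
    record["value"] = int(record["value"])
    return record

def collect(lines):
    """Keep only the lines that parse cleanly -- the 'exhaust fumes' worth retaining."""
    return [r for r in (parse_log_line(l) for l in lines) if r is not None]

sample = [
    "2024-03-01T12:00:00 WARN disk_usage=91",
    "malformed line",
    "2024-03-01T12:05:00 INFO disk_usage=47",
]
records = collect(sample)
print(len(records))  # 2 structured records survive
```

In practice the collector would write these records to durable storage on a schedule shorter than the purge window, so the data survives long enough to contribute value.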
3. Some Data Can Stay Put
Hadoop is a great storage resource for large data volumes (and it is in itself distributed across clusters). But think ‘distribution beyond Hadoop’. You don’t always need to duplicate and replicate everything. Some data is already in the enterprise data warehouse and can be accessed quickly. Some of it might be better off staying in the location where it was produced. You can apply the ‘logical data warehouse’ concept in the big data world just as in traditional data management.
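The logical-data-warehouse idea can be sketched in miniature. A real deployment would use a federation or virtualisation layer across separate systems; the sketch below uses SQLite’s `ATTACH` (from Python’s standard library) purely to illustrate querying data where it lives rather than copying it. The tables and values are invented for the demo.

```python
import sqlite3

# First "source": an orders store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 99.5), (2, 11, 20.0)])

# Second "source": a separate CRM database, attached rather than copied.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(10, "Acme"), (11, "Globex")])

# One query joins across both stores -- nothing was duplicated or replicated.
rows = conn.execute(
    "SELECT c.name, o.total "
    "FROM orders o JOIN crm.customers c ON o.customer_id = c.id"
).fetchall()
print(rows)
```

The design point is the join itself: each data set stays in its own store, and only the query results move.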
4. Explore New Processing Resources
Hadoop is not only a repository. It is also an engine that gives businesses the potential to process data and extract meaningful information. A broad ecosystem of tools and programming paradigms exists, covering most data-manipulation use cases. From MapReduce to Spark, or from Pig to SQL-on-Hadoop, there are processing resources available that eliminate the need to move data out of the platform.
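To show what the MapReduce paradigm actually involves, here is a single-process sketch of its three phases (map, shuffle, reduce) as a word count. Real engines such as Hadoop MapReduce or Spark distribute these same phases across a cluster; this toy version only illustrates the programming model.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in a line of text."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each group, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data can be small", "all data is valuable"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, lines))))
print(counts["data"])  # 2
```

Because the map and reduce functions are independent of how records are partitioned, the same logic scales from one machine to many, which is precisely why the paradigm underpins so much of the Hadoop ecosystem.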
5. Just Get Started
The best way to learn big data is to experience it. There are now sandbox platforms available that provide out-of-the-box virtual big data environments with all the tools needed to start integrating big data straight away. These can include video tutorials, pre-built connectors for building prototypes and an open online community, all helping users to start scoping tasks and generating code with graphical tools that are far faster than hand coding.
As the value of big data becomes more widely exploited, the market is responding. This looks set to play a valuable role in helping to move projects from the sandbox into production, and to do it rapidly so that users can start to reap the benefits straight away. It could be time for big data implementations to finally see the light of day.