Imagine you have a warehouse with several doorways. A truck comes in and dumps several tons of rice grains on the floor, making a pile so high that you cannot see over the top. You are told that in this pile there are some small diamonds, a few titanium strips and a lot of small gold bars. All you have to do is find them. Unfortunately, while you are doing this, the rest of the doors open and even more trucks arrive and dump more rice grains on the floor. It is going to be almost impossible to find the valuable items amongst so many grains of rice.
This is the scenario that many companies face when trying to find valuable information among the mass of data that arrives at their company every day. In a business context, this mass of information is called Big Data.
Big Data has capital letters for a reason. With smartphones, websites, social marketing and phone support all collecting data, the mass of information that even a medium-sized company amasses is growing every day.
What challenges does a company face when trying to find these valuable nuggets that will make them more profitable and enable them to grab market share?
VoucherBin.co.uk, the UK's leading voucher site, recently reviewed the volume of Big Data that they amass each day and found that there are several major problems associated with making this information profitable.
Storing the information: Even though the cost of storage has tumbled in the last few years and you can now keep 1TB of data on a small flash drive, the costs and technicalities of storing Big Data in a format that can be easily accessed are enormous. They fall into several areas: resilient hardware and networks to store the information; the appropriate software; and the costs of keeping this technical platform cool, updated and constantly available. Conventional relational (SQL) databases require vertical storage capacity, with a single server hosting a single database: as your database grows, so must your server. Splitting a database across servers, known as "sharding", can be achieved, but it typically involves a SAN (Storage Area Network), a complex high-speed network of storage devices that connects to the servers and provides block-based storage. Sharding also requires complex coding to distribute the data evenly over the servers, route each query to the correct shard and then aggregate the results obtained. The databases must be designed so that joins remain natural and appropriate, with the data requirements balanced and optimally fast. Sharding is complex to set up, increases the exposure to data attacks and, if not set up correctly, decreases resilience and negates many of the benefits of a relational database.
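The scatter-and-gather routine that manual sharding forces on developers can be sketched in a few lines. This is a hypothetical illustration in Python, not real database code: the in-memory `shards` lists and the `put` and `scatter_gather` helpers are invented for the sketch, standing in for separate database servers.

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]  # stand-ins for separate servers

def shard_for(key: str) -> int:
    """Pick a shard by hashing the key so records spread evenly."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, record: dict) -> None:
    """Route the record to the shard its key hashes to."""
    shards[shard_for(key)][key] = record

def scatter_gather(predicate) -> list:
    """Send the query to every shard, then aggregate the partial results."""
    results = []
    for shard in shards:
        results.extend(r for r in shard.values() if predicate(r))
    return results

# Usage: spread 1,000 orders over the shards, then find the large ones.
for i in range(1000):
    put(f"order-{i}", {"id": i, "value": i % 500})

large_orders = scatter_gather(lambda r: r["value"] > 495)
```

Every query here must fan out to all four shards and merge the answers itself; in a real deployment this routing and aggregation logic, plus rebalancing when servers are added, is exactly the complexity the article describes.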
Ensuring the data is safe and resilient: The UK's Data Protection Act requires customer data to be kept secure, and commercial activity demands resilience within a technical platform. The more complex a technical platform is, the more difficult and expensive it is to provide both security and resilience. The loss of the seemingly innocuous and sturdy BBC website to a Denial of Service (DoS) attack in December 2015 is just one example of what can happen when a determined attacker targets a company.
Finding the appropriate information: Many senior IT managers and consultants will remember the rush to build massive data warehouses at the beginning of this century, and how most of those projects were doomed to failure because the software and hardware simply could not hold so much information and still find just the wanted nuggets. The doomed ID Card project and the multiple DWP projects that have cost the UK government hundreds of millions of pounds because the requirements were too big for the technical platform are just a few of the more public examples; many other companies quietly closed their projects and took massive cost write-offs. In the last few years relational databases have been superseded by NoSQL (Not only SQL) databases, and the stranglehold that Oracle and SAP have on the database market is being challenged by companies such as MongoDB, DataStax, Redis Labs, MarkLogic and even Amazon Web Services. NoSQL databases have a number of advantages. They:
Can handle large amounts of structured, unstructured and polymorphic data.
Are ideal for rapid development using Agile sprints that produce new code within days rather than months.
Support applications that are always on, scale to millions of users and handle data from multiple devices.
Enable auto sharding, removing much of the complexity of both the technical platform and its development.
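The first of those advantages, handling polymorphic data, can be sketched without any real database. In this hypothetical Python illustration, a plain list stands in for a document collection, and the `insert` and `find` helpers are invented for the sketch; the point is that each record can carry a different set of fields, with no schema migration required:

```python
collection = []  # stand-in for a schemaless document collection

def insert(doc: dict) -> None:
    """Store a document of any shape; no fixed schema is enforced."""
    collection.append(doc)

def find(**criteria) -> list:
    """Return documents whose fields match every criterion given."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

# Three customer "events" with different shapes -- a relational table
# would force them all into one rigid column layout.
insert({"type": "web_visit", "page": "/vouchers", "device": "mobile"})
insert({"type": "support_call", "duration_s": 240, "agent": "A12"})
insert({"type": "purchase", "basket": ["SKU1", "SKU2"], "device": "mobile"})

mobile_events = find(device="mobile")  # matches the visit and the purchase
```

Because documents with new or missing fields coexist in one collection, an Agile team can ship a feature that records a new attribute without the schema-change ceremony a relational database would demand.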
Big Data management has gone through massive change in the last two years and will continue to do so for a few years to come. The likes of Amazon, who must manage their own Big Data on a massive scale, are offering Cloud and NoSQL database facilities (Redshift and DynamoDB) on an enterprise scale, and other major companies such as Google (BigQuery) are rushing to gain market share. Microsoft and Teradata are frantically developing their own offerings, competing with new start-ups such as the unicorn Snowflake.
There are still two big problems that may not be immediately apparent:
Deciding what information you need: A company using NoSQL can at least change its mind about what information it wishes to extract, but prudence suggests engaging skilled and experienced Big Data consultants to identify information priorities.
Employing the appropriate skills to manage and extract your Big Data: This is new technology, which means that staff who understand and can develop these systems are difficult to find and even harder to keep hold of. Many companies have to train their own staff and remunerate them with eye-watering salaries in order to retain them.
From the above it is easy to see that managing and profiting from Big Data is still a new, quickly developing art. However, those companies that manage their Big Data effectively will profit greatly from this new knowledge.