Many enterprises would like to analyze data across multiple domains in order to arrive at valuable new insights. One way to achieve this goal is by creating a Data Lake, where all data from all domains is dumped as-is into a central repository. Each data element in a Data Lake has a unique identifier as well as a set of metadata tags that indicate the data’s lineage, reliability, etc. When a business question arises, the data lake can be queried to find all relevant data, and that smaller set of relevant data can then be analyzed to help answer the question.
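The mechanics described above can be sketched in a few lines of Python. This is a minimal illustration only, with hypothetical names (`ingest`, `query`, the tag fields): each entry gets a unique identifier plus metadata tags, and a query filters the lake by those tags.

```python
import uuid

# Hypothetical in-memory stand-in for a Data Lake catalog.
lake = []

def ingest(payload, **tags):
    """Store raw data as-is, tagged with lineage/reliability metadata."""
    entry = {"id": str(uuid.uuid4()), "payload": payload, "tags": tags}
    lake.append(entry)
    return entry["id"]

def query(**criteria):
    """Find all entries whose tags match every given criterion."""
    return [e for e in lake
            if all(e["tags"].get(k) == v for k, v in criteria.items())]

ingest({"sales": 1200}, domain="sales", source="crm", reliability="high")
ingest({"temp_c": 71.3}, domain="iot", source="sensor-7", reliability="low")

# When a business question arises, pull only the relevant subset.
relevant = query(domain="sales")
```

A real implementation would sit on Big Data infrastructure rather than a Python list, but the contract is the same: dump everything in, tag it, query the tags later.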
Data Lake projects typically involve two major risks:
- The central Data Lake repository in many cases requires Big Data technology, due to the sheer size of all the accumulated data and its associated metadata. For many enterprises, this will be their first encounter with the world of Big Data. To succeed, the enterprise will have to choose the right architecture, and hire or train Big Data engineers. There certainly is some risk associated with launching the first Big Data project in any organization.
- A Data Lake architecture effectively pushes the technical challenge downstream from IT to the end user. The architecture is based on the proposition that users will have the technical know-how required to manipulate and analyze data from the Data Lake once it has been built. This is a dubious proposition, as most users are from the business side of the house. In the absence of such skills, the Data Lake rapidly degenerates into a Data Swamp.
A similar problem exists in the Internet of Things (IoT) world. At the edge of an IoT architecture, a controller receives high-frequency data such as temperature, speed, vibration, and pressure from local sensors. The data must be transmitted to a central database in the Cloud in order to be analyzed, but the volume may be so high that it cannot all be transferred as-is. Furthermore, the centralized Cloud database may not have the computing power necessary to process all the data from all of the remote sensors.
The solution in the IoT world is edge computing – we put some computing power at the edge, in the controller, to pre-process the raw sensor data, and transmit over the cloud a digest of that data, which requires only a fraction of the bandwidth. This architecture is often referred to as “fog computing”, where fog is simply a cloud that is close to the ground. In a fog computing architecture, parts of the cloud (e.g., aggregation) are brought closer to the data source, in order to reduce the demands on the network and the centralized cloud database.
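The edge-side pre-processing step can be sketched as follows. This is a toy example, assuming the controller buffers a window of raw readings and periodically ships only a small fixed-size summary (the function and field names are hypothetical):

```python
import statistics

def digest(readings):
    """Reduce a window of raw sensor readings to a fixed-size summary."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": statistics.mean(readings),
    }

# 1,000 raw temperature samples collapse to a 4-field digest,
# so only a fraction of the original bandwidth is needed upstream.
raw = [20.0 + 0.01 * i for i in range(1000)]
summary = digest(raw)
```

Whatever aggregation is chosen (statistics, histograms, anomaly flags), the key property is that the digest's size is independent of the raw sample rate.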
This fog computing paradigm can also serve as a low-risk alternative to the Data Lake. Rather than dumping all the data as-is into the Data Lake and then expecting users to process it from the central repository, a fog computing architecture assumes that each data source performs its own initial analysis on its data, e.g., creating aggregations, in order to form a data digest. Each data source then dumps only its digest into the Data Lake, vastly reducing the volume. The result is a far smaller Data Lake that is far easier to analyze in order to arrive at cross-domain insights.
Taking this fog architecture one step further, the Data Lake could be replaced entirely by a self-service BI tool such as Tableau or QlikView. Rather than transferring data digests to a central Data Lake, store each digest in its original data source, and then use the client-side data mashup capability of such a tool to load the individual digests and perform cross-domain analysis.
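The cross-domain mashup itself is just a join on a shared key. The sketch below uses plain Python to stand in for what a BI tool would do internally; the digests, the `region` key, and the field names are all hypothetical:

```python
# Per-source digests, each kept at (or fetched from) its original data source.
sales_digest = [
    {"region": "east", "total_sales": 120_000},
    {"region": "west", "total_sales": 95_000},
]
support_digest = [
    {"region": "east", "avg_tickets": 34},
    {"region": "west", "avg_tickets": 51},
]

def mashup(left, right, key):
    """Join two digests on a shared key for cross-domain analysis."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

combined = mashup(sales_digest, support_digest, "region")
```

Because each digest is already small, this join is cheap enough to run client-side, which is exactly the property the self-service BI approach relies on.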
Given the expense and risk involved in creating a Data Lake, these fog computing alternatives are certainly worth considering before you launch your Data Lake project.