This article is featured in the DZone Guide to Big Data, Business Intelligence, and Analytics – 2015 Edition. Get your free copy for more insightful articles, industry statistics, and more.
With the uptick of Big Data technologies such as Hadoop, Spark, Kafka, and Cassandra, we are witnessing a fundamental change in how IT operations are carried out. Most, if not all, of these technologies are inherently distributed systems, and many of them have their roots in one of today's dominant Web players, especially Google. But how does using these Big Data technologies impact the daily IT operations of a company aiming to benefit from them? To address this question, we take a deeper look at five trends you and your ops team should be aware of when employing Big Data technologies. These trends emerged over the past 15 years and are of a technological as well as an organizational nature.
1. From Scale-Up to Scale-Out
There is a strong tendency across all verticals to deploy clusters of commodity machines connected with low-cost networking gear rather than specialized, proprietary, and typically expensive supercomputers. While the approach itself is likely older than 15 years, Google spearheaded this movement with its Warehouse-Scale Computing Study. Almost all of the currently available Big Data solutions (especially the open source ones, but more on this point below) implicitly assume a scale-out architecture. Need to crunch more data? Add a few machines. Want to process the data faster? Add a few machines.
The adoption of the ‘commodity cluster’ paradigm, however, has two implications that are sometimes overlooked by organizations starting to roll out solutions:
With the ever-growing number of machines, sooner or later the question arises whether a pure on-premises deployment is sustainable: you will need the space and pay hefty energy bills while typically seeing cluster utilization in the low double-digit percentages.
The current practice is effectively to create a dedicated cluster per technology. This means you have a Hadoop cluster, a Kafka cluster, a Storm cluster, a Cassandra cluster, and so on. This is problematic not only because it creates silos (which makes it harder to react swiftly to business needs, for example to accommodate different seasons), but also because the overall TCO tends to increase.
The issues discussed above do not mean you can’t successfully deploy Big Data solutions in your organization at scale; they simply mean that you need to be prepared for the long-term operational consequences, such as opex vs. capex trade-offs and migration scenarios.
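To make the utilization and TCO argument concrete, here is a minimal sketch that compares one dedicated cluster per technology against a single shared cluster. The demand figures are entirely made up for illustration, and the function names are hypothetical; the only point is that clusters sized for individual peaks add up to more machines than one cluster sized for the combined peak.

```python
# Hypothetical demand per time slot, in machines needed, for three services.
# All numbers are made up for illustration.
DEMAND = {
    "hadoop":    [10, 10, 80, 10],
    "kafka":     [40, 20, 20, 20],
    "cassandra": [10, 60, 10, 10],
}


def dedicated_size(demand):
    """Machines needed if every service gets its own cluster sized for its own peak."""
    return sum(max(profile) for profile in demand.values())


def shared_size(demand):
    """Machines needed if one shared cluster is sized for the peak of the combined load."""
    return max(sum(slot) for slot in zip(*demand.values()))


def utilization(demand, size):
    """Average fraction of the provisioned machines that is actually busy."""
    slots = len(next(iter(demand.values())))
    used = sum(sum(profile) for profile in demand.values())
    return used / (size * slots)


if __name__ == "__main__":
    sizes = [("dedicated", dedicated_size(DEMAND)), ("shared", shared_size(DEMAND))]
    for label, size in sizes:
        print(f"{label}: {size} machines, {utilization(DEMAND, size):.0%} average utilization")
```

With these assumed numbers, the dedicated setup needs 180 machines at roughly 42% average utilization, while the shared setup needs 110 machines at roughly 68%. Real workloads will differ, and the sketch ignores isolation and scheduling overhead.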
2. Open Source Rulez
Open Source plays a fundamental role in Big Data technologies. Organizations adopt it to avoid vendor lock-in and to be less dependent on external entities for bug fixes, or simply to adapt software to their specific needs. The open and usually community-defined APIs ensure transparency; and various bodies, such as the Apache Software Foundation or the Eclipse Foundation, provide guidelines, infrastructure, and tooling for the fair and sustainable advancement of these technologies. Lately, we have also witnessed the rise of foundations such as the Open Data Platform, the Open Container Initiative, or the Cloud Native Computing Foundation, aiming to harmonize and standardize the interplay and packaging of infrastructure and components.
As in the previous case of the commodity clusters, there is a gotcha here: there ain’t no such thing as a free lunch. That is, while the software might be open source and free to use, one still needs the expertise to efficiently and effectively use it. You’ll find yourself in one of two camps: either you’re willing to invest the time and money to build this expertise in-house—for example, hire data engineers and roll your own Hadoop stack—or you externalize it by paying a commercial entity (such as a Hadoop vendor) for packaging, testing, and evolving your Big Data platform.
3. The Diversification of Datastores
When Martin Fowler started to talk about polyglot persistence in 2011, the topic was still a rather abstract one for many people—although Turing Award recipient Michael Stonebraker made this point already in his 2005 paper “‘One Size Fits All’: An Idea Whose Time Has Come and Gone.” The omnipotent and dominant era of the relational database is over, and we see more and more NoSQL systems gaining mainstream traction.
What this means for your operations: anticipate the increased usage of different kinds of NoSQL datastores throughout the datacenter, and be ready to deal with the consequences. Challenges that typically come up include:
Determining the system of record
Synchronizing different stores
Selecting the best-fitting datastore for a given use case: for example, a multi-model database like ArangoDB for rich relationship analysis, or a key-value store such as Redis for holding shopping basket data.
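The first two challenges can be illustrated with a small sketch. It uses plain in-memory dictionaries as stand-ins for real datastores, and the class and method names are hypothetical. The idea it shows is one common way to handle the problem, not the only one: a single store is declared the system of record, and every other store is treated as a derived view that can be rebuilt from it at any time.

```python
class OrderStores:
    """Toy model: one system of record plus one derived, query-optimized view."""

    def __init__(self):
        self.record = {}        # system of record: order_id -> order
        self.by_customer = {}   # derived view: customer -> set of order ids

    def put(self, order_id, customer, items):
        """Write to the system of record first, then update the derived view."""
        previous = self.record.get(order_id)
        self.record[order_id] = {"customer": customer, "items": list(items)}
        if previous is not None and previous["customer"] != customer:
            old_ids = self.by_customer.get(previous["customer"])
            if old_ids is not None:
                old_ids.discard(order_id)
                if not old_ids:
                    del self.by_customer[previous["customer"]]
        self.by_customer.setdefault(customer, set()).add(order_id)

    def rebuild_view(self):
        """Re-derive the view from the system of record, e.g. after it drifted."""
        view = {}
        for order_id, order in self.record.items():
            view.setdefault(order["customer"], set()).add(order_id)
        self.by_customer = view
        return view
```

If the derived view is ever lost or out of sync, calling `rebuild_view()` restores it from the authoritative copy. In a real deployment the two stores would be separate systems, and the synchronization step would have to cope with partial failures, which this sketch does not attempt.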
4. Data Gravity and Locality
In your IT operations, you’ll usually find two sorts of services: stateless and stateful. The former include things like a Web server while the latter almost always is, or at least contains, a datastore. Now, the insight that data has gravity is especially relevant for stateful services. The implication here is to consider the overall cost associated with transferring data, both in terms of volume and in tooling, if you were to migrate for disaster recovery reasons or to a new datastore altogether (ever tried to restore 700TB of backup from S3?).
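A back-of-the-envelope calculation shows why data gravity matters. The sketch below estimates how long a bulk transfer takes on a link of a given speed. The link speeds and the `efficiency` factor are assumptions chosen for illustration, and real restores are usually slower because of protocol overhead, throttling, and storage limits.

```python
def transfer_days(terabytes, gigabits_per_second, efficiency=1.0):
    """Days needed to move `terabytes` (decimal TB) over a link of the given speed.

    `efficiency` is the assumed fraction of the nominal bandwidth that is
    actually achieved (1.0 means a perfectly saturated link).
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9 * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    for speed in (1, 10):
        print(f"700 TB at {speed} Gbit/s: {transfer_days(700, speed):.1f} days")
```

Under these idealized assumptions, moving 700 TB takes about 6.5 days on a fully saturated 10 Gbit/s link and about 65 days at 1 Gbit/s.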
Another aspect of data gravity in the context of crunching data is known as data locality: the idea of bringing the computation to the data rather than the other way round. Making sure your Big Data technology of choice benefits from data locality (e.g. Hadoop, Spark, HBase) is a step in the right direction; using appropriate networking gear (like 10GbE) is another. As a general note: the more you can multiplex your cluster (that is, run different services on the same machines), the better prepared you are.
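The following is a minimal sketch of the data locality idea, not the scheduler of any particular system. It greedily assigns each task to a node that already holds the task's data block and falls back to any node with a free slot only when no local node is available. The node and block names are hypothetical.

```python
def place_tasks(block_locations, free_slots):
    """Greedy locality-aware placement.

    block_locations: block id -> list of nodes holding a replica of that block
    free_slots:      node -> number of tasks the node can still accept
    Returns: block id -> (chosen node, True if the data is local to that node)
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    placement = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            raise RuntimeError(f"no free slot left for block {block!r}")
        node = candidates[0]
        slots[node] -= 1
        placement[block] = (node, node in holders)
    return placement


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n2", "n3"], "b3": ["n1"]}
    print(place_tasks(blocks, {"n1": 1, "n2": 1, "n3": 1}))
```

In this example, `b1` and `b2` run where their data lives, while `b3` has to run on `n3` and read its block over the network because `n1` is already busy. Every such remote read is traffic that faster networking gear has to absorb.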
5. DevOps is the New Black
The last trend here is not necessarily Big Data specific, but surprisingly often overlooked in a Big Data context: DevOps. As it was aptly described in the book The Phoenix Project, DevOps refers to the best practices for collaboration between the software development and operational sides of an organization. But what does this mean for Big Data technologies?
It means that you need to ensure that your data engineer and data scientist teams use the same environment for local testing as is used in production. For example, Spark does a great job allowing you to go from testing to cluster submission. In addition, for the mid-to-long run, you should containerize the entire production pipeline.
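As a rough illustration of the Spark point, the sketch below builds the `spark-submit` command line for the same application in two environments. The helper function and the file name `etl_job.py` are hypothetical; `--master` and `--deploy-mode` are standard `spark-submit` options, and using YARN as the cluster manager is just an assumption for the example.

```python
def spark_submit_command(app, master="local[*]", deploy_mode=None, app_args=()):
    """Build a spark-submit argument list; only the target environment varies."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        cmd += ["--deploy-mode", deploy_mode]
    return cmd + [app, *app_args]


if __name__ == "__main__":
    # Local testing on a laptop: run with as many worker threads as cores.
    print(" ".join(spark_submit_command("etl_job.py")))
    # Production: same application file, submitted to a cluster manager.
    print(" ".join(spark_submit_command("etl_job.py", master="yarn", deploy_mode="cluster")))
```

The application code stays identical in both cases, which is what keeps local testing and production runs comparable.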
With the introduction of Big Data technologies in your organization, you can quickly gain actionable business insights from the raw data. However, there are a few things you should plan for from the IT operations point of view, including where to operate the cluster (on-premises, public cloud, hybrid), what your strategy is concerning open source, how to deal with different datastores as well as data gravity, and, last but not least, how to set up the pipeline and the organization in a way that developers, data engineers, data scientists, and operations folks can and will want to work together to reap the benefits of Big Data.
Michael Hausenblas is a Datacenter Application Architect at Mesosphere. His background is in large-scale data integration research, the Internet of Things, and Web applications and he's experienced in advocacy and standardisation (World Wide Web Consortium, IETF). Michael frequently shares his experiences with the Lambda Architecture and distributed systems through blog posts and public speaking engagements and is a contributor to Apache Drill. Prior to Mesosphere, Michael was Chief Data Engineer EMEA at MapR Technologies, and prior to that a Research Fellow at the National University of Ireland, Galway where he acquired research funding totalling over €4M, working with multinational corporations such as Fujitsu, Samsung and Microsoft as well as governmental agencies in Ireland.