Addressing the Complexity of Big Data With Open Source
Take a look at this article from our recently released Open Source Guide that explains how Bigtop helps reduce open source complexity.
Simple software is a thing of the past. Think about it: No program out there is created in a vacuum. Every program uses libraries, has run-time dependencies, interacts with operational environments, and reacts to human inputs. Free and open source software, as a creative free-market approach to software development, provides more than one solution for every challenge. There are multiple compilers, operating systems, statistics packages (known today as machine learning), test frameworks, orchestration solutions, and so on. Each project moves at its own speed, releasing new features and adding new attributes. Imagine for a second that there is a need to combine a few of these complicated projects into a meta-complex system. It sounds quite sophisticated, doesn't it?
The Big Data Zoo
Just like a zoo with hundreds of different species and exhibits, the big data stack is built from more than 20 different projects developed by committers and contributors of the Apache Software Foundation. Each project has its own complex dependency structure, and these structures build on one another very much like Russian nesting dolls (matryoshka). Further, all of these projects have their own release trains, where different forks might include different features or use different versions of the same library. When the projects are combined, there are a lot of incompatibilities, and, as in any software stack, many of the components rely on each other to work properly. For example, Apache HBase and Apache Hive depend on Apache Hadoop's HDFS. In this environment, is it even possible to consistently produce software that would work when deployed to a hundred computers in a data center?
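To make the dependency problem concrete, here is a minimal Python sketch that derives a build order from a dependency map. Only the fact that HBase and Hive depend on Hadoop comes from the text above; the map layout and the `build_order` helper are assumptions made for illustration, not how any of these projects resolve dependencies.

```python
from graphlib import TopologicalSorter

# Component -> set of components it depends on.
# The HBase/Hive -> Hadoop edges come from the article; the layout is illustrative.
DEPENDENCIES = {
    "hadoop": set(),
    "hbase": {"hadoop"},
    "hive": {"hadoop"},
}


def build_order(dependencies):
    """Return components ordered so that each one follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(build_order(DEPENDENCIES))
```

With three components the order is obvious, but with more than 20 projects and nested dependencies, computing it by hand quickly stops being practical.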
Reining in Complexity
A software stack is the final bundle that can be given to a customer as a package or a container image. It is defined by a bill of materials specifying the exact name, version, and patch level of each component. Customers expect that each individual software project has been carefully selected and altered so that it works with the rest of the distribution, and that everything has been validated at the integration and system level.
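A bill of materials can be pictured as a list of pinned components. The sketch below is one possible representation under stated assumptions: the `Component` type, the `find_conflicts` helper, and the version strings are all hypothetical placeholders, and this is not Bigtop's actual bill-of-materials format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One pinned entry in a stack's bill of materials (hypothetical type)."""
    name: str
    version: str
    patch_level: int = 0


def find_conflicts(components):
    """Return the sorted names that are pinned more than once with different pins."""
    seen = {}
    conflicts = set()
    for component in components:
        pin = (component.version, component.patch_level)
        # setdefault returns the first pin recorded for this name.
        if seen.setdefault(component.name, pin) != pin:
            conflicts.add(component.name)
    return sorted(conflicts)


# Placeholder versions, chosen only for illustration.
BOM = [
    Component("hadoop", "1.0.0"),
    Component("hbase", "2.0.0", patch_level=1),
    Component("hive", "3.0.0"),
]

if __name__ == "__main__":
    print(find_conflicts(BOM))
```

Pinning every component in one place means a mismatch, such as two different versions of the same project, can be caught before anything is built.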
Fortunately, this issue has already been resolved in operating systems. Debian Linux is probably the most famous example, as it has allowed for the creation of many derivatives and offspring over the years. By understanding the principles required to create complex software systems like an operating system, is it possible to build a solution that creates a complex stack based on Java, C, or scripting languages, or a combination of them? This was the question that a number of experts attempted to address when producing and deploying Hadoop stacks in the early days of the platform.
Controllability has the power to lower the cost of development. That’s why we use version control, the CI/CD process, and automated deployments. Apache Bigtop was conceived as a framework that would apply and enforce software development best practices across the lifecycle of any complex stack. Bigtop graduated to an ASF top-level project in 2012. At a high level, it consists of the following pieces: packaging code, deployment code, a Docker-based mechanism to create clusters for development and testing, a distributed integration test framework, tests for stack validation, and a build system written in Gradle (whose build scripts use a DSL based on Apache Groovy). The development lifecycle looks like this: write and commit the code (either for Bigtop itself or for one of the components referenced by its bill of materials), run the build, fire up a cluster (either Dockerized or real if you have spare hardware handy) using the provided Puppet recipes for deployment and configuration of the fresh packages, run the tests, and analyze the results. Repeat as many times as needed.
Does this sound familiar? You bet! This is pretty much how any well-constructed software pipeline is done today. There is one key difference, though: Bigtop users are dealing with highly complex systems that require some serious orchestration. It isn't exactly your grandpa's mobile app. Yet, Bigtop makes the process of creating a software stack so seamless that I was able to produce a fully equipped and tested commercial distribution including Hadoop, HBase, Hive, and more in under four weeks with the help of a small team of engineers.
All of these moving parts effectively serve one purpose: to create the packages from known building blocks and transfer them to different environments (dev, QA, staging, and production) so that no matter where they are deployed, they will work the same way. The deployment mechanism needs to control the state of the target system. In other words, if you use the same source code and run the same build and deployment over and over again, you will end up with the same results every time. Relying on a state machine like Puppet or Chef has many benefits. You can forget about messy shell or Python scripts to copy files, create symlinks, and set permissions. Instead, you define "the state" that you want the target system to be in, and the state machine will execute the recipe and guarantee that the end state will be as you specified. The state machine controls the environment instead of assuming one. These properties are great for operations at scale, DevOps, developers, testers, and users, as they know what to expect.
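The idea of declaring a desired state, rather than scripting steps, can be sketched in a few lines of Python. This is only an illustration of the principle, assuming a POSIX file system; `ensure_file` is a hypothetical helper, not part of Puppet, Chef, or Bigtop, and the file name and content are made up.

```python
import os
import stat
import tempfile


def ensure_file(path, content, mode=0o644):
    """Converge `path` to the desired content and mode; return True if anything changed."""
    changed = False
    try:
        with open(path, "r", encoding="utf-8") as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None
    # Write only when the content differs from the desired state.
    if current != content:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
        changed = True
    # Fix permissions only when they differ from the desired state.
    if stat.S_IMODE(os.stat(path).st_mode) != mode:
        os.chmod(path, mode)
        changed = True
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "stack.conf")  # hypothetical file name
        print(ensure_file(target, "key=value\n"))  # first run applies the state
        print(ensure_file(target, "key=value\n"))  # second run finds nothing to do
```

Because the function inspects the current state before acting, running it repeatedly is safe, which is the property that makes repeated deployments produce the same result.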
Building the Ecosystem
Now you know why Bigtop is the framework used by all vendors of the Hadoop ecosystem. Amazon EMR, Google Dataproc, and others are using it extensively. ODPi has chosen Bigtop as the foundation for the industry's first reference architecture for a Hadoop-based stack. It's hard to overestimate the importance of open standards in the modern software industry. If you're an application developer, your costs would increase significantly if you had to deliver your product to two different platforms. Wouldn't it be cool if you could certify against a reference architecture and then your code would work on all compatible clusters? That's exactly what Bigtop will help you achieve while also reducing the cost of development and production.