
Hadoop and Spark on Docker: Ten Things You Need to Know


Thinking of building your own containerized solution for large-scale big data analytics? Stop. Just don't do it. Here's why.


For a while now, I've been struggling to understand why any enterprise would want to build their own solution for large-scale deployments of Big Data workloads like Hadoop and Spark on Docker containers. The arguments for "doing it yourself" (DIY) often play like a broken record:

  • "If they <insert name of humongous tech giant here> can do it, we can do too."
  • "It's all available in open-source tools."
  • "We don't want vendor lock-in."
  • "No one understands our data needs."
  • "No one understands our security/workflow/performance requirements."

Every enterprise makes the build vs. buy determination for each big software project it undertakes. The specific steps may differ, but they haven't really changed much over the years. They typically go something like the following (as outlined in an article from several years ago called "Buy vs. build: Six steps to making the right decision"):

  1. Validate the need for the technology.
  2. Identify core business requirements.
  3. Identify architectural requirements.
  4. Examine existing solutions.
  5. Determine whether you have the in-house skills to support a custom solution.
  6. Determine whether a COTS (commercial off-the-shelf) solution fits your needs.

But big data is different. Technologies in the big data ecosystem are morphing and emerging by the month; business requirements for big data are still evolving; analysts and data science teams always want to try out the latest and greatest open-source tools; and the infrastructure and operational requirements keep changing, too. The ecosystem around Docker containers is also evolving at an extraordinarily rapid pace, and as enterprise adoption of containers becomes more widespread, new options and considerations emerge all the time.

So when it comes to running big data workloads on Docker containers in the enterprise, trying to follow the steps above for a build vs. buy decision just won't work. Any effort to define the business and architecture requirements today will likely be obsolete, or at the very least greatly changed, by tomorrow. It's tempting for an enterprise to take this type of project in-house so it can supposedly maintain maximum flexibility, pivoting with new technology developments and evolving with the latest business requirements. But before you take that trip down the DIY road, let me ask you one question: "Would you build your own car?"

Seriously. Your car is a technological marvel, constantly placed under new demands, and essential to your lifestyle. Enterprise big data initiatives are equally complex, required to meet changing demands, and critical to the business. So in this blog post, I'm going to look at some of the things to keep in mind when building your own car. Perhaps they can provide some insight into what it takes to deploy and manage big data workloads on Docker containers in the enterprise.

In fact, I found an article here that highlights the "Ten Things You Need to Know Before Building Your First Project Car." Interestingly enough, most of those ten things apply equally well to the build vs. buy decision for a containerized big data solution.

10. Look Out for Rust

With respect to big data, this means: don't build on last year's (or last month's) technology. Choose your components wisely; the recent past is littered with failures. You need the ability to "future-proof" your deployment: you wouldn't want to lock yourself into one Hadoop version or even one big data toolset (the emergence of Spark as an alternative to MapReduce is just one recent example). You'll need the flexibility to adapt to new options, new versions, and new innovations in the big data ecosystem.
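
To make the MapReduce-versus-Spark point concrete, here is a minimal word-count sketch in PySpark (the HDFS paths are placeholders); the same job once required a hand-written Java MapReduce program with separate mapper and reducer classes, which is exactly the kind of shift a locked-in deployment can't absorb.

    # Minimal word count in PySpark -- the canonical job that once required a
    # hand-written MapReduce program. The HDFS paths below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    lines = spark.read.text("hdfs:///data/input")             # placeholder input path
    counts = (
        lines.rdd.flatMap(lambda row: row.value.split())      # split each line into words
             .map(lambda word: (word, 1))                     # emit (word, 1) pairs
             .reduceByKey(lambda a, b: a + b)                 # sum the counts per word
    )
    counts.saveAsTextFile("hdfs:///data/output")              # placeholder output path
    spark.stop()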

From an infrastructure standpoint, one need look no further than OpenStack. A DIY solution built on OpenStack is very likely undergoing a complete rewrite as I write this blog; in fact, if you tried to run your big data workloads on OpenStack in the past, the odds are that you're now looking at containers instead.

9. Buy Good Tools

In this context, I'm not going to refer to open-source software or even commercial software applications in the big data industry as the "good tools" you'll need. I'm going to extrapolate to the people you'll need on your staff: the data engineers and data scientists, the big data architects and specialists. They are in short supply, and they are expensive. If you can find them, can you afford them?

That big hyper-scale technology company that used open-source software and its own internal team to successfully deploy a large-scale Hadoop or Spark implementation on Docker containers probably spends tens of millions of dollars each year on staffing alone; it needs to employ the very best engineers in the business to keep that platform up and running. Most enterprises can't afford that kind of staff, even if they could hire them.

8. Adding Speed Isn't Cheap

And it is not just speed (known as performance in big data parlance). When it comes to deploying an enterprise-ready big data solution, nothing is cheap. Performance isn't cheap. Security isn't cheap. High availability isn't cheap. Open-source tools may sound free, but they're free like a free puppy (i.e., all the care and feeding required isn't free). It all costs money, especially if you're trying to do it yourself. How fast and reliable you want it (i.e., how performant, secure, and highly available) will ultimately depend on how much you want to spend (i.e., time, effort, and skilled resources).

7. Tag and Bag Everything

This means you absolutely must sweat the details up front. Big data project failures are more often than not preceded by the statement: "We will do this bit now, and figure the rest out later." But you need to begin with the end in mind.

You need to know what your requirements are and what performance you'll actually be able to deliver. You need to know how to integrate with your corporate Active Directory, LDAP, and Kerberos services. You need to know your network topology and security requirements, as well as the required breakdown of user roles and responsibilities. You need to know how you'll handle high availability, QoS, and multi-tenancy. You need to know how you'll manage upgrades to the latest versions of your Hadoop distribution or other big data tools, and how you'll respond to requests for new big data frameworks and new data science tools. If not, you're just asking for trouble.
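
To give just a taste of that detail, here is a hedged sketch of what a single Spark job submission might look like against a Kerberized, multi-tenant YARN cluster. The principal, keytab path, and queue name are illustrative assumptions, and the exact property names vary by Spark version; the point is that even one job touches your Kerberos, multi-tenancy, and QoS decisions.

    # A hedged sketch of one small corner of the detail to pin down up front:
    # submitting a Spark job to a Kerberized, multi-tenant YARN cluster.
    # The principal, keytab path, and queue name are illustrative assumptions.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("kerberized-etl")
        # Kerberos identity for the job (property names vary by Spark version;
        # newer releases use spark.kerberos.principal / spark.kerberos.keytab)
        .config("spark.yarn.principal", "etl-svc@CORP.EXAMPLE.COM")
        .config("spark.yarn.keytab", "/etc/security/keytabs/etl-svc.keytab")
        # YARN queue chosen according to your multi-tenancy and QoS design
        .config("spark.yarn.queue", "analytics")
        .getOrCreate()
    )
    # ... the actual job logic would go here ...
    spark.stop()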

6. Someone Will Build Something Better (That's OK)

This is where building a solution for large-scale big data analytics in the enterprise differs from building your own car. If you're building your own car, you're probably doing it in large part for your own satisfaction and enjoyment. But the big data analytics used by your analysts and data science teams are becoming essential to running your business. If your platform breaks down, it will likely have significant implications for the business. And that's not OK.

If your competitors have a better solution for their big data analytics, they may have better insights to help drive business innovation, and that can have serious consequences in terms of lost business opportunities, customers, or revenue.

5. Don't Have a Project Car as Your Only Car

As mentioned above, this endeavor is not a college class (where failure means a low grade) or a personal pet project (where success is measured by your own satisfaction). This is about your business. Failure can have a significant impact on your organization's top line or bottom line. You cannot afford to take risks when it comes to the big data analytics and infrastructure that are critical to your business. You need to be sure that it will work.

4. It's Only Cheaper If It Works, Not If You Have to Do It Twice

This cannot be over-emphasized, but it is frequently overlooked. As mentioned in point #10, changing technology stacks is expensive. You need to be absolutely certain not only that what you build will work correctly the first time, but also that it can support all your big data needs in the future.

3. Do Your Homework

You need a solution that will support a large-scale deployment of big data workloads like Hadoop and Spark on Docker containers. Before you start, be sure to investigate all the options: you can build it yourself from scratch, you can build it from parts and blueprints (a kit car, anyone?), or you can buy it pre-built and ready to go. By now you may be a bit reluctant to build that system from scratch. But maybe you're still considering a kit car, i.e., building it out of readily available parts and toolsets. Let me offer some simple words of caution here: it won't work.

For example, I've met with several enterprise IT organizations that have tried using container orchestration tools (like Kubernetes or Mesos with Marathon) to build out their own containerized big data environments. But container orchestrators are just that: container orchestrators. These organizations ultimately realized there were a lot of gaps and missing parts when deploying and running multiple big data clusters in containers at petabyte scale.

The rationale for a kit car is that all the pieces are provided by a single manufacturer and are guaranteed to fit together into a working whole, often using nothing more complex than a screwdriver and a torque wrench. So some IT shops may assume that those open-source container orchestration tools have already done all the heavy lifting for orchestration. And since those tools work well for stateless apps, they assume it will be easy to assemble the available parts and make it work for stateful apps like big data workloads.

But in their current state, that is certainly not the case with today's container orchestration tools. What is available today is more like a set of rough blueprints: you need to scrounge up the parts you can, manufacture the ones you can't, use a bigger hammer when they don't quite fit, be prepared to do a complete teardown and redesign right in the middle of final assembly when one (or more) of your assumptions doesn't pan out, and hope the whole thing won't catch fire when you take it for a test drive. And for big data workloads like Hadoop and Spark, there are many other considerations to take into account. I recently co-presented a webinar where we discussed this topic in depth; you can watch the webinar replay here or at the bottom of this blog post.
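
To illustrate how much of that assembly is still on you, the hedged sketch below (using the Kubernetes Python client, with a hypothetical HDFS DataNode image, namespace, and sizing) is roughly the extent of what an orchestrator hands you out of the box: stateful pod scheduling plus persistent volume claims. Kerberos integration, multi-tenancy, data locality, and rolling upgrades of your distribution are all still yours to engineer.

    # A hedged sketch of what a container orchestrator does give you out of the box:
    # a StatefulSet of HDFS DataNode pods with persistent volume claims, created via
    # the Kubernetes Python client. The image name, namespace, and sizing are
    # hypothetical; everything beyond scheduling and storage claims is left to you.
    from kubernetes import client, config

    config.load_kube_config()              # or load_incluster_config() inside a pod
    apps = client.AppsV1Api()

    datanode = client.V1Container(
        name="datanode",
        image="registry.example.com/hdfs-datanode:3.2",       # hypothetical image
        volume_mounts=[client.V1VolumeMount(name="data", mount_path="/hadoop/dfs/data")],
    )

    statefulset = client.V1StatefulSet(
        metadata=client.V1ObjectMeta(name="hdfs-datanode"),
        spec=client.V1StatefulSetSpec(
            service_name="hdfs-datanode",
            replicas=3,
            selector=client.V1LabelSelector(match_labels={"app": "hdfs-datanode"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "hdfs-datanode"}),
                spec=client.V1PodSpec(containers=[datanode]),
            ),
            volume_claim_templates=[
                client.V1PersistentVolumeClaim(
                    metadata=client.V1ObjectMeta(name="data"),
                    spec=client.V1PersistentVolumeClaimSpec(
                        access_modes=["ReadWriteOnce"],
                        resources=client.V1ResourceRequirements(requests={"storage": "500Gi"}),
                    ),
                )
            ],
        ),
    )

    apps.create_namespaced_stateful_set(namespace="big-data", body=statefulset)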

2. It'll Take Longer Than You Think

Murphy's Law applies here, and so does Hofstadter's Law: "It always takes longer than you expect, even when you take into account Hofstadter's Law."

1. It'll Cost More Than You Think

As cited in the article about building a project car: "Price out what you think the project will be and double it."

These last two tips are true for every project, whether it's building your own car, remodeling your kitchen, or developing your containerized platform for big data analytics. Always look for ways to reduce time and cost, and one of the best ways to do that is by reducing risk. Not only will it take much longer and cost much more than you think, but there is also a high risk of failure in doing it yourself.

One final point: your DIY project car won't have Tesla's "Insane Mode," so you certainly can't expect it to go zero to 60 miles per hour in three seconds. Just as you won't be able to recreate Elon Musk's innovations when building your own car, you simply won't be able to replicate the patented innovations we've built here at BlueData (i.e., our IOBoost functionality for boosting I/O performance and our DataTap technology for enabling compute/storage separation). For example, we've proven you can get bare-metal performance while running big data workloads on containers with our platform, but you won't get that with a DIY solution.

Still thinking of building your own containerized solution for large-scale big data analytics? Just don't do it.


Topics:
big data, hadoop, spark, docker, containerization, big data analytics

Published at DZone with permission of Tom Phelan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
