If my experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about Hadoop in the cloud. The cloud is a complex environment that differs in many ways from the data center and it's full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.
No Argument Here
Before getting into Hadoop, let’s be clear that there's no question that the cloud kicks the data center’s a** on cost for most business applications — yet, we need to look closely at why because Hadoop usage patterns are very different from those of typical business applications.
Interestingly, purely in terms of raw cycles per dollar’s worth of hardware, it is not at all clear that the cloud is cheap; you have to be precise about what you mean. As we’ll see below in an extreme case, were you to run large numbers of machines indefinitely at high utilization, you’d find that cloud cycles are actually quite expensive — likewise with certain data use patterns.
Nevertheless, cloud economics crush the data center for most business applications for several reasons:
- Only relatively rarely does cloud vs. data center boil down to cycles per dollar. Cloud providers are primarily selling VMs, not raw cycles. The goal of beating data center cost is a sitting duck for providers because data center utilization rates notoriously vary from abysmal to not very good. Ganging up many VMs per server allows the cloud provider hold the cost per VM down while still renting VM’s relatively cheaply. This is possible precisely because cycles are not the main product.
- The cost of the hardware is only part of the total cost of deployment, even if the raw cycles work out to be expensive. Where many enterprises spend the big money is the myriad extras such as rack space, networking, security, hardware maintenance and replacement, administration, backups, managing cool and cold storage, durable backups, IT department overhead, management overhead, etc., to say nothing of electricity, AC, and connectivity. Cloud providers throw most of this in for free, and the services are better quality than even many advanced IT departments can achieve.
- Even if the total cost of VMs wasn’t lower than running your own (and for general applications, it usually is) that’s still not all there is to it. AWS and the other clouds are incredibly nimble. You can spin up a new server in minutes, add load-balanced instances, or set up an entire virtual private cloud before lunch. You can shut them down just as fast. Getting that done in a data center through your IT department can take months and costs a fortune.
Yet, for all that, there do remain classes of computing applications for which the cloud cannot be assumed to be competitive. For instance:
- Applications where communication latency is critical. For example, round trip times from earth to the cloud are orders of magnitude too long to be competitive for securities trading. A round-trip to the cloud takes several tens of milliseconds, which is an eternity for traders who can win or lose trades by a microsecond.
- Applications that need close control of hardware resources (i.e., CPU, network, or storage-I/O). Sharing doesn’t simply reduce the resources a given user gets proportionately. It also reduces the total amount available because of inefficiencies it induces such as paging delays, cache flushing, extra disk seeks, breaking up smooth I/O streams, etc.
- Extremely resource-intensive applications. If your need for raw cycles and data is great enough, for long enough, it pays to cut out the middleman. This is especially true if the platform is highly consistent, which minimizes administrative complexity per node.
All of these — to some extent, but especially second and third — apply to Hadoop.
It Definitely Works
Hadoop clearly works at scale in the cloud. Netflix, for instance, reports that they operate thousands of cloud nodes that process staggering loads. At this writing, they process something like 350 billion events and petabytes of reads daily, numbers they expect to increase rapidly in the next few years.
The caveat is that it is far from clear that such a deployment is particularly efficient in terms of bytes-processed-per-node or events-per-dollar when compared to smaller earth-bound clusters. It might be or it might not be; it’s extremely hard to say. You can’t judge a technology from the success of a behemoth like Netflix because some jobs are so big that they can only be done inefficiently. That may sound like a perverse statement, but it’s actually common at many scales. Think of MapReduce itself. My 2010 laptop smokes any Hadoop cluster at sorting a megabyte of data. It’s not a contest. MR and Tez don’t come into their own until you’re into many gigabytes, but after that, their advantage increases relentlessly forever. Also, operating on that scale often involves major geographic issues that may not come into play at smaller scales.
Monster deployments also tend to be very complex, with many more moving parts than mere Hadoop. The companies operating at that scale usually work closely with the provider’s engineers as well as the staffs of their software vendors. Deciding whether the cloud will be cost-effective for your 30-node starter cluster involves an entirely different set of variables.
Apples and Oranges
For a typical business application, the burden of proof is on anyone who says don’t deploy in the cloud, but the decision is maddeningly complex for Hadoop.
There are two main reasons why it is difficult to predict in advance the cost-effectiveness of the cloud for Hadoop. Pretty much the same goes for other clouds, but I’ll talk about this in terms of AWS.
Firstly, Amazon instance capacity is defined in terms of a unit called vCPU, which has no clear definition and varies with the CPU type. Basically, you can think of it as equivalent to about half of the compute capacity of a hardware core on a processor of the type that underlies the instance. That’s a lame description, but you can talk all day and not come up with a better one. There are too many variables. Using vCPU to describe processing power is like using horsepower to describe vehicles: a Ferrari and an 18-wheel semi have approximately the same horsepower.
This is by far the best explanation of what a vCPU is that I have seen anywhere (courtesy of Marc Fielding on the Pythian Web site). It’s three years old, but I think it is still very sound.
For practical purposes, you can guestimate that to duplicate the power (whatever that means) of an eight-core data center server, you need 16-vCPUs running on the same family of hardware.
The second reason is that even given a rough measure of processing power equivalence, the properties of a data center cluster and a supposedly equivalent cloud cluster will differ in many dimensions.
- It’s not just raw cycles. As mentioned above:
- Workload type is a significant factor in hyper thread performance.
- Hardware resources (CPU, memory, disk-swapping, etc.) are shared with an unpredictable number and variety of users.
- The mounted block-storage models are quite different from disk.
- The S3 model, for better and worse, is completely different from anything in classic Hadoop.
- Cloud environments aren’t comparable to the data center counterparts.
- The relationship of the VM to the hardware is abstracted in the cloud.
- The inter-instance LAN is shared with users outside the cluster.
- S3 is an HTTP service and is not at all like disks.
- The devices backing EBS volumes are shared with other users and EBS itself is a complex service that does more than serving blocks.
- Important physical components that Hadoop uses in optimization are entirely abstracted away (i.e., racks.)
- The cloud supplies many things with an instance that you must take care of yourself in a data center deployment.
Even if the computational power you need in your cluster, over the duration of time you need it, is not cost effective in, the cloud may still be the right choice because:
- No matter how good your IT department is, deploying in the cloud is faster. You can deploy a major cloud cluster in hours. Hardware takes months just to buy and install.
- Managerial bias. There is a massive shift to the cloud underway, and expanding in the data center moves against the tide. Expect resistance to any project targeted to the data center. It’s a fact of business life.
- Hadoop doesn’t live in a vacuum. The economics of the cloud for the many other applications that use it or feed it may compensate for marginal economics in the cluster.
- AWS provides a very capable model for related items such as backing up data, resource availability, fail-over, etc.
While administration cost is lower in the cloud, beware that it can require more ongoing engineering and user support to get good performance.
EBS is a NAS, not a DAS. It reads and writes traverse a network. This affects performance in multiple ways:
- Reading across a network causes increased latency, which means the amount of time it takes to get the first byte, is greater regardless of the streaming bandwidth. Compared to streaming, latency is a secondary concern for Hadoop, but be aware that magnetic-backed EBS has a 10-40 MS latency, while SSD EBS is single-digit-MS. On a data center machine, SATA disk latency might be 10-15 MS and SSD latency would a tiny fraction of a millisecond.
- The underlying storage device that supports an EBS volume is shared among multiple volumes serving others within and possibly without your cluster. The number of users and what they are doing definitely affects both average latency and throughput, as well as the variance seen in those measures, especially if you or others sharing the resource are using transparent encryption on the back end.
Bandwidth and latency issues are hard to get a handle on.
Note that we’re only talking about AWS EBS-optimized nodes here. Optimized nodes provide a guaranteed amount of bandwidth to EBS. Without optimization, all the EBS traffic goes over the same network that your instance use to talk to each other, which limits you to a toy cluster.
Here are the main points you need to consider.
- The guarantee we’re talking about is for bandwidth to and from EBS, not for the amount of data you can actually fetch in a given time or with a given latency. These can vary for the worse by a significant amount depending on usage pattern, whether you use encryption, how many volumes you are accessing, etc.
- Maximum bandwidth to and from a given volume is limited by the volume type. These types have very different prices.
- Old-school magnetic-backed EBS is limited to about 40-90MB/sec.
- SSD is higher — up to about 160MB/sec (this is usually the default choice).
- Provisioned IOPS SSD can go higher — up to 320 MB/sec per volume.
- Aggregate bandwidth from a given instance to all of the EBS volumes that it has mounted is limited by the combination of instance type and vCPU count. Here is a chart that gives the numbers. For example, a typical server choice for Hadoop might be an m4.4xlarge with 16 vCPU’s. This would be equivalent to a pretty decent eight-core server and has a guarantee of 2000Mbps=250MB/sec aggregate bandwidth to EBS.
Note that 250MB/sec is about equal to the I/O bandwidth that a data center server will see from five commodity SATA drives. Such a machine might host twenty or more data disks, so you can get about four times the I/O bandwidth on the data center machine but at somewhat lower latency.
The latency differences between EBS and DAS disks are large and the source of much confusion. SSD-backed EBS is nothing like having SSD on the machine itself. (You can get this too but it is very expensive.) The latency of SSD-backed EBS is closer to that of a local SAS disk drive than it is to that of a local SSD disk, which for most purposes is negligible. The latency of magnetic-backed SSD is similar to that of an HTTP GET call. Figure three to five times the latency of a commodity SATA disk.
In fairness, EBS is a service and not simply a mounted file system. It does more than blindly serve blocks. It optimizes by merging sequential reads, does caching and pre-loading, implements transparent encryption, and provides higher availability than a disk.
On the other hand, EBS requires that a backlog of operations be enqueued in order to reach maximum throughput. In other words, getting the highest throughput depends upon a factor that tends to increase the latency.
S3 is a strange beast and not to be confused with a file system.
- S3 is a REST service accessed over HTTP, not a mounted block file system. The latency is large compared to either direct attached disk or EBS as the data makes many network hops to get to you.
- S3 has unbounded storage capacity.
- Performance can depend upon naming conventions in several ways.
- Compared to EBS, the cost ranges from cheap to extremely cheap. At its most expensive, S3 is an order of magnitude cheaper than EBS.
- S3 is optimized for IOPS, which is best for coarse-grained lookups for substantial data such as, for instance, web pages from a huge set. It has poor streaming capabilities and it is not a good way to access a random small bit of data (both of which are important to Hadoop.)
Noisy neighbors are a factor for many components including EBS, S3, LAN, CPU, etc.
Economic Pros and Cons of the Cloud
The economics of the cloud are complicated and highly dependent upon what you’re trying to accomplish even within the restricted concerns of Hadoop. Some of the differences are good, some are bad, and some are both.
Raw Instance Cost (Con)
Whatever the advantages of the cloud, contrary to its reputation, it is definitely not cheap in terms of raw CPU cycles/dollar. We’ll see below that $/CPU/hour is too crude a measure, but consider the following simplified example to get started.
As alluded to before, it’s hard to compare hardware with VM’s in the cloud, but 16 vCPU m4.4xlarge costs $0.862/hour, or $20.68/day, which is $7,551/year. EBS volumes cost $0.10/GB/month (for the recommended default SSD-backed storage) so if your big server has 8TB of EBS, that’s $9,600/year for disk, for a total of $17,151/year. Note, that’s only 2.6TB of replicated storage per node, which is quite small for Hadoop, but we’ll assume a more complex cold-storage policy than you might have on a data center cluster.
Less powerful machines are cheaper, for instance, an m4.2xLarge with 8 vCPU’s costs $0.431/hour, but you need twice as many to get equal power or the same level of guaranteed bandwidth. The vCPU’s for a given type generally work out to about the same price.
That same $17,151 will buy a very nice commodity server outright, including the server’s share of a rack, cabling, switching, etc., so the raw cost of the hardware on AWS is something like 5X as much as the purchase price of an approximately equivalent machine with many times as much storage, even without working in the depreciation, which can offset a large part of the nominal purchase price.
HDFS Storage Cost (Con)
This one was a surprise to me.
It’s typical to spend more on EBS volumes than on instances. In the above example, we’re paying $9,600/server/year for the disk space alone.
Commodity SATA drives cost less than $100/TB. Not $100/year, but $100, period. Over five years, that would be 0.00833 of the cost of EBS—less than 1%.
For a concrete example, take TrueCar’s often published claims about storage cost on their terrestrial clusters. They reported three years ago that they pay $0.23/GB for storage in their data center Hadoop cluster. That’s replicated, so it’s $0.015/GB/year for raw data, compared to AWS’s current price of $1.20/GB/year, or about .0125 the cost of cloud storage — not too different from the estimate above, even if you don’t allow for three years of technology improvements.
Data center nodes typically have a lot more storage, too—as of 2017, a Hadoop server might have 24 two-TB drives each. That much HDFS storage would cost $57,600/year on AWS!
Regular S3 costs $0.023/GB/month, but TrueCar’s cost for SATA works out to $0.0038, which is about 1/6 the cost. Not until you get to their Glacial S3 storage, at $0.004 does AWS beat commodity SATA disks in the data center. S3 is only semi-online storage, so the name may refer to the speed of access as much as to the data temperature.
As mentioned above, the data center box has about four times the throughput, so that data can be served up by less capable servers as well because there is less need for compression and other CPU intensive steps to conserve I/O bandwidth.
Power and Rack Space (Pro)
AWS is supplying the power for cloud deployments, as well as the cooling, rack space, hardware installation and physical security, probably with higher reliability than any lesser organization can hope to achieve.
Power and Rack Space (Con)
Say your server uses 250W — at $0.20/KWH, that’s $438/year. Double that to cover cooling, it’s only $876/server/year, which is not a large number compared to the hardware cost.
Even if rack space cost in your own data center is equal to the entire hardware cost (which would be a lot), the total cost of hardware, power, and rack space is still under $13,000/year.
Management, Administrative Skills, User Skills (Pro)
Managers often forget the overhead of management itself when assessing relative cluster costs. Getting a data center project approved and then shepherding the decisions on machine types, space, work scheduling, etc. can consume an enormous amount of time for many months. Management hassles from up the hierarchy are also often much larger for a data center project than the analogous process for a cloud project.
A new hardware or software platform also takes hardware skills, networking expertise, cluster security skills, Linux admin skills, conceptual understanding of the platform and its goals, etc. You need to be pretty big for the necessary expertise to be a commodity item that is amortized across many machines.
Say the 12-node cloud cluster for your pilot is costing you $200K/year in AWS bills. That sounds like a lot, but that’s less than the net cost of one person, be it a manager or an engineer.
All in all, the human cost of setting up small clusters can easily exceed the concrete costs.
Time to Production and Opportunity Cost (Pro)
This is the silver lining you’ve heard about. Setting up your own hardware cluster takes a lot of money and time up front. Just getting buy-in from the IT folks about what machines to purchase can be an ordeal, particularly because the ideal machine for Hadoop is quite different from what you’d choose to support typical corporate applications. In general, Hadoop requires an entirely different sensibility and educating the IT people is not a minor undertaking. It’s not just conservatism; there are significant up-front and ongoing costs to IT for each platform type it has to support.
In the cloud, a much narrower set of skills can spin up a Hadoop cluster almost on demand. If you’ve done it once or twice before, you can set up a 30 node cluster in an afternoon. Not just the machines; I’m talking about going from a standing start to executing Hive queries.
Particularly for pilots and smaller clusters, the ease of deployment and the savings in opportunity cost can outweigh all other considerations, especially in the startup world where days may count.
Finding Experts (Pro and Con)
You can get handy with the cloud pretty fast, so there are lots of experts. On the other hand, the mix of Hadoop and AWS is so complex, with so many variables, that relatively few of the experts actually know much.
The VM Model (Con)
Hadoop was developed for deployment over Linux running on bare metal. Cloud deployment implies virtual machines, and for Hadoop it’s a huge difference.
As detailed in other articles (for instance, Your Cluster Is an Appliance or Understanding Hadoop Hardware Requirements), bare-metal deployments have an inherent advantage over virtual machine deployments. The biggest of these is that they can use direct attached storage, i.e., local disks.
Not every Hadoop workload is storage I/O bound, but most are, and even when Hadoop seems to be CPU bound, much of the CPU activity is often either directly in service of I/O, i.e., marshaling, unmarshaling, compression, etc., or in service of avoiding I/O, i.e., building in-memory tables for map-side joins.
In the old days, NAS was almost unusable for Hadoop, but modern virtualization platforms do better. AWS, in particular, does an amazing job of mitigating one of the biggest shortcomings of NAS by providing dedicated bandwidth from instance to EBS on alternate networks. The pipes, however, are still limited in size and fall far short of the bandwidth achievable across the server bus.
The second big disadvantage of VM’s (for Hadoop) is that they abstract away many of the details that would be used for optimizations in a terrestrial cluster. Hadoop is designed to take advantage of the predictability of a block-oriented workload to avoid paging and GC delays, keep pipelines and caches full, TLB buffers from flushing, etc. Shared virtual environments obviate much of this under the normally reasonable assumption that the resources are grossly under-used anyway, which unfortunately does not apply to Hadoop.
Sister Applications (Pro)
Even if Hadoop is marginal in the cloud in your particular case, the cloud is often a superb environment for many of the other applications that probably go with it. For many of these applications, it has easy, almost unlimited scalability and cloud storage models that have excellent properties for those applications (i.e., S3.) Having some services in the cloud and some on the ground can have large costs for data transfer as well as being a performance hit.
Ancillary Services (Pro)
Site security, network security, transparent encryption, backup to other regions, cool and cold storage modes, communication among geographically remote VPCs, networking among VPCs, seamless extension to a physical data center, and innumerable similar capabilities would be hard to match at any price in a local data center.
Memory, Machine, and Rack Awareness (Con)
There are a number of other ways than I/O in which Hadoop is tuned for conventional deployments on bare metal. Rack awareness is one. Hadoop considers the racks that nodes reside on both when reading and when writing/replicating data in order to take advantage the top-of-rack (TOR) switch in preference to the globally shared LAN.
Likewise, Hadoop will try to take advantage knowledge of what drive a block lives on so that it can read or write data locally, avoiding even the TOR switch. This concept does not exist with a NAS.
Likewise, in a conventional deployment, the number of mappers and the size of the containers are chosen on the assumption that very little other than Hadoop will be happening on the node. This keeps the data pipelines flowing and minimizes cache and TLB misses, page faults, etc. None of this works smoothly with noisy neighbors. You can’t do this on shared hardware.
S3 (Pro and Con)
One thing about S3 that often comes as a surprise to Hadoop users is that S3 does not underlie HDFS, in the way that SATA, SAS, SSD, or even NAS storage does. S3 is a drop-in replacement for HDFS. (You can use S3 and HDFS side by side, of course.)
S3 isn’t really a filesystem at all. It is a sharded key-value data store that is accessed via a REST API. It is optimized for fast lookup of random data items of medium size. What looks like path names in S3 is actually just a naming convention applied to keys that are in reality just strings. The apparently hierarchical naming does not correspond to the underlying storage in the sense that it does in a file system. There are no chains of I-nodes.
S3 can do far more I/O operations per second than a disk can, which makes it an excellent backing store for say, large web site serving pages, but it is not built for streaming data. Even if it were, as an HTTP service, an application interacts with it over the LAN that is shared with all the instances, traffic to the Internet, etc. That alone would cap it at a modest overall rate cluster-wide.
One consequence of its REST-service nature is that it doesn’t mount on any particular machine. Your entire cluster sees the same buckets. This is good and bad, because, while it makes S3 data’s lifespan independent of any particular node, the cluster, or event the VPC, inserted data is not guaranteed to be instantly consistent to all users. A node can write data that another user won’t see for an indeterminate amount of time.
The maximum bandwidth an individual reader can get from S3 is tiny by Hadoop standards—a few megabytes per second. You can have lots of readers, but it’s still very limited for streaming from the point of view of any particular reader.
We touched on costs above. The good news is that the raw cost per GB is about 1/3 that of EBS. The better news is that there is no HDFS replication for S3 data, so the crude cost is more like 1/9 that of EBS. One other peculiarity is that S3 never loses data. You can do something stupid and lose it yourself, but the service is reported to have never lost data due to an internal failure in the entire history of the product.
Unfortunately, while S3 is great for storing massive amounts of data that will rarely be accessed, the severe bandwidth limitations make it a poor fit for the typical Hadoop job at scale or for low-latency queries. It tends to get used for landing data for ETL, and for long term storage.
The other thing about cost is that while it looks great compared to EBS, EBS is about 100x more expensive than data center disks. So, contrary to what one hears, S3 (even at 1/9 the cost of EBS) is actually rather expensive compared to disk. Also, while it’s free to write into it from the world, you pay to read it if you aren’t in the cloud. The price to read is currently around $50/TB.
Data Formats (Con)
Some of the most powerful optimizations in the Hadoop universe are built around the optimized row-column format, aka ORC files, and ORC plays relatively poorly with S3. It works better with HDFS over EBS, but not as well as over a native file system and non-virtual machines because of the many low-level optimizations that go with it. Parquet is said to work better in the cloud. Apache and Hortonworks are hard at work on improving the performance of ORC in the cloud, particularly over S3.
Limits on Node EBS-I/O Capacity (Con)
As mentioned above, price limits the amount of EBS one can use. Beyond that, nodes cannot mount an infinite number of EBS volumes. The number varies with the OS, but figure eight volumes for RHEL.
Also, as discussed above, each node also has a limit on network bandwidth it can achieve with EBS. To be fair, the limit is also a guarantee.
Not Paying for Idle Time (Pro)
This is one of the big selling points for AWS. The cost arguments above implicitly assume you’re using the VM’s around the clock 365 days per year, but not many computers are used that way in the typical enterprise. In theory, you can reduce the number of compute nodes you use seasonally, on weekends, or even daily for the after-work hours.
Not Paying for Idle Time (Con)
This strategy is probably only practical if your data is backed by S3, where the data is not local to any particular node. Such nodes have minimal state and can be shut down and spun up again easily.
For HDFS, assuming that in most use-cases you want to keep the all the data, there will usually be far too much data to move around if you are removing a significant number of nodes. (See A Question of Balance.) In a nutshell, keeping enough free space on remaining nodes so that you can simply move the data from the deleted nodes defeats the purpose of the exercise because the volumes cost more than the instances.
Schemes to shuffle the data from the decommissioned nodes to the remaining nodes with minimal extra space work logically, but incur even more data movement, which ferociously consumes the two most precious and limiting resources in the cluster: I/O bandwidth to/from EBS and I/O bandwidth on the LAN that connects the instances.
For some use cases, it is possible to keep most of the data on S3 after ETL and move just the working set down to HDFS for a limited time. This can work, but the transfer times are large and may cut in on cluster performance and/or availability.
There are other possible strategies, but it is a rare scenario in which it makes economic sense to do this.
One great way to get started is, rather than fiddling around with tuning your own cluster, try your ideas out on Amazon Elastic Map Reduce (EMR). This will be engineered for you about as well as it could be, so you can get a good feel for what the real resource requirements are, how S3 and EBS compare, etc.
What I have seen in practice follows. None of this is authoritative, but I think my experience is consistent with what a lot of people report for modest size clusters (i.e., less than 100 nodes.) You may find some of it hard to reconcile with many of the glowing reports you’ll find on the Web, but please remember that these often concern the extremes: either test clusters with unrealistically few nodes or mega-monsters like Netflix where it’s so much more complex than vanilla Hadoop that it may not really apply to you.
- It is unbelievably easy to deploy clusters in the cloud. With a little experience (like the second time you do it) even a non-admin can deploy a basic Hadoop cluster in a couple of hours.
- Ambari makes the Hadoop part easy.
- You can take advantage of blueprints to make repeated deployments easier.
- You can completely automate the larger process with Ansible to make the entire infrastructure deployment a cookie cutter operation (but it’s a lot of work if you aren’t going to do it many times).
- Even with more advanced enterprise security such as Kerberos integration, and more advanced networking, deploying is still not difficult for a system administrator.
- There is no up-front cost to a cluster. It is pay-as-you-go (unless you make special deals).
- A host of related applications, both Apache and otherwise, are quite cloud-friendly and in many organizations may already be in the cloud, which is natural for integration.
- No doubt about it, S3 query performance is very poor and HDFS over EBS isn’t great. You may be able to mitigate this somewhat, depending on what you are doing and what your demands are, but there is no way that an N-node EBS-backed cluster will compete with N-nodes on the ground. Not even close.
- Whatever the number of nodes it takes to reach your desired performance level, they will be expensive per node if you plan to run the cluster indefinitely.
- Throughput can be adequate for batch, ETL, etc. — yet the cluster may still show latencies that are a disappointment for interactive users. (Down the road, LLAP may mitigate much of this problem for many users.)
- Storage cost in the cloud comes as a big surprise for most users. It will probably be EBS, rather than the nodes that are the dominant cost. Even S3 isn’t as cheap as it’s cracked up to be. Storage cost really adds up.
- Mitigating storage costs by shifting data around has a significant cost in network and EBS bandwidth. It’s a partial solution at best.
- Getting data into S3 is free, but getting it out again costs money — about $50/TB — and takes a lot of time. Big users may find themselves effectively locked in by the sheer time it takes to move data out.
For startup companies and new projects, you can’t beat the low barrier to entry. For POCs and experimenting, it’s a dream. Per hour, a good sized cluster can be had for a lot less than you’d pay an architect to noodle around, so a lot of arguments can be settled by setting up a cluster and trying it.
The multi-dimensional nature of the price/performance comparison makes it hard define equivalent clusters exactly, but by any reasonable measure, for a given amount of compute power, a large Hadoop deployment the cloud is very expensive.
- The cycles cost significantly more.
- The data costs many times more and you need more compute cycles per byte.
- Getting data to and from applications that are not in the cloud is often a major bottleneck and it costs money.
- Any savings are going to be in things like time-to-deploy, administrative costs, whether the money comes out of capital or operating expenses, etc., and not in the cost of the processing power and certainly not in the cost of data storage.
Hadoop in the cloud tends to perform better for traditional Hadoop workloads than for interactive responsiveness, where the latency and slower transfer rates slow down job turnaround. This is bad for internal politics because the most visible processing is probably going to be the area where it is least impressive.
Cloud clusters struggle with performance more than data center clusters. I can’t prove this is true for everyone, but it’s what I’ve seen. It takes more skill to use the cloud effectively — not just back-end expertise, either. Your ordinary user programs may need significantly more attention from experts to get the performance that the cluster is capable of.
Bottom line: for startups, new lines of business, temporary clusters, etc., the cloud can’t be beat — but anyone deploying larger clusters (fifty data center node equivalent and up) that will be around a while should give the cost in the cloud a long hard look before committing. Be particularly careful about anticipated data storage needs.