I recently gave a webinar with Forrester analyst Richard Fichera. In his opening, he noted “the inexorable march to public cloud is underway.” And he’s right. We see it everywhere. But “march” is a critical word. It’s not a leap, and it won’t happen overnight. Along the journey, many enterprises will implement hybrid cloud first.
As such, we see hybrid cloud increasing in popularity across the industry. For example, hybrid cloud adoption tripled in 2016, climbing to 57% from 19% of organizations surveyed, according to an Intel Security report.
Yet as hybrid cloud adoption widens, I’ve noticed that often there is still confusion on what it means for different parts of the IT infrastructure stack. So I thought it would be useful to walk through what we mean by hybrid cloud as it relates to storage, and software-defined storage (SDS) in particular.
Basically, when it comes to SDS products, there are two scenarios:
- Scenario #1: Tiering data to the public cloud. Here your data lives on commodity servers in your private data center, and the storage software saves a copy to the public cloud, from which it can be restored if necessary.
- Scenario #2: Stretching clusters to the public cloud. Here you run instances of storage software directly in a public cloud, and then add those as additional, active nodes to your cluster running on commodity servers in a private data center.
Let’s discuss the pros and cons of each.
Hybrid Cloud Storage as a Low-Cost Tiering Strategy
The first scenario is where, in fact, the bulk of the market seems to be today. Essentially, there is a “southbound” object storage interface whereby the storage software puts objects into a public cloud service such as Amazon S3 or Amazon Glacier. In this instance, the storage solution acts like a cloud gateway and can get and put data in a public cloud storage “bucket.” The pros of this approach are that it’s simple, extensible, and cost effective. It eliminates the need for a separate cloud gateway solution and treats the cloud like a cold storage tier. Think of it as the modern equivalent of tapes.
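To make the tiering flow concrete, here is a minimal Python sketch of the gateway pattern described above. It is a toy model, not any vendor's implementation: plain dictionaries stand in for on-premises storage and the cloud bucket, and the 90-day threshold is an assumption for illustration. A real gateway would call an object-store API (an S3-style put/get) where the dictionary assignments appear below.

```python
from datetime import datetime, timedelta

COLD_AGE = timedelta(days=90)  # hypothetical tiering threshold

class TieringGateway:
    """Toy sketch of a storage gateway that treats the public cloud
    as a cold tier. Dicts stand in for the hot tier and the bucket."""

    def __init__(self):
        self.local = {}   # hot tier: on-premises storage (data, last_access)
        self.cloud = {}   # cold tier: public cloud bucket

    def write(self, key, data):
        self.local[key] = (data, datetime.now())

    def tier_cold_objects(self, now=None):
        """Move objects not touched within COLD_AGE out to the cloud tier."""
        now = now or datetime.now()
        for key in list(self.local):
            data, last_access = self.local[key]
            if now - last_access > COLD_AGE:
                self.cloud[key] = data   # "put" to the bucket
                del self.local[key]

    def read(self, key):
        """Tiered data is passive: reading it requires a restore first."""
        if key not in self.local:
            # recall from the cloud: slow, and it incurs retrieval fees
            self.local[key] = (self.cloud[key], datetime.now())
        data, _ = self.local[key]
        self.local[key] = (data, datetime.now())
        return data
```

The key behavior to notice is in `read`: an object that has been tiered must be copied back before the application can touch it, which is exactly the passive-data property discussed next.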
However, the cons of this scenario are a bit more complicated.
Active vs. Passive Data
In this scenario, the data that you’ve stored in the cloud is not “active.” Because you’re storing just a copy of that data in the cloud, to do anything with it, you need to restore it first. In effect, the problem you’re solving here is one of disaster recovery (DR). Certainly, this makes sense in backup and archive use cases where you want multiple copies of that data should your data center go down.
As a result, getting the data back so you can do something with it (e.g., run analytics to surface actionable insights) can be time-consuming and will almost certainly incur cloud fees. Depending on the cloud provider you’re using, there are considerable retrieval time and cost considerations associated with this data recall.
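As a rough way to reason about those recall trade-offs, here is a small helper that estimates retrieval cost and transfer time. All rates are caller-supplied placeholders, not real provider pricing, which varies by provider, storage class, and retrieval option.

```python
def recall_estimate(size_gb, egress_per_gb, retrieval_per_gb, throughput_mbps):
    """Rough cost and time to pull tiered data back on-premises.

    All rates are placeholders supplied by the caller; real numbers
    depend on the provider, storage class, and retrieval option.
    """
    cost = size_gb * (egress_per_gb + retrieval_per_gb)        # dollars
    hours = (size_gb * 8 * 1024) / (throughput_mbps * 3600)    # GB -> Mb, then Mbps
    return cost, hours
```

For example, recalling 1 TB at hypothetical rates of $0.09/GB egress, $0.01/GB retrieval, and 500 Mbps of sustained throughput works out to about $100 and roughly four and a half hours of transfer time alone, before any restore-request latency.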
The vast majority of storage solutions boasting hybrid cloud capabilities fall into this first scenario. Essentially, it’s attaching to the cloud as though the cloud were another storage tier inside your on-premises storage solution, usually along the lines of tiering from flash to spinning disk and then to the public cloud.
There’s nothing wrong with the approach in this scenario, but it’s not how we think about hybrid cloud storage at Hedvig.
Hybrid Cloud Storage as a High Availability Strategy
Using us at Hedvig as an example, when we talk about hybrid cloud storage, what we mean is that you actually run a full-fledged instance of the Hedvig Distributed Storage Platform in the public cloud. The difference, in our case, is that Hedvig runs on the compute side of the public cloud, such as Amazon EC2. We take these cloud servers running our software, virtualize and aggregate disk capacity (from a service like Amazon EBS, for example), and join them as active nodes in a hybrid cloud cluster. Here the data is active — it’s fully available to whatever application needs it and can be accessed locally in that public cloud.
Because the data is fully available, you can treat the public cloud as a full data center environment. You can fail applications over to that public cloud, burst to that public cloud, or create a multi-tier application with part of the app and its data running in that public cloud. In this scenario, the public cloud is not just a storage repository, but rather an active part of your overall IT stack. Moreover, because the application is pulling data from our software, which is running in the public cloud, it can exert more data efficiency control using various deduplication, compression, and caching capabilities inside our platform that boost performance. Not only does this dramatically improve availability — in fact, you can think of it as shrinking recovery time objectives (RTO) and recovery point objectives (RPO) to zero — but it enables you to easily migrate to and from public clouds to avoid lock-in and control costs.
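The difference between passive and active data can be sketched in a few lines. This toy model (node and site names are invented for illustration, and it is not how any particular product is implemented) replicates each write synchronously to both sites, so reads are served locally in the cloud and survive the loss of the on-premises site, with no restore step required.

```python
class StretchedCluster:
    """Toy sketch of an active hybrid cluster: every write is replicated
    to at least one node in each site, so either site can serve reads
    locally and survive the other site's outage."""

    def __init__(self, sites):
        # sites maps a site name to its node names, e.g.
        # {"on_prem": ["n1", "n2"], "cloud": ["ec2-a"]}
        self.sites = sites
        self.nodes = {n: {} for members in sites.values() for n in members}

    def write(self, key, data):
        # synchronous replication: land the write in every site
        for members in self.sites.values():
            self.nodes[members[0]][key] = data

    def read(self, key, from_site):
        # served from a node in the caller's own site; no cross-site recall
        for node in self.sites[from_site]:
            if key in self.nodes[node]:
                return self.nodes[node][key]
        raise KeyError(key)

    def fail_site(self, site):
        # simulate losing every node in one site
        for node in self.sites[site]:
            self.nodes[node] = {}
```

Contrast this with the tiering gateway in scenario one: here a cloud-side application reads its data immediately, and taking the on-premises site offline leaves the cloud copy serving reads, which is the sense in which RTO and RPO shrink toward zero.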
But this approach may be more expensive because you must now permanently provision some compute capacity at the public cloud provider to run our software. At the same time, however, this additional cost is offset by avoiding the build-out of additional data center capacity (e.g., a dedicated DR site), and it eliminates some network traffic charges because our software can be very intelligent about data placement and the volume of data transmitted.
Why You’d Want “Active” Hybrid Cloud Storage
Let me give you two “for instances.”
For the first instance, let’s pick on Docker. Docker has quickly become a universal packaging (dare I say “virtualization”) technology for modern apps. And unlike hypervisors, it’s a de facto standard. You can be assured that you can run your Dockerized app anywhere, private or public, since everyone runs the Docker Engine. Now, suppose you’re running a stateful Docker application (e.g., one backed by a database like MongoDB or MySQL) in your private data center, and you also want to run the same Docker app in a public cloud. You wouldn’t easily be able to move a container between those two sites. Docker is ephemeral by nature, and although Docker volume plugins have solved shared storage, they haven’t necessarily solved the problem of moving across cloud boundaries, where that shared storage is no longer locally accessible.
For the second instance, let’s assume you now want to run an application where part of it runs on-premises and part in the public cloud. We already have a number of customers using the Hedvig Distributed Storage Platform in this way. One is a large online retailer that runs its e-commerce site entirely in the public cloud, but it has become prohibitively expensive to continue doing so. This retailer is pulling most of the website, including all of the databases and middleware tiers, back on premises. At the same time, the customer is leaving a portion of its public website running in the public cloud and uses our software as an abstracted data layer in between that allows it to take advantage of a public cloud like AWS. It can burst out to meet peak demand while all the heavy-lifting storage is done in the private data center. For companies running at moderate to large scale in public clouds like AWS, this saves money without sacrificing a highly available and elastic architecture.
Let Your Workloads Determine Your Architecture
To sum up: scenario one is basically cheap, offsite storage, while scenario two is a highly available architecture that could, for instance, survive a data center or cloud outage. The two are fundamentally different designs, and unfortunately there is no standard definition of what hybrid cloud storage means.
To determine which approach is best for you, simply ask yourself: “Am I looking for a cost-effective tier for colder data, or am I trying to create a highly available app that avoids or even eliminates downtime?” Your answer will determine whether you’re interested in passive or active hybrid cloud storage.