Ensuring SQL Server High Availability in the Cloud

DZone 's Guide to

Ensuring SQL Server High Availability in the Cloud

High-performance cloud file shares, disaster recovery in cloud databases, and the high availability numbers that matter in SQL Server deployments

· Database Zone ·
Free Resource

Theoretically, the cloud seems tailor-made for ensuring high availability (HA) and disaster recovery (DR) solutions in mission critical SQL Server deployments. Azure, AWS, and Google have distributed, state-of-the-art data centers throughout the world. They offer a variety of SLAs that can guarantee virtual machine (VM) availability levels of 99.95% and higher.

But deploying SQL Server for HA or DR has always posed a challenge that goes beyond geographic dispersion of data centers and deep levels of hardware redundancy. Configuring your SQL Server for HA or DR involves building a Windows Server Failover Cluster (WSFC) that ensures not only the availability of different machines running SQL Server itself but also — and most importantly — the availability of storage holding the data in which SQL Server is interacting.

In a traditional WSFC, data may reside in a storage area network (SAN) or an SMB3 share accessible to all server nodes in the WSFC. But cloud storage cannot be shared in the same way as a traditional SAN. That limitation has forced organizations to use either a third-party approach that can overcome the shared storage constraints in Windows or embrace one of the two Windows-native approaches that work around the challenge of shared storage in different ways.

However, there are murmurs of a new storage trend in the cloud: emerging support for high-performance cloud file shares, which act in a manner similar to traditional on-prem file shares. The real question, though, is whether a cloud file share option, even a high performance one, presents a better path to ensuring HA and DR in the cloud.

HA and DS: The Numbers That Really Matter

When we talk about HA, we’re talking about guaranteed availability of 99.99% or greater, and despite the data center build-outs described above, that’s still a hard number to guarantee. It’s also a number that you have to consider carefully.

You can cluster two or more VMs in separate racks in an Azure data center (in what’s called an “availability set”), which you’ll be guaranteed that at least one of those VMs will be available at least 99.95% of the time. Alternatively, Azure and AWS enable you to cluster VMs across multiple data centers (i.e. “availability zones”), which will get you an SLA guaranteeing the availability of at least one of the VMs at least 99.99% of the time.

However, those SLAs only guarantee the availability of a VM. They don’t guarantee the availability of SQL Server or your SQL Server data. If a primary VM goes offline and your cluster fails over to a secondary VM, your secondary VM needs to be running SQL Server and needs to be able to access the underlying database files for your operations to continue with minimal interruption. Ensuring the availability of SQL Server and the underlying data requires further configuration.

This is an excerpt of DZone's 2020 Cloud Database Trend Report.
Read the report for more:

Get Serve(r)d 

Configuring for Software Availability in the Cloud

So how do you ensure the high availability of SQL Server and your data in the cloud? If you want to rely on services that are native to Windows Server (as opposed to using a third-party product), you’ve only had two options to date. You can use Storage Spaces Direct (available in Windows Server Enterprise Edition 2016 and later) or you can create an AlwaysOn Availability Group (available in SQL Server 2012 Enterprise Edition and later).

Both approaches have advantages and disadvantages. Storage Spaces Direct (S2D) creates a virtual storage area network (SAN) in software that can be accessed by any of the VMs in a WSFC. This sounds like a cloud-based version of a traditional failover cluster. But because this cluster must be configured as an Availability Set, all the VMs in the cluster reside in the same data center.

If the entire data center goes offline because of a disruptive event, then all of your VMs and all of your data will go offline with it. That’s why you won’t ever get more than a 99.95% availability guarantee with an S2D configuration.

The single data center requirement of S2D also eliminates the possibility of deploying a highly available SQL Server FCI that spans Availability Zones in Azure, AWS, or even Google Cloud Platform’s “Zones,” which can also span multiple data centers.

Table 1: Advantages vs. Disadvantages of Windows Server Options

Advantages Disadvantages
Storage Spaces Direct (S2D)
  • Virtual storage area network (SAN)
  • Can be accessed by any of the VMs in a WSFC
  • VMs reside in the same data center
  • Data goes offline with disruptive event
  • Only a 99.95% availability guarantee
  • Cannot deploy a highly available SQL Server FCI that spans Availability Zones
AlwaysOn Availability Group (AG)
  • Can support AG replicas among geographically distinct data centers
  • Provides services that automatically synchronize SQL Server data amongst replicas
  • Can reach up to 99.99% availability
  • None of the databases are protected in case of failure
  • Cannot replicate key system databases
  • AG has not been tested beyond 100 SQL Server databases or 10 AGs

In contrast, an AlwaysOn Availability Group (AG) can support AG replicas among geographically distinct data centers, and when configured with replicas residing in different data centers, the SLA associated with an AG climbs to 99.99%. An AG configuration doesn’t rely on shared storage the same way as a SQL Server FCI.

Instead, it provides services that automatically synchronize SQL Server data amongst the replicas in the AG. If the active SQL Server instance fails, the designated replica server takes over and begins hosting databases that have been replicated to it. But this approach also has its shortcomings.

AGs replicate user-defined databases, but not key system databases — including the Master and MSDB which hold things like agent jobs, logins, and passwords. If there’s a failure that takes the primary instance of SQL Server offline, none of these databases are protected. It’s also worth noting that Microsoft has not tested AlwaysOn Availability Groups beyond 100 SQL Server databases or 10 AGs, which may pose other constraints if you want to protect a large number of databases.

Enter a New Generation of Shareable Cloud Storage

For all of the reasons cited above, the news of emerging high-performance, cloud-based file shares sounds intriguing. Will they enable organizations seeking cloud-based HA and DR solutions to move beyond the constraints that Windows Server has imposed?

In the long term, the answer may be “yes.” AWS indicates that organizations can use Amazon FSx to configure a WSFC today. All of the nodes in the WSFC would have access to the file share, so if the primary node goes offline and the cluster fails over to a secondary node — even in another data center — that secondary node could continue working with the SQL Server data stored in the Amazon FSx file share.

Similarly, Azure indicates that organizations can configure SQL Server in an FCI using an Azure premium file share. These descriptions suggest that the elegance of the failover cluster we have always known on-prem has finally made it into the cloud.

In the near term, the answer remains “no.” Today’s cloud-based shared file offerings have at least one major flaw: The underlying SLAs of the cloud file share services themselves guarantee only a 99.9% availability for read and write operations, which is far lower than the 99.99% SLA that constitutes the baseline for high availability.

And there are other issues, though these may simply reflect growing pains, as noted in the table below.

Table 2: Issues in Cloud-Based Shared File Offerings and Potential Outcomes

Potential Outcomes
Microsoft Using Azure premium files in “input- or output-intensive workloads” is discouraged. Their utility in your use case may be limited.
File share can reside in only one data center (in all but two regions in Azure). If the data center goes offline, your cluster goes offline with it.
AWS There is no mechanism in place to ensure the active SQL Server instance and the file share always reside in the same AZ. The absence of such mechanism could introduce latencies that would compromise the performance of your SQL Server system.
Failover of the file share from one AWS region to another is documented to take about 30 seconds. During this time, your SQL Server FCI will go offline and may require manual intervention to bring back online.

Advances in the delivery of cloud-based file shares are probable, and one day, all cloud service providers may be able to offer SLAs that guarantee the availability of the underlying SQL Server data at a 99.99% or higher. Then, cloud-based file shares will become a viable alternative to S2D, AGs, and other third-party options that we’ve relied on to date. When will that happen? We shall see.

This is an excerpt of DZone's 2020 Cloud Database Trend Report.
More on the state of cloud DBMSs, data security, and database migration:

Read the Report

cloud data stores ,cloud database ,data safety ,data security ,data security breach ,database security ,database trends ,disaster recovery ,high availability

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}