Popular clouds like Amazon AWS and Microsoft Azure are console-based orchestrators, enabling people to spin up and administer the infrastructure they need themselves. They also come with a variety of features and add-ons, making the end solution very attractive.
Quite often, due to their scalability, these clouds appear to be a massive computing resource whose performance limits are hard to reach.
One common performance issue people face relates to disk or storage performance.
During various tests, AWS and Azure have been seen, for example, doing thousands of IOPS and hundreds of MBps of disk throughput at low latency. So one might expect these environments to be the best place to run high-performance virtual servers such as SQL servers, which generally demand high IOPS and throughput at low latency.
The Storage Story: Capacity and IO, These Two Are “Not Very Good Friends”
When it comes to storage, there will always be a concern that IO will be exhausted before space. From a business perspective, this will result in a waste of resources. Due to the automation introduced by the cloud consoles and the power the customers have to provision at will, if left without rules and caps, the entire environment will become oversubscribed with degraded performance as a result, leaving cloud providers struggling to deliver the IO without adding more unused capacity.
It is either the IOPS count or the throughput that kills the storage. If the storage is network based rather than local, throughput also affects the network: as throughput increases and bursts, fast switches with big buffers are required.
How Far Can You Actually Go Into Popular Clouds Like Amazon and Azure?
In the case of Azure, the Premium Storage documentation states the following:
with Premium Storage, your applications can have up to 64 TB of storage per VM and achieve 80,000 IOPS (input/output operations per second) per VM and 2000 MB per second disk throughput per VM with extremely low latencies for read operations
That means, a P10 Premium Storage disk attached to this VM can only go up to 32 MB per second but not up to 100 MB per second that the P10 disk can provide. Similarly, a STANDARD_DS13 VM can go up to 256 MB per second across all disks. Currently, the largest VM on DS-series is STANDARD_DS14 and it can provide up to 512 MB per second across all disks. The largest VM on GS-series is STANDARD_GS5 and it can give up to 2000 MB per second across all disks.
Cache-hits are not limited by the allocated IOPS/Throughput of the disk. That is, when you use a data disk with a Read-Only cache setting on a DS-series VM or GS-series VM, Reads that are served from the cache are not subject to Premium Storage disk limits. Hence you could get very high throughput from a disk if the workload is predominantly Reads. Note, that cache is subject to separate IOPS/Throughput limits at VM level based on the VM size. DS-series VMs have roughly 4000 IOPS and 33 MB/sec per core for cache and local SSD IOs.
So Azure actually places IO limits based on disk size and also conditions them on VM size (for example, per-core limits for cache hits).
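The way the two limits interact can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the figures are the ones quoted above (a P10 disk at 100 MB per second, and VM-level caps of 32 and 256 MB per second). The assumption is that a single disk can never exceed the lower of its own limit and the VM-level limit.

```python
def effective_throughput_mb_s(disk_limit_mb_s, vm_limit_mb_s):
    """Return the throughput one disk can reach: the lower of the two caps.

    Hypothetical helper for illustration; not part of any Azure tooling.
    """
    return min(disk_limit_mb_s, vm_limit_mb_s)


P10_DISK_LIMIT_MB_S = 100  # per-disk limit quoted above

# A VM capped at 32 MB per second holds the P10 disk below its own limit.
print(effective_throughput_mb_s(P10_DISK_LIMIT_MB_S, 32))   # 32
# A VM capped at 256 MB per second lets the disk reach its full 100.
print(effective_throughput_mb_s(P10_DISK_LIMIT_MB_S, 256))  # 100
```

In other words, a fast disk on a small VM is still a slow disk.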
If you continue down the page, you will see that single-disk performance is in fact much lower, especially for small disks, because throughput is limited even on performance SSDs. Things become somewhat complicated, especially if your priority is just having your application up and running in the cloud.
For example, a disk of size 100 GiB is classified as a P10 option and can perform up to 500 IO units per second, and with up to 100 MB per second throughput. Similarly, a disk of size 400 GiB is classified as a P20 option, and can perform up to 2300 IO units per second and up to 150 MB per second throughput.
The input/output (I/O) unit size is 256 KB. If the data being transferred is less than 256 KB, it is considered a single I/O unit. The larger I/O sizes are counted as multiple I/Os of size 256 KB. For example, 1100 KB I/O is counted as five I/O units.
Azure associates an IOPS with a 256 KB block size, which is good as this is a rather large block IO, so if your SQL server does 64 KB block IO, each IO counts as a single unit and your IOPS limit is not divided.
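The counting rule is easy to check with a small sketch. The helper name below is hypothetical; the 256 KB unit size and the 1100 KB example come from the text above.

```python
import math

IO_UNIT_KB = 256  # I/O unit size quoted above


def io_units(io_size_kb):
    """Number of I/O units charged for one I/O of the given size.

    Anything up to 256 KB counts as one unit; larger I/Os are
    rounded up to whole 256 KB units. Hypothetical helper.
    """
    return max(1, math.ceil(io_size_kb / IO_UNIT_KB))


print(io_units(64))    # 1: a 64 KB SQL I/O costs a single unit
print(io_units(1100))  # 5: matches the example in the text
```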
What About AWS?
The situation is somewhat similar with Amazon, if we look at their stats. Amazon appears to be better equipped to combine performance and space, compared with Azure.
General Purpose (SSD) volumes have a throughput limit range of 128 MiB/s for volumes less than or equal to 170 GiB; for volumes over 170 GiB, this limit increases at the rate of 768 KiB/s per GiB to a maximum of 160 MiB/s (at 214 GiB and larger).
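As a rough sketch, the General Purpose (SSD) throughput rule above can be written as a clamped linear function. The function name is hypothetical, and the reading of the rule is an assumption: the limit is taken as volume size times 768 KiB/s, bounded below by 128 MiB/s and above by 160 MiB/s, which is the interpretation consistent with the 170 GiB and 214 GiB break points quoted.

```python
def gp2_throughput_limit_mib_s(size_gib):
    """Approximate throughput limit (MiB/s) for a General Purpose (SSD) volume.

    Assumes the limit scales as size * 768 KiB/s, clamped to the
    128 to 160 MiB/s range described in the text. Hypothetical helper.
    """
    scaled_mib_s = size_gib * 768 / 1024
    return min(160.0, max(128.0, scaled_mib_s))


for size in (100, 200, 300):
    print(size, "GiB ->", gp2_throughput_limit_mib_s(size), "MiB/s")
```

Under this reading, a 300 GiB volume is already at the 160 MiB/s ceiling.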
A Provisioned IOPS (SSD) volume can range in size from 4 GiB to 16 TiB and you can provision up to 20,000 IOPS per volume. The ratio of IOPS provisioned to the volume size requested can be a maximum of 30; for example, a volume with 3,000 IOPS must be at least 100 GiB.
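The 30:1 ratio means that IOPS effectively dictate a minimum volume size. A minimal sketch, using only the limits quoted above (the constant and function names are hypothetical):

```python
MAX_IOPS_PER_GIB = 30     # maximum ratio of provisioned IOPS to size
MIN_SIZE_GIB = 4          # smallest Provisioned IOPS (SSD) volume
MAX_IOPS_PER_VOLUME = 20_000


def min_volume_size_gib(iops):
    """Smallest volume size (GiB) that can carry the requested IOPS."""
    if not 0 < iops <= MAX_IOPS_PER_VOLUME:
        raise ValueError("IOPS must be between 1 and 20,000 per volume")
    # Ceiling division without floats: -(-a // b)
    return max(MIN_SIZE_GIB, -(-iops // MAX_IOPS_PER_GIB))


print(min_volume_size_gib(3_000))   # 100, as in the example above
print(min_volume_size_gib(20_000))  # 667
```

So the maximum IOPS figure is only available on a volume of roughly 667 GiB or more, whether or not that space is needed.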
Magnetic volumes provide the lowest cost per gigabyte of all EBS volume types and these volumes deliver approximately 100 IOPS on average, with burst capability of up to hundreds of IOPS, and they can range in size from 1 GiB to 1 TiB.
You can stripe multiple volumes together in a RAID configuration for larger size and greater performance.
Provisioned IOPS (SSD) volumes have a throughput limit range of 256 KiB for each IOPS provisioned, up to a maximum of 320 MiB/s (at 1,280 IOPS).
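The relationship between provisioned IOPS and throughput can be sketched the same way. The function name is hypothetical; the 256 KiB per IOPS and the 320 MiB/s ceiling are the figures quoted above.

```python
def piops_throughput_limit_mib_s(provisioned_iops):
    """Approximate throughput limit (MiB/s) for a Provisioned IOPS (SSD) volume.

    256 KiB per provisioned IOPS, capped at 320 MiB/s. Hypothetical helper.
    """
    return min(320.0, provisioned_iops * 256 / 1024)


print(piops_throughput_limit_mib_s(400))     # 100.0
print(piops_throughput_limit_mib_s(1_280))   # 320.0, the ceiling is reached here
print(piops_throughput_limit_mib_s(20_000))  # 320.0, more IOPS add no throughput
```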
So with Amazon you get somewhat better performance per disk size.
What Happens When Customers Want More IO From Less Provisioned Storage Capacity and Certain VM Specs?
Our cloud has standard caps as well. For example, we soft-cap at 1000 or 2000 IOPS per 100 GB disk for regular VPSs. IO limits also depend on the number of cores and on vRAM subscriptions.
We work with our customers to help them configure and achieve the needed performance levels for their virtual servers and application(s). Our goal is to provide the customer the best possible experience and service, so that we retain the customer for the longest period and help them grow.
Let’s take a look.
Below we have two different SQL virtual servers of what we call medium performance (8 cores, 12 GB RAM), each with a rather small data disk of approximately 300 GB. These SQL servers are read-heavy and hungry for IO throughput, with 64 KB block IO. They look somewhat similar, with a small difference: one is harder on throughput and the other somewhat harder on IOPS.
Both VMs are capped.
But in this case, the capping was done by working with the customer to ensure that their VMs do not bottleneck and that they do not exceed the budget by oversizing them. We do not throttle bursts by time, and the VMs are able to max out at their configured limit for as long as needed.
Also, the cap is global, so it is predictable: when the customer has random IO, the same high cap applies, regardless of how the application accesses its data. On Azure, by comparison, cache hits have higher caps than IO going to disk.
We also analyzed the IO block size pattern of the customer and set the IOPS cap based on this, rather than on a generic IO block size applying to all VPS.
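To see why the block size matters when setting an IOPS cap, recall that throughput is simply IOPS multiplied by block size. The sketch below is an illustration only, not the actual tooling used: the function names are hypothetical, and the 2000 IOPS and 64 KB figures are taken from the examples earlier in this article.

```python
import math


def throughput_mib_s(iops, block_kib):
    """Throughput (MiB/s) delivered by a given IOPS rate at a given block size."""
    return iops * block_kib / 1024


def iops_cap_for(target_mib_s, block_kib):
    """IOPS cap needed to allow a target throughput at a given block size."""
    return math.ceil(target_mib_s * 1024 / block_kib)


# 2000 IOPS of 64 KB blocks already means 125 MiB/s of throughput.
print(throughput_mib_s(2_000, 64))  # 125.0
# The same 125 MiB/s would need eight times the IOPS with 8 KB blocks.
print(iops_cap_for(125, 8))         # 16000
```

A cap tuned for a generic small block size would therefore either starve a 64 KB workload of throughput or hand out far more IOPS than it can use.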
What’s interesting about the two VMs is that the IOPS are not very high; it is the throughput that is high, and both burst aggressively. Such VMs are usually troublemakers for a service provider as they put strain on both the SAN and the network, especially with random, high-throughput IO bursts.
This translates into:
- The network being fast enough to manage the bursts. Webhosting.net uses Arista deep buffer switches, and the technical advantages of Arista’s EOS platform are:
  - #1 in 40G and 100G core networking
  - Captured 12.2% market share in core data center switching and growing
  - Most stable and resilient network operating system, refined over 15 years
  - Single binary image runs across the entire switch portfolio
  - On-ramp to SDN at the client’s own pace
  - Leader in automation for lowering OPEX and CAPEX in modern cloud networks
  - Open standards based, with nothing proprietary in Arista’s EOS
  - Redefining network architectures by enabling extensibility to data center networking
  - Dramatically changes price for performance
  - #1 partner to VMware, with Arista as the underlay for VMware’s vCloud Air platform
- We protect both the SAN and the network by introducing local storage acceleration with PernixData FVP software. FVP handles storage reads and writes locally in SSDs. This enables us to easily scale out performance by simply adding SSDs to an ESXi host. It also offloads I/Os from the network and SAN (see figure below).
- Latency is also minimized with FVP. By handling storage reads and writes locally, we can minimize storage response times, which maximizes VM performance.
For example, on one cluster of ours, Pernix saved quite a few resources on the SAN and network.
So essentially what we did was provide the customer with performance at a desired space usage, without forcing them to use X space for Y IO, high-end expensive SSDs, or a certain amount of cores or RAM. The customer did not have to figure out their IO patterns and do the calculations themselves; we did all of this for them by looking at the application performance and latency on the backend.
If we look at Amazon and Azure:
- Using Amazon, the customer could have run the VMs, but only on Provisioned IOPS (SSD) volumes. What we have not illustrated above is that, without caps, the VMs can go upwards of 450 MBps per 300 GB disk. Amazon’s per-disk limit is 320 MBps.
- Using Azure, the VMs could fit when most IOs are served from cache, but not with random IO, because of the disk-size IO limits (remember that with Azure, reads served from the cache are not subject to the Premium Storage disk limits and higher limits apply, but as soon as the customer does random IO, it will hit the lower limits of the disk).
- Customers may not always have the skills, expertise, motivation or time to do the math and the research on the different cloud providers and their individual nuances. Rather, the customer’s objective is simply to have their applications perform in the most cost-effective environment.
Live Increasing of Resources
One key advantage of a cloud over a dedicated physical deployment is the ability to hot-add extra resources, for example by increasing the number of CPU cores, the amount of RAM, or the disk size.
For example, the application starts to put load on the CPU and extra cores are needed immediately. If the OS supports this, and many operating systems currently do, webhosting.net can add more cores or memory, or increase the disk size, on the fly. No shutdown or reboot is required, so applications that must always be up avoid the downtime a restart would typically cause.
Azure, for example, currently does not have, or plan to have, CPU or memory hot-add capability, according to their site.
We can also live migrate virtual machines to faster resources.
A Little Extra
On a regular day for me, a customer calls saying that, for various reasons, they have lost important files or their VPS has issues, and requests a restoration of the machine.
On various clouds, customers are responsible for their own backups. But sometimes, they may not have a backup available or in fact their backups have been compromised.
Apart from actually helping the customer to back up their applications and virtual servers, there is a little extra service we offer. We can help restore files from our own snapshot backups, provided they are within our backup restore points, even if the customer has not subscribed to a backup service. Many times we have restored accidentally deleted files or compromised virtual servers for customers who saw no value in having a backup plan to start with.
Remember the Code Spaces misfortune, where an attacker took control of their Amazon account and deleted all their stuff, including their backups, even though they were relatively fast to react? Well, we could have restored pretty much all of their data in a matter of minutes, from our own backups.
Summing It Up, Your Tailored Cloud Solution
By combining VMware, high-performance storage and low-latency networking with Pernix acceleration, Webhosting.net’s VMware cloud offering can take massive loads of IO, serving them directly from local host SSDs and increasing the limits of the IO per capacity.
The result is that if needed, we offer more IO and throughput from a smaller sized VM without forcing the customer to overpay for an oversized virtual server.
Bottom line: every cloud solution has its limits.