When most companies have an SLA (service level agreement) in play, the business and legal folks deal with the contracts and examine the fine print, and the engineers don’t have to worry about the details.
In migrating our SaaS platform, however, we had to sweat the small stuff. Our technical design decisions directly impact the SLA, because:
- An IaaS VM does not receive the 99.95% Azure SLA if you use an unsupported OS 
- An IaaS VM does not receive the 99.95% SLA if it is in a single-server Availability Set 
The first point isn’t an issue for us because we keep our OSes up-to-date; however, some organizations may have “unpatchable legacy systems” they are migrating. They will have two choices: Have a 0% SLA on those VMs, or upgrade to a supported OS.
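To make the two rules concrete, here is a minimal Python sketch. It assumes the only SLA conditions are the two listed above, and it treats the 99.95% figure as a monthly number over a 30-day month. The `VM` class and function names are hypothetical, not part of any Azure SDK.

```python
from dataclasses import dataclass

SLA_PERCENT = 99.95  # the IaaS VM SLA figure quoted above


@dataclass(frozen=True)
class VM:
    name: str
    supported_os: bool
    availability_set_size: int  # how many VMs share this VM's availability set


def qualifies_for_sla(vm: VM) -> bool:
    """Apply the two rules: supported OS, and at least two VMs in the set."""
    return vm.supported_os and vm.availability_set_size >= 2


def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Downtime budget implied by an SLA percentage (30-day month assumed)."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


if __name__ == "__main__":
    lone_monitor = VM("monitor-01", supported_os=True, availability_set_size=1)
    print(qualifies_for_sla(lone_monitor))
    print(round(allowed_downtime_minutes(SLA_PERCENT), 1))
```

Under those assumptions, 99.95% works out to roughly 21.6 minutes of allowed downtime per month, and a VM that fails either rule gets no budget at all.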
The second point is the one that bugs me. Azure, like AWS, has the concept of a Fault Domain. In addition, Azure also has Update Domains: Groups of hypervisors that are patched together and rebooted at the same time. Patch the hypervisors in update domain #1, reboot them at the same time, move on to the systems in update domain #2, and so on. Faults happen, and patching the underlying infrastructure is necessary.
Common scenario: You have a VM that you only need one of, such as a monitoring server, a metrics server, an SMTP server, or a reverse proxy server. Sure, you could spin up two or more of each and make them highly available, but most organizations spin up a single server and deal with the infrequent outages.
If you spin up a single monitoring server in Azure, the VM will not qualify for the 99.95% IaaS VM SLA. But if you spin up two monitoring servers in the same availability set they will qualify for the SLA because Microsoft considers them “highly available.” A fault in one fault domain would bring down one but not the other. The rolling hypervisor updates would bring down one at a time, but not both.
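Here is a small sketch of the rolling-update reasoning. It assumes VMs in an availability set are spread round-robin across update domains and that exactly one update domain reboots at a time. The function names and the default of five update domains are assumptions for illustration.

```python
def assign_update_domains(vm_names, update_domain_count=5):
    """Spread VMs across update domains round-robin (assumed placement)."""
    return {name: i % update_domain_count for i, name in enumerate(vm_names)}


def survives_rolling_update(vm_names, update_domain_count=5):
    """True if at least one VM stays up while each domain reboots in turn."""
    placement = assign_update_domains(vm_names, update_domain_count)
    for rebooting in range(update_domain_count):
        still_up = [n for n, domain in placement.items() if domain != rebooting]
        if not still_up:
            return False
    return True


if __name__ == "__main__":
    print(survives_rolling_update(["monitor-01"]))
    print(survives_rolling_update(["monitor-01", "monitor-02"]))
```

The single server always shares an update domain with itself, so some reboot window takes the whole service down. Two servers land in different domains, so one is always up.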
At this point Microsoft considers them highly available and will give us a non-zero SLA, but from a practical perspective they are not. Our team will need to perform additional work to make our Azure monitoring servers highly available—work that does not need to happen in our other environments (AWS, bare metal). This may include:
- Setting up an HA solution to pass a VIP back and forth
- Configuring the systems so only one is active at a given time
- Modifying our strategy for monitoring the monitoring servers (which one is active?)
- Deciding what happens to in-transit alerts during a failover (dropped or re-delivered?)
- Deciding what happens to the state of the monitoring servers (e.g., active alerts) during a failover
- Determining whether the monitoring software was designed to fail over or whether hacking will be involved
As you can see, it isn’t as simple as spinning up a second server. There are questions to be answered and engineering effort to be expended in order to create a functional solution.
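As one example of the extra work, the "which one is active?" question could be answered by a check like the sketch below. It assumes each monitoring server can report a role of `active`, `standby`, or `down`; those role strings and the function name are hypothetical.

```python
def classify_pair(roles):
    """Classify an active/standby pair from a {node_name: role} mapping.

    Returns a (status, active_node) tuple. The status is "ok" when exactly
    one node is active, "no-active" when none is, and "split-brain" when
    more than one node claims the active role.
    """
    active = sorted(name for name, role in roles.items() if role == "active")
    if len(active) == 1:
        return ("ok", active[0])
    if not active:
        return ("no-active", None)
    return ("split-brain", None)


if __name__ == "__main__":
    print(classify_pair({"monitor-01": "active", "monitor-02": "standby"}))
    print(classify_pair({"monitor-01": "active", "monitor-02": "active"}))
```

A check like this has to run somewhere other than the pair itself, which is part of why the second server is not free.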
What’s your take on the SLA situation? Do you spin up one server and eat the outages, or did you expend the effort to create an Azure-specific, highly available solution?
Affinity groups are similar to AWS placement groups; however, Azure’s are much more powerful because all resources in an affinity group are placed near each other. While AWS provides a 10 Gbps network between VMs when you use a placement group, Azure provides that and data locality, which is huge. Heck yeah, I want my data close to my compute: why not?
I’ll spoil the fun: Microsoft recommends against affinity groups. In the beginning Azure was only a north-south network, but in 2012 they redesigned the fabric to add an east-west network for what is essentially internal Azure traffic. Due to this change, you no longer need to use affinity groups.
So far our performance has been adequate, but some type of compute and data placement option could be a great differentiator for Microsoft’s cloud.
The Azure CentOS image will configure sudo one of two ways: password required or passwordless. How sudo is configured depends on whether you set your login account to use a password, an SSH key, or both.
- Password set = password required for sudo
- Password set and SSH key injected = password required for sudo
- SSH key only, no password set = no password required for sudo
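The three cases above reduce to a single rule: sudo asks for a password whenever a password was set. A minimal sketch, with a hypothetical function name:

```python
def sudo_requires_password(password_set: bool, ssh_key_injected: bool) -> bool:
    """Mirror the three cases above for the Azure CentOS image."""
    if not password_set and not ssh_key_injected:
        raise ValueError("an account needs a password, an SSH key, or both")
    # The SSH key does not change the outcome; only the password does.
    return password_set


if __name__ == "__main__":
    print(sudo_requires_password(password_set=True, ssh_key_injected=False))
    print(sudo_requires_password(password_set=True, ssh_key_injected=True))
    print(sudo_requires_password(password_set=False, ssh_key_injected=True))
```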
This was a gotcha because during our early testing we used passwords for our crash-and-burn systems, and due to this early pattern we assumed passwordless sudo was not an option. As a result, we found we could not use Terraform’s Chef provisioner to provision VMs because of the sudo password prompt.
We decided to manually provision the nodes using knife bootstrap. Later, when we went keys-only we learned about the different sudo options. In a later iteration we’ll work on integrating Terraform’s Chef provisioner so we can skip the manual bootstrapping.
OS Disk Performance
Azure’s OS drives are specifically designed for fast OS boot times. They are not designed for fast data access and, we assume by extension, fast application startup (big JVMs, etc). This is an intentional design tradeoff by Microsoft. 
We are happily ignoring their advice.
We put everything on the OS disk, and so far things are fine. We are:
- Aware of the performance limitation
- Designing our OS image and file system layout with the limitation in mind
- Gathering metrics for performance analysis
- Watching trends to determine when we should split off non-OS data + apps
Easy as that: If performance starts to stink we can bust out the data disks—but for now we’ll keep it simple, save the money, and iterate in the future if necessary.
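"Watching trends" can be as simple as fitting a line to a daily metric and projecting when it crosses a limit. The sketch below assumes one sample per day of some OS disk metric, such as average latency in milliseconds. The metric, the threshold, and the function names are all hypothetical.

```python
def linear_fit(samples):
    """Least-squares slope and intercept for a list of (day, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(samples, threshold):
    """Days from the latest sample until the fitted line reaches threshold.

    Returns 0.0 if the fitted line is already at or above the threshold,
    and None if the trend is flat or improving.
    """
    slope, intercept = linear_fit(samples)
    latest_day = max(x for x, _ in samples)
    if slope * latest_day + intercept >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - latest_day


if __name__ == "__main__":
    latency_ms = [(0, 10.0), (1, 12.0), (2, 14.0), (3, 16.0)]
    print(days_until_threshold(latency_ms, threshold=30.0))
```

A straight line is a crude model, but it is enough to turn a dashboard into a rough date for when the data disks should come out.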
Each Azure resource (VM, public IP, load balancer, etc) must be a member of a resource group. You can set Azure-level access controls at the resource group level. A resource group can contain resources from any region.
That said, how the heck do you design your resource group layout? We’re integrating two regions, so we decided to have a resource group for each region and threw all of the region’s Azure objects into the appropriate group. We didn’t think about it too hard.
We didn’t get fancy for two reasons: first, we don’t need the access controls. We ported over our access controls from lower layers (e.g., the DBAs can’t log in to the app servers, network ACLs, etc.) and we saw little benefit to adding another layer. Second, our design is more cloud-agnostic if we don’t make heavy use of the feature. When Azure goes away in 70 years, the guy who took over my job should have an easy time porting it to the next cloud, right?
There are two great things about the logical groupings that resource groups provide. First, it’s easy to keep test/non-prod/sandbox environments separate from real production. I have a simple “joshtest” resource group and I do all my crash-and-burn work in there; my co-workers can easily determine the purpose of the resource group if I’m not available.
Another area where resource groups shine is compliance. It’s easy to create reports of which Azure accounts have access to which resource groups, and so long as you limit access properly at the Azure level (e.g., DBAs only have access to the DB resource groups) it will be easy to audit and prove your access controls.
We only use resource groups because we have to, so we kept it simple and did one per region.
Email and SMTP
Azure doesn’t have a built-in service like Amazon’s SES and we don’t want to fiddle with setting up another SMTP server, so we’re using the SendGrid Azure Marketplace app.
So far things have been smooth, but there’s one big gotcha: If you hit your email limit you cannot extend it.  For example, if you buy a package of 100,000 emails on the first of the month and hit your limit on the fifteenth, you’re hosed: you cannot “extend” the account. Instead, you must delete the Marketplace app and “re-subscribe,” or wait until next month. If you re-subscribe to a new account you lose all existing analytics and have to reconfigure Postfix. Oof!
We could have tried a non-Marketplace provider, but instead we mitigated this in three ways:
- We learned our lesson early. Our failed cron jobs were spitting out errors and used up our small allowance quickly.
- We created monitors that alert us if only X% of our monthly quota remains.
- We keep our SMTP configuration in Chef so it’s easy to change if changing providers is necessary.
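The quota monitor from the second bullet boils down to a percentage check. This is a minimal sketch that assumes you can get the sent count and the plan limit from somewhere; the function names and the 20% default threshold are hypothetical.

```python
def quota_remaining_pct(sent: int, monthly_limit: int) -> float:
    """Percentage of the monthly email quota that is still unused."""
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    remaining = max(monthly_limit - sent, 0)
    return 100.0 * remaining / monthly_limit


def should_alert(sent: int, monthly_limit: int, threshold_pct: float = 20.0) -> bool:
    """Alert when the remaining quota is at or below threshold_pct."""
    return quota_remaining_pct(sent, monthly_limit) <= threshold_pct


if __name__ == "__main__":
    # 100,000-email package with 85,000 already sent leaves 15%.
    print(quota_remaining_pct(85_000, 100_000))
    print(should_alert(85_000, 100_000))
```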
One of our test accounts hit the limit and I had to purchase another one. However, this time, the account didn’t change. When I deleted and re-purchased the Azure Marketplace app my SendGrid login stayed the same, unlike in the past. Perhaps SendGrid and/or Microsoft updated stuff in the background?
The failed cron jobs quickly ate up our monthly allowance, but we’ve fixed them and our real customers won’t generate nearly as many emails; so, we feel we made the right choice.
Hey, why did you send my data to Ireland?
Azure storage accounts have several replication and redundancy options to choose from. I won’t dig into them all, but it’s important to understand the implications of Geo-Redundant Storage (GRS) being the default option:
- It’s the highest level of durability that Azure offers (3 copies of the data in the local region and 3 copies in a different region). More copies = more $.
- The inter-region replication is async in the background (good!)
- Out of all the storage types, it costs the most (bad!)
- The documentation states “… and is also replicated three times in a secondary region hundreds of miles away from the primary region.”
Whoa, dude. Will my data be leaving the country?
According to the docs it looks like they designed it so replication stays within the same country, but there’s a gotcha for Europe: North Europe (Ireland) replicates to West Europe (the Netherlands). Be aware of this, and work with legal and other areas of the business to determine if storage replication will create compliance issues.
We went with a non-GRS option because we have existing workflows to back up our data to another site; thus Azure-level inter-region data replication was not a compelling feature. Understand where your data is automagically being replicated and speak with your legal department for guidance.
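A quick way to reason about this is a lookup table of region pairs and the country each region sits in. The sketch below is illustrative only: it covers the European pair mentioned above plus one US pair, and the tables should be checked against Microsoft's current paired-region documentation before anyone relies on them.

```python
# Illustrative subset only; verify against the current paired-region docs.
REGION_COUNTRY = {
    "North Europe": "Ireland",
    "West Europe": "Netherlands",
    "East US": "United States",
    "West US": "United States",
}

GRS_PAIR = {
    "North Europe": "West Europe",
    "West Europe": "North Europe",
    "East US": "West US",
    "West US": "East US",
}


def grs_leaves_country(primary_region: str) -> bool:
    """True if the GRS secondary copy lands in a different country."""
    secondary_region = GRS_PAIR[primary_region]
    return REGION_COUNTRY[primary_region] != REGION_COUNTRY[secondary_region]


if __name__ == "__main__":
    for region in GRS_PAIR:
        print(region, "->", GRS_PAIR[region], grs_leaves_country(region))
```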
Use the portal to create a VM, then use the portal to delete it. Next:
- In the portal, go into the storage account you specified when you created the VM
- Click on Blobs
- Click on vhds
- What the heck?
Yep: Azure doesn’t delete the VHDs when you delete a VM. If you don’t know about this behavior, you could have terabytes of old VHDs costing you money.
We believe this is an intentional design decision by Microsoft. The action of deleting the VM is just that: delete the VM, leave the other stuff. One analogy is VMware’s “remove from inventory” vs. “remove from disk.” The first deletes the VM container but leaves the virtual disks, and the second does both.
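If you suspect you already have leftovers, one approach is to compare the blobs in the vhds container against the disks your current VMs reference. This sketch assumes you have already exported both lists (from the portal, CLI, or API) and only does the comparison; the function name and sample names are hypothetical.

```python
def find_orphaned_vhds(blob_names, attached_vhd_names):
    """Return .vhd blobs that no existing VM references, sorted by name.

    Blob names are compared exactly, because blob names are case-sensitive.
    """
    attached = set(attached_vhd_names)
    return sorted(
        name
        for name in blob_names
        if name.lower().endswith(".vhd") and name not in attached
    )


if __name__ == "__main__":
    blobs = ["web01-os.vhd", "web02-os.vhd", "joshtest-os.vhd", "notes.txt"]
    in_use = ["web01-os.vhd", "web02-os.vhd"]
    print(find_orphaned_vhds(blobs, in_use))
```

Treat the result as a review list rather than a delete list: a VHD with no VM attached might still be something you wanted to keep.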
Similar to resource groups, Azure has the concept of storage containers. Each Azure account has one or more storage accounts, and each storage account must have one or more storage containers. An example use case for storage containers is to have one for “private” data (VHDs, data disks, etc.) and one for “public” data (anonymous access enabled, world-readable).
Originally we created one storage container per storage account. Much like resource groups we didn’t have a good reason to use them, but they are required, so we used them.
Getting back to my original point: Our work-around is to create an Azure storage container for each VM in Terraform. Terraform seems to see the container as the parent object, the VM as the child and the VHDs as children of the VM. Terraform will delete the container and by extension delete its children (the VM and disk.)
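One container per VM means you need a predictable container name for each VM. Container names are more restricted than VM names: as I understand the rules, they are 3 to 63 characters of lowercase letters, digits, and single hyphens, and must start and end with a letter or digit. Here is a sketch of a naming helper; the `vm-` prefix and the function name are our own hypothetical convention.

```python
import re


def container_name_for_vm(vm_name: str) -> str:
    """Derive a storage container name from a VM name.

    Lowercases the name, turns every run of other characters into a single
    hyphen, trims stray hyphens, and enforces the 63-character limit.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", vm_name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a container name from {vm_name!r}")
    # The prefix guarantees the 3-character minimum length.
    return f"vm-{slug}"[:63].rstrip("-")


if __name__ == "__main__":
    print(container_name_for_vm("Mon_Server 01"))
```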
Create a storage container for each VM, then delete the container in order to delete the VM and VHDs.
We evaluated the Azure VPN offering and found four things that dissuaded us from using it:
- No Active Directory integration
- Requires a certificate for each individual client
- Does not support password-based authentication
- Only allows alphanumeric (no symbols) in the PSK
We decided the management overhead and lack of AD integration were just too much and instead opted for running commercial VPN appliances on Azure’s IaaS VMs; then we hooked that into our Active Directory.
No AD integration is a showstopper, from my point of view.
VIPs and Network Interfaces
When you cluster systems for high availability they have a VIP (shared IP) that floats between them. The VIP is assigned to a Linux virtual interface (such as eth0:1) on the active system. We wanted to reuse our existing HAProxy + keepalived configuration in Azure but we couldn’t find a way to make it work (and we weren’t the only ones).
Like AWS, Azure has a separate cloud construct for the NIC. The Azure NIC is separate from the VM and is managed separately. IPs are assigned to the NIC and the NIC is associated with a VM. The NIC and its associated IPs can then be moved to a different VM. The key concept is that all operations on IP addresses are done at the Azure NIC level, not at the OS NIC level.
Unfortunately, keepalived does not know how to interact with the Azure NICs so it cannot set the IP on the Azure NIC, re-assign it or monitor it. Until keepalived is updated so it can talk to the Azure networking constructs, we cannot use it.
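A toy model may help show why an OS-level tool is stuck here. In the sketch below, traffic for an IP goes to whichever VM owns the Azure-level NIC that holds that IP, and nothing the guest OS does changes that. The class, functions, and addresses are hypothetical and do not call any Azure API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AzureNic:
    """Azure-level NIC: IPs belong to the NIC, and the NIC belongs to a VM."""
    name: str
    attached_vm: Optional[str] = None
    ips: List[str] = field(default_factory=list)


def move_ip(ip: str, source: AzureNic, target: AzureNic) -> None:
    """Re-assign an IP at the Azure NIC level (the step keepalived cannot do)."""
    source.ips.remove(ip)  # raises ValueError if the IP is not on the source
    target.ips.append(ip)


def vm_receiving(ip: str, nics: List[AzureNic]) -> Optional[str]:
    """Return the VM that traffic for this IP reaches, per NIC-level config."""
    for nic in nics:
        if ip in nic.ips:
            return nic.attached_vm
    return None


if __name__ == "__main__":
    vip = "10.0.0.10"
    nic_a = AzureNic("nic-lb1", attached_vm="lb1", ips=["10.0.0.4", vip])
    nic_b = AzureNic("nic-lb2", attached_vm="lb2", ips=["10.0.0.5"])
    print(vm_receiving(vip, [nic_a, nic_b]))
    move_ip(vip, nic_a, nic_b)
    print(vm_receiving(vip, [nic_a, nic_b]))
```

In this model a scripted failover is just a `move_ip` call made through the cloud API, which is the step our manual process performs.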
Currently, we are running hot/cold with a manual, but scripted, failover process. Automated failover would be ideal, but getting it would mean moving completely to Azure load balancers and removing HAProxy. We are not willing to make that trade because:
- The time we put into customizing and tuning HAProxy would be lost.
- We don’t know the bugs/warts/gotchas of Azure’s load balancers.
- We have strong HAProxy operational knowledge vs. very little Azure load balancer knowledge.
We’re aware of the limitations of the current solution and are discussing our options. It isn’t ideal but it’s functional, we are meeting our (CA Agile Central) SLAs and we continue to re-evaluate and iterate.
When a Shutdown Isn’t a Shutdown
Shutting down the OS from within the VM does not stop the Azure billing meter. Commands such as “shutdown”, “init 0”, and the Windows shutdown menu option will not stop the billing meter.
When you shut down a VM, be sure to use the portal or API. If OS-level shutdowns are a requirement for your organization I assume an OS-level shutdown and then a second “Azure level” shutdown (portal or API) will work.
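The distinction is easier to see as a state table. As I understand Azure's power states, an OS-level shutdown leaves the VM "stopped" with its compute still reserved, while a portal or API stop leaves it "deallocated". The sketch below models only that; the action names and function names are hypothetical.

```python
# Whether the compute billing meter runs in each power state (assumed mapping).
COMPUTE_BILLED = {
    "running": True,
    "stopped": True,       # OS is down, but the compute is still reserved
    "deallocated": False,  # compute released via the portal or API
}

# Hypothetical action names mapped to the state each one leaves the VM in.
STATE_AFTER = {
    "os_shutdown": "stopped",
    "azure_stop": "deallocated",
}


def compute_meter_running(power_state: str) -> bool:
    """Return True if the VM is still billed for compute in this state."""
    if power_state not in COMPUTE_BILLED:
        raise ValueError(f"unknown power state: {power_state!r}")
    return COMPUTE_BILLED[power_state]


def still_billed_after(action: str) -> bool:
    """Return True if compute billing continues after the given action."""
    return compute_meter_running(STATE_AFTER[action])


if __name__ == "__main__":
    print(still_billed_after("os_shutdown"))
    print(still_billed_after("azure_stop"))
```

If you automate this, have the script confirm the VM actually reached the deallocated state rather than assuming the stop request worked.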
In a Nutshell
My typewriter is almost out of ink, so I’ll be brief: Our team and platform are both very Linux-centric so we were apprehensive when we found out we were using Azure, but overall the migration was a very positive experience. On a scale of 1 to 10 I give it a B+. Thanks for reading!
The team that made it happen, listed in an order determined by a D20:
Craig J, Dave S, Cassie K, Ali Y, Kate T, Cameron M, Matt S, Bala N, Roland W, Mark T, Jeff S, Chris B, Justin D, Cameron C, Ken G, and Adam Z