Windows Clusters for SAP on AWS
This analysis covers the design of the Windows Server Failover Cluster and the key considerations to keep in mind during deployments.
SAP enterprise customers have strict availability requirements, and to meet these KPIs, architects turn to clustering solutions. The Windows Server Failover Cluster (WSFC) is one such solution. It is included with Windows Server (Standard and Datacenter editions), which makes it a default choice for setups running on Windows Server.
Note that the SAP HANA database is currently released only for Linux (SUSE and RHEL). Hence, in such setups, WSFC can be used for SAP Central Services (ASCS/SCS). For MS-SQL deployments, WSFC is also used to provide high availability with SQL Server Always On.
In this article, we will go through the design aspects of WSFC and the key points to keep in mind during deployments.
A WSFC setup has the following minimum components; each must be accessible, and an administrator for each must be available:
- A shared filesystem (FSx in AWS)
- Windows Domain Controller
- Windows FileShare Witness
- Domain administrator permissions for SAP installation
Let us look at each one of these in detail.
Shared Filesystem (FSx)
The "sapmnt" filesystem is one of the core components of an SAP installation: it contains the profiles, hosts the SAP kernel binaries, and works in conjunction with the cluster software to provide seamless failover. Several implementations are possible; the preferred native mechanism on AWS is Amazon FSx for Windows File Server, a scalable, fully managed file storage service built on Windows Server. Its file shares are accessed using the SMB (Server Message Block) protocol.
Remember that SAP supports hostnames of only up to 13 characters; hence, an additional DNS alias (CNAME record) that follows the SAP naming conventions and character limit should be created, so that it can be provided during the SAP installation.
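As a sketch, such an alias can be created with the `DnsServer` PowerShell module. The alias name, FSx DNS name, and zone below are hypothetical examples; note that "sapshare" stays within the 13-character limit.

```powershell
# Create a short CNAME alias for the long FSx DNS name
# (all names here are placeholders for illustration)
Add-DnsServerResourceRecordCName -ZoneName "corp.example.com" `
    -Name "sapshare" `
    -HostNameAlias "amznfsxabcd1234.corp.example.com"
```

During the SAP installation, the share can then be referenced as `\\sapshare\sapmnt` instead of the full FSx DNS name.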
Windows Domain Controller
Windows Cluster is designed to work in conjunction with a domain controller: the Cluster Name Object (CNO) is created in Active Directory through these domain controllers. Customers usually utilize an existing domain controller and an existing Active Directory. In the case of SAP on AWS, the administrator should clearly identify these components and ensure that they are preferably located in AWS, with a local copy, to avoid latency problems. Typically, multiple domain controllers exist to provide high availability, and it should be clearly understood how this setup works, since the cluster state may be reflected incorrectly between domain controllers, leading to inconsistencies. If customers have both on-premises and AWS domain controllers, it is preferred that all WSFC configurations are mapped to the AWS domain controller, which will then be replicated to all domain controllers; this also ensures minimal latency between the SAP servers and the domain controllers.
IP addresses are mapped to user-friendly names in DNS, which works with the domain controller and Active Directory Domain Services (AD DS) to provide the lookups. There is one important difference between Linux and Windows clusters in how the virtual IP address of a cluster is managed (this virtual IP is sometimes referred to as a "floating IP"). In Linux clusters on AWS, it is implemented using an overlay IP, while in WSFC, the IP address of each cluster node is registered in DNS against the virtual cluster name, and the lookups for the virtual name are adjusted during a cluster failover. The DNS update hence becomes a critical factor for end users' access to the cluster; if the update is not reflected immediately on all end-user machines, users can see errors even though the cluster itself has failed over.
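From a client machine, the current state of this DNS mapping can be inspected and refreshed with the standard `DnsClient` cmdlets. The virtual name below is a hypothetical example.

```powershell
# Check which IP address the virtual cluster name currently resolves to
# ("sapascs" is a placeholder for the cluster's virtual name)
Resolve-DnsName -Name "sapascs.corp.example.com" -Type A

# After a failover, flush the local resolver cache so the client
# picks up the updated A record immediately instead of waiting for the TTL
Clear-DnsClientCache
```

This is useful during failover testing to confirm how quickly the new A record becomes visible to clients.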
Windows FileShare Witness
Each cluster requires a quorum to detect and resolve a split-brain situation, in which both cluster nodes can end up in a race condition to determine which one is the active node. An external witness device (in this case, a file share on FSx) resolves this by effectively joining the cluster as a third node. Both cluster nodes perform an I/O operation on the witness device, and the two cluster nodes together with the witness, making a total of three votes, decide the active node at any point in time. While the amount of I/O is small, it is important that the latency and throughput of this I/O operation are kept healthy, as it happens every few seconds and even minor delays can put the cluster into an unhealthy state.
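Configuring the FSx share as the witness is a single cmdlet from the `FailoverClusters` module; the UNC path below is a hypothetical example.

```powershell
# Configure node and file share majority quorum using a share on FSx
# (the FSx DNS name and share name are placeholders)
Set-ClusterQuorum -NodeAndFileShareMajority "\\amznfsxabcd1234.corp.example.com\witness"

# Verify the resulting quorum configuration
Get-ClusterQuorum
```

The cluster computer object (CNO) needs write permissions on the witness share for the quorum I/O to succeed.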
Domain Administrator Permissions
In the case of Windows installations, domain administrator permissions provide everything necessary to install SAP on Windows. In many cases, this permission is not allowed by customer security policies; hence, the SAP installation guide lists the specific permissions that need to be granted in such cases. These include Create/Delete/Modify for users, groups, computer objects, DNS server records, and other objects in the specific OU (Organizational Unit) where the SAP systems are to be created.
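One common technique when domain admin rights are not available is to pre-stage the Cluster Name Object in the target OU as a disabled computer account, then grant the installing account Full Control on that object. The names and OU path below are hypothetical.

```powershell
# Pre-stage the CNO as a disabled computer account in the SAP OU
# (cluster name and OU path are placeholders for illustration)
New-ADComputer -Name "SAPCLUS" `
    -Path "OU=SAP,DC=corp,DC=example,DC=com" `
    -Enabled $false
```

The cluster creation process then takes ownership of this pre-staged object instead of creating a new one, which avoids requiring create-object rights in the domain.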
When setting up clusters, there are a few parameters whose impact needs to be evaluated:
RegisterAllProvidersIP
This cluster parameter determines whether all IP addresses of the cluster nodes, which potentially span multiple subnets, are registered in DNS by the cluster, or whether only the active node's IP is registered. A value of 1 (now the default) registers all IP addresses in DNS. The problem with this configuration in AWS, specifically for ASCS/ERS clusters, is that multiple IP addresses are mapped to a single hostname and the name resolution is not consistent.
This configuration works well for the MS-SQL Always On cluster, because the database client is designed to attempt connections to all available IP addresses in parallel by setting MultiSubnetFailover=True and to use the connection that succeeds.
At this stage, the ASCS/ERS cluster design does not support a similar connection process in which multiple IP addresses for the same ASCS virtual name are attempted in parallel and only the successful connection is used for further processing. Since this implementation is not in place for ASCS/ERS, there is no consistent and predictable behavior. Hence, it is required to set this value to 0 and let the cluster update the IP address in DNS during the failover process.
Note that this can lead to problems in the case of multiple DNS servers that need some time to synchronize; see the additional parameters below for handling this problem.
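In practice, this setting is applied to the network name resource of the ASCS cluster role. A minimal sketch, assuming a hypothetical resource name (check the actual name with `Get-ClusterResource`):

```powershell
# Register only the active node's IP in DNS for the ASCS virtual name
# ("SAP SID Network Name" is a placeholder resource name)
Get-ClusterResource "SAP SID Network Name" |
    Set-ClusterParameter -Name RegisterAllProvidersIP -Value 0

# The network name resource must be cycled for the change to take effect
Stop-ClusterResource "SAP SID Network Name"
Start-ClusterResource "SAP SID Network Name"
```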
MultiSubnetFailover
This is a client-side parameter that allows client applications to attempt connections in parallel to the different IPs defined in DNS for the MS-SQL database server, thereby avoiding timeout scenarios. However, the SAP application-layer ASCS/ERS cluster is currently not designed to understand this parameter. Note that this is different from SAP application server connectivity to the MS-SQL database: SAP application servers do understand the MultiSubnetFailover parameter and are therefore able to handle connectivity to the database during a failover. The SAP ASCS/ERS clusters, however, have no similar parameter to provide seamless connectivity during a failover.
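For illustration, this is how the parameter typically appears in a SQL client connection string; the server and database names are hypothetical placeholders.

```powershell
# Example connection string for a multi-subnet Always On listener;
# "sqlvnn" and "PRD" are placeholders
$connStr = "Server=tcp:sqlvnn.corp.example.com,1433;Database=PRD;" +
           "Integrated Security=SSPI;MultiSubnetFailover=True"
```

With this flag, the client opens TCP connections to all listener IPs in parallel and keeps whichever succeeds first, so a failover across subnets does not wait for a connection timeout.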
HostRecordTTL
This is a cluster parameter that defines how long a client OS keeps the cached DNS entry before querying DNS again. The default value is 1200 seconds; however, AWS recommends lowering this to 15 seconds, and even to 1 second in cases where downtime has to be near zero.
This introduces another problem: DNS is now queried very frequently by all client OSs, which can number in the thousands for a large enterprise SAP landscape. There is no generally valid value for this parameter; proper cluster testing should reveal an optimal value.
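Like RegisterAllProvidersIP, this parameter is set on the network name resource; the resource name below is a hypothetical example.

```powershell
# Lower the TTL of the A record the cluster registers for the virtual name
# ("SAP SID Network Name" is a placeholder resource name)
Get-ClusterResource "SAP SID Network Name" |
    Set-ClusterParameter -Name HostRecordTTL -Value 15

# Cycle the resource so the A record is re-registered with the new TTL
Stop-ClusterResource "SAP SID Network Name"
Start-ClusterResource "SAP SID Network Name"
```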
DNS Notify Mechanism
While record updates reach clients based on the TTL, it is common to have multiple DNS servers. Usually, updates happen in the primary zone (in simple words, the main DNS server) and are asynchronously replicated to the secondary zones (other DNS servers that keep a read-only copy of the DNS records). The frequency of this notify mechanism can play an important role in how quickly client machines receive record updates in case of changes.
Since the cluster dynamically modifies the A record in the primary zone during a failover, client machines that use secondary zones will see some delay and receive application error or timeout messages. To avoid this, one option is to use Amazon Route 53, the AWS-managed DNS service. Note that this managed DNS service is billed by the number of DNS lookups; although the cost is usually not significant, depending on the number of lookups across the organization it can become a notable cost component in case of a low TTL and heavy usage.
Another mechanism to handle DNS update delays is to use SAP Web Dispatcher and an AWS load balancer as the client entry point instead of the SCS/ASCS, which undergoes an IP address change during failover. There are two design options:
- Utilize AWS Application Load Balancer (ALB) to directly route traffic to SAP Web Application Servers. In this case, it's important to ensure that the SAP application servers can see the DNS change immediately.
- Utilize SAP Web Dispatcher (single or multiple in active-active setup). This can be used in combination with AWS ALB. The SAP Web Dispatcher provides additional SAP-specific session handling and routing mechanisms that are not available in AWS ALB. However, this requires additional servers to be set up.
Either of the mechanisms above keeps a constant IP address mapped to the end-user application URL in DNS. Hence, the IP change in DNS caused by a WSFC failover no longer directly impacts all end-user client machines; the change only needs to become visible to the SAP Web Dispatcher and SAP application server instances.