In the previous article I mostly described how VMware Virtual SAN (VSAN) actually works and considered a virtualization workload of up to 100 VMs. That is a relatively small requirement, which needed 4 nodes with 10 magnetic drives in each. But what if you are building a VSAN infrastructure for thousands of VMs? In this article I will give an example of how to design a storage complex like that.
Because VMware VSAN, like any other Software Defined Storage (SDS), does not have any central elements, we will not have central bottlenecks or single points of failure. And since we can scale (or modernize) our storage complex on a per-drive basis, we can spread our investments very evenly while building the storage platform. This is something like a Pay As You Grow model.
In VSAN, a write operation from each hosted VM turns into as many write operations as the number of data copies we need to retain. These copies are always placed on different nodes, racks or rooms, depending on what we have declared as a failure domain. Rather than increasing the availability of every individual disk, adapter, port or whole node with a hypervisor, SDS aims to tolerate their failures. To comply with this copy replication policy, almost all SDS implementations use low-latency Ethernet. This replication entity, which is usually called the Storage Data Plane, leverages 10Gbps multicast Ethernet in VMware VSAN 6.0. Generally speaking, all replication, rebalancing and recovery (reprotection) traffic between nodes uses this fast network. Thus, properly sizing a large VSAN implementation requires eliminating network bottlenecks in every possible planned and unplanned situation.
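The write fan-out described above can be put into numbers with a minimal sketch. The function name and the sample rates are illustrative assumptions, not VSAN parameters. The only reasoning used is that copies live on different nodes, so at most one copy can be local to the node running the VM and the rest must cross the network.

```python
def write_fanout(vm_write_mbps: float, copies: int) -> tuple[float, float]:
    """Return (total backend writes, minimum replication traffic) in MBps.

    vm_write_mbps: write rate generated by the VMs (illustrative input).
    copies: number of data copies retained, each on a different node.
    """
    if copies < 1:
        raise ValueError("at least one copy must be retained")
    total_backend = vm_write_mbps * copies
    # At most one copy can sit on the node hosting the VM,
    # so at least (copies - 1) copies travel over the network.
    min_network = vm_write_mbps * (copies - 1)
    return total_backend, min_network


if __name__ == "__main__":
    total, network = write_fanout(vm_write_mbps=100, copies=2)
    print(f"backend writes: {total} MBps, replication traffic: >= {network} MBps")
```

With two copies, every 100MBps written by VMs becomes 200MBps of drive writes, and at least 100MBps of it has to cross the Storage Data Plane.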
Since large SDS implementations have far more nodes than retained data copies, our failure domain can be a whole rack, a room and so on. Declaring this allows us to do without redundant network connections to individual nodes. For example, if the VSAN cluster contains one Leaf Top of Rack (ToR) switch per rack, the failure of a NIC or a switch will make a node or a rack unavailable, but not the stored data. Data availability is ensured as long as the other copies are retained on other racks.
Thus, in this simple example, assuming that a single magnetic drive can do 100MBps, we would need roughly 10 drives per node to saturate one 10Gbps interface. Taking into account that our nodes will also run VMs, we will have VM client, live migration and management traffic as well. This requires considering additional network capacity per node. For simplification, I will assume that this traffic can take up to 10Gbps too, so in this example each node will contain two 10Gbps ports.
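As a quick check of the drive-to-link arithmetic, here is a small sketch. It assumes decimal units (1Gbps = 125MBps) and ignores protocol overhead; the function names are illustrative. A strict conversion gives 1250MBps for a 10Gbps link, so the 10 drives used in this article fill about 80% of it, which is close enough for a rule of thumb.

```python
import math


def link_capacity_mbytes(link_gbps: float) -> float:
    """Convert link speed in Gbps to MBps (decimal units, no protocol overhead)."""
    return link_gbps * 1000 / 8


def drives_to_saturate(link_gbps: float, drive_mbytes: float) -> int:
    """Smallest whole number of drives whose combined throughput fills the link."""
    return math.ceil(link_capacity_mbytes(link_gbps) / drive_mbytes)


def link_utilization(drives: int, drive_mbytes: float, link_gbps: float) -> float:
    """Fraction of the link that the given number of drives can fill."""
    return drives * drive_mbytes / link_capacity_mbytes(link_gbps)


if __name__ == "__main__":
    print(drives_to_saturate(10, 100))    # strict count for a 10Gbps link
    print(link_utilization(10, 100, 10))  # share filled by the article's 10 drives
```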
With a whole rack declared as a failure domain, tolerating the loss of one copy of stored data requires no fewer than 3 racks per VSAN cluster. To keep that tolerance during rack maintenance, at least one additional failure domain is considered, so in total I design four racks per VSAN cluster. Next, according to the current vSphere 6 configuration maximums, one VMware VSAN 6.0 cluster can contain up to 64 nodes, hence each rack can contain sixteen 2U nodes with 320Gbps of network capacity per rack in total. So, without any network oversubscription, each Leaf ToR switch must contain no fewer than:
- 8 x 40Gbps ports towards the Spine switches;
- 32 x 10Gbps ports towards the nodes of that rack, two per node.
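The rack and Leaf switch numbers above can be reproduced with a short sketch. The class and field names are hypothetical, and `min_racks` generalizes the "3 racks for one lost copy" statement with the 2n+1 rule for mirrored copies, which is an assumption beyond what this article states.

```python
from dataclasses import dataclass


def min_racks(failures_to_tolerate: int, maintenance_spare: int = 1) -> int:
    """Racks needed as failure domains: 2n+1 (assumed rule) plus maintenance spares."""
    return 2 * failures_to_tolerate + 1 + maintenance_spare


@dataclass(frozen=True)
class RackDesign:
    nodes_per_cluster: int = 64   # VSAN 6.0 cluster maximum used in the article
    racks_per_cluster: int = 4
    ports_per_node: int = 2       # one storage port plus one VM/management port
    port_gbps: int = 10
    uplink_gbps: int = 40

    @property
    def nodes_per_rack(self) -> int:
        return self.nodes_per_cluster // self.racks_per_cluster

    @property
    def downlink_ports(self) -> int:
        return self.nodes_per_rack * self.ports_per_node

    @property
    def rack_gbps(self) -> int:
        return self.downlink_ports * self.port_gbps

    @property
    def uplink_ports(self) -> int:
        # Ceiling division: enough uplinks to carry all downlink traffic (1:1).
        return -(-self.rack_gbps // self.uplink_gbps)


if __name__ == "__main__":
    rack = RackDesign(racks_per_cluster=min_racks(1))
    print(rack.nodes_per_rack, rack.downlink_ports, rack.rack_gbps, rack.uplink_ports)
```

Changing `ports_per_node` or `uplink_gbps` shows how quickly the Leaf port count moves when the per-node assumptions change.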
Then, using a Leaf-Spine topology and having, for example, 5 VSAN 6.0 clusters across 20 racks in 4 rooms with 320 nodes in total, this design will require about 6.4Tbps of bandwidth capacity (20 racks x 320Gbps) and might contain 8 Spine switches, each with no fewer than:
- 20 x 40Gbps ports towards the Leaf ToR switches, one per rack;
- the remaining 40Gbps ports must be designed to connect to client networks (border leaf) and for future platform growth.
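The Spine layer can be sized the same way. This is a sketch under the assumption that every Leaf switch spreads its uplinks evenly across all Spine switches, which is the usual Leaf-Spine wiring but is not spelled out in this article; the function and key names are illustrative. With 20 Leaf switches, 8 uplinks each and 8 Spines, it arrives at 20 leaf-facing ports per Spine and 6400Gbps in total.

```python
def spine_sizing(clusters: int = 5, racks_per_cluster: int = 4,
                 rack_gbps: int = 320, uplink_gbps: int = 40,
                 spines: int = 8) -> dict[str, int]:
    """Size a non-oversubscribed Spine layer for one Leaf ToR switch per rack."""
    leaves = clusters * racks_per_cluster
    uplinks_per_leaf = rack_gbps // uplink_gbps
    if uplinks_per_leaf % spines:
        raise ValueError("uplinks per leaf must divide evenly across spines")
    return {
        "leaves": leaves,
        "total_gbps": leaves * rack_gbps,
        "links_per_leaf_per_spine": uplinks_per_leaf // spines,
        # Every leaf lands the same number of links on every spine.
        "leaf_ports_per_spine": leaves * uplinks_per_leaf // spines,
    }


if __name__ == "__main__":
    for key, value in spine_sizing().items():
        print(f"{key}: {value}")
```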
Implementing the Leaf-Spine topology allows us to consider the following planned and unplanned situations:
- during a Spine switch failure we will lose part of the bandwidth capacity, but not the connectivity between racks;
- during a Leaf switch failure we will lose connectivity to the rack and one copy of the stored data, but the other copies will be retained on other racks.
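Both situations can be quantified with a rough sketch. It assumes uplinks are spread evenly across the Spine switches and that each retained copy sits in a different rack; the function names are illustrative, and the sketch deliberately ignores any quorum or witness logic that VSAN applies on top of the raw copy count.

```python
def rack_uplink_after_spine_failure(rack_gbps: float, spines: int,
                                    failed_spines: int = 1) -> float:
    """Inter-rack bandwidth left per rack when some Spine switches are down."""
    if not 0 <= failed_spines <= spines:
        raise ValueError("failed_spines must be between 0 and spines")
    return rack_gbps * (spines - failed_spines) / spines


def copies_after_rack_loss(copies: int, failed_racks: int = 1) -> int:
    """Worst-case copies left when racks fail, with one copy per rack."""
    return max(copies - failed_racks, 0)


if __name__ == "__main__":
    # One of 8 Spines down: each rack keeps 7/8 of its 320Gbps uplink capacity.
    print(rack_uplink_after_spine_failure(320, spines=8))
    # One Leaf (rack) down with two copies retained: one copy is still reachable.
    print(copies_after_rack_loss(copies=2))
```

With the numbers from this design, a single Spine failure leaves each rack with 280Gbps towards the fabric, while a single Leaf failure removes one copy but leaves the data reachable from another rack.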