This article describes how we built a new, highly scalable cloud hosting solution using IPv6-only communication between commodity servers, what problems we faced with the IPv6 protocol, and how we tackled them to handle more than 10 million active users.
Why Did We Decide to Run an IPv6-Only Network?
At Hostinger, we care a lot about innovative technologies, so we decided to run a new project named Awex that is based on IPv6. Only frontend (user-facing) services run in a dual-stack environment; everything else is IPv6-only for east-west traffic.
I don't want to dive into details in this post, but I will describe the crucial components needed to build this architecture.
We are using pods. A pod is a cluster whose nodes share the same VIP (virtual IP) addresses, announced as anycast, and can handle HTTP/HTTPS requests in parallel. Hundreds of nodes per pod can handle users' requests simultaneously without saturating any single node. Parallelization is done using BGP and ECMP with resilient hashing to avoid traffic scattering. Hence, every edge node runs a BGP daemon that announces VIPs to its ToR switch. As the BGP daemon, we run ExaBGP and use a single IPv6 session to announce both protocols (IPv4/IPv6). The BGP session is configured automatically during the server bootstrap step. Announcements differ depending on the server's role and include the /64 prefix for each node and many of the VIPs for north-south traffic. The /64 prefix is delegated specifically for containers. Every edge node runs plenty of containers, which communicate with each other, with containers on other nodes, and with internal services.
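To illustrate why resilient hashing avoids traffic scattering, here is a minimal Python sketch of the idea. It is not how any particular switch implements it. The class name, the bucket count, the node names, and the use of SHA-256 as the flow hash are all assumptions made for the example. The point it shows is that flows are mapped to a fixed-size bucket table, and when a next hop disappears only the buckets that pointed at it are reassigned, so flows on healthy nodes keep their node.

```python
import hashlib


class ResilientEcmpTable:
    """Illustrative fixed-size bucket table (hypothetical, not a switch API)."""

    def __init__(self, next_hops, buckets=64):
        if not next_hops:
            raise ValueError("need at least one next hop")
        # Fill the buckets round-robin so every next hop gets an equal share.
        self.buckets = [next_hops[i % len(next_hops)] for i in range(buckets)]

    @staticmethod
    def _flow_hash(flow):
        # Deterministic hash of the flow tuple (addresses, protocol, ports).
        digest = hashlib.sha256(repr(flow).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, flow):
        """Return the next hop that currently serves this flow."""
        return self.buckets[self._flow_hash(flow) % len(self.buckets)]

    def remove(self, dead):
        """Reassign only the buckets owned by `dead`; leave the rest alone."""
        survivors = sorted({hop for hop in self.buckets if hop != dead})
        if not survivors:
            raise ValueError("cannot remove the last next hop")
        refill = 0
        for i, hop in enumerate(self.buckets):
            if hop == dead:
                self.buckets[i] = survivors[refill % len(survivors)]
                refill += 1


# Usage: a flow keeps its node unless that node is the one being removed.
table = ResilientEcmpTable(["node-a", "node-b", "node-c", "node-d"])
flow = ("2001:db8::10", "2001:db8:ffff::1", 6, 40000, 443)
before = table.lookup(flow)
table.remove("node-c")
after = table.lookup(flow)
```

With plain modulo hashing over the list of next hops, removing one node would change the divisor and remap most flows. Keeping the table size fixed is what limits the disruption to the flows that were on the removed node.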
Every edge node runs Redis as a slave replica to look up the upstream for a particular application; every upstream is a list of thousands of containers (IPv6 addresses) spanning the nodes in a pod. These huge lists are generated in real time using consul-template. Each edge node has many public IPv4 (512) and global IPv6 (512) addresses. Wondering why? To handle DDoS attacks. We use DNS to randomize the A/AAAA records in a client's response. The client points their domain to our CNAME record, named route, which, in turn, is randomized by our custom service named Razor. We will talk about Razor in future posts.
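The randomization idea can be sketched in a few lines of Python. This is not Razor, whose internals are not described here. The function names, the documentation prefix `2001:db8:0:1::/64`, and the choice of four records per response are assumptions for illustration only. The sketch builds a pool of 512 addresses and picks a random subset for each DNS answer, so different clients resolve the same name to different addresses.

```python
import ipaddress
import itertools
import random


def build_pool(network, size):
    """Return the first `size` host addresses of `network` as strings."""
    hosts = ipaddress.ip_network(network).hosts()
    return [str(addr) for addr in itertools.islice(hosts, size)]


def pick_answers(pool, count, rng=random):
    """Pick `count` distinct addresses at random for one DNS response."""
    return rng.sample(pool, min(count, len(pool)))


# Hypothetical AAAA pool of 512 addresses from a documentation prefix.
AAAA_POOL = build_pool("2001:db8:0:1::/64", 512)

# Each response gets its own random subset of the pool.
answers = pick_answers(AAAA_POOL, 4)
```

The same approach applies to A records with an IPv4 pool. Spreading clients across many addresses means an attack aimed at one address affects only the fraction of clients that were handed that address.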
At first, for the ToR switches, we decided to use OpenSwitch, which is a quite young but interesting and promising community project. We tested this OS in our lab for a few months and even contributed some changes to OpenSwitch, like this patch. There were a number of bugs, and most of them were eventually fixed, but not as fast as we needed. So, we postponed experimenting with OpenSwitch for a while and gave Cumulus a try. By the way, we are still testing OpenSwitch in our lab because we are planning to use it in the near future.
Cumulus allows us to have a fully automated network, where we reconfigure elements such as BGP neighbors, upstreams, firewall, bridges, etc. on changes. For instance, if we add a new node, Ansible will automatically see changes in the Chef inventory by looking at LLDP attributes and regenerate the network configuration for a particular switch. If we want to add a new BGP upstream or firewall rule, we just create a pull request to our GitHub repo, and everything is done automatically, including checking the syntax and deploying changes in production. Every node is connected with a single 10GE interface using Clos topology. Here are a few examples of pull requests:
Problems We Tackled During the Process
- Different format for defining IPv6 addresses: Some services use brackets to wrap an IPv6 address inside ([2001:dead:beef::1]), others do not (2001:dead:beef::1), or the best are (
- Libraries incompatible with the IPv6 protocol: For example, the Sensu monitoring framework doesn't support IPv6, so we moved to Prometheus.
- Cisco IOS bug: We were unable to use a single IPv6 iBGP session for both protocols because Cisco includes the link-local address alongside the global address as the next hop. There were two options for excluding link-local addresses: use private ASs or use loopback interfaces as the update-source. We moved to a private AS number per rack.
- MTU issues, such as receive queue drops: We run many internal services on VMware ESXi nodes, and after launching the project in the lab, we saw many drops on the receive side. After a deep investigation, we figured out that the drops were caused by a larger MTU size than expected (1518 + 22). By default, the NIC has an MTU size of 1500 plus extra underlying headers, including the Ethernet header, checksum, and .1Q. First, I tried to change the ring buffers for the receive queues, but that was enough only for a short time: they filled up too quickly, and the vmxnet3 driver wasn't able to drain them fast enough. I logged into the ESXi host and checked the vmxnet3 stats for a guest machine: # of pkts dropped due to large hdrs:126. These are large-header drops, so I decided to hook into the vmxnet3 driver and check vmxnet3_rx_error() to see what buffer length was hitting the queues. The result was really disappointing: the buffer size was 54 bytes, and it wasn't even an IPv4 or IPv6 packet, just some underlying VMware headers. Finally, by adjusting the MTU on nodes running on ESXi, we were able to handle all packets without dropping them.
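For the first problem in the list above (inconsistent IPv6 address formats), one way to cope is to normalize every address at the boundary between services. The following is a minimal Python sketch under that assumption. The helper names `split_host_port` and `format_endpoint` are hypothetical, and only the standard `ipaddress` module is used. It accepts both the bracketed and the bare form and emits whichever form a given service expects.

```python
import ipaddress


def split_host_port(value):
    """Parse '[addr]', '[addr]:port', a bare address, or 'ipv4:port'.

    Returns a tuple of (ipaddress object, port or None).
    """
    value = value.strip()
    port = None
    if value.startswith("["):
        host, sep, rest = value[1:].partition("]")
        if not sep:
            raise ValueError(f"missing closing bracket: {value!r}")
        if rest:
            if not rest.startswith(":"):
                raise ValueError(f"unexpected text after bracket: {value!r}")
            port = int(rest[1:])
    elif value.count(":") == 1:
        # A single colon cannot be IPv6, so treat it as 'ipv4:port'.
        host, _, port_text = value.partition(":")
        port = int(port_text)
    else:
        host = value
    return ipaddress.ip_address(host), port


def format_endpoint(addr, port=None, brackets=True):
    """Render an address with or without brackets, as the consumer requires."""
    text = str(addr)
    # A port always forces brackets for IPv6 to keep the result unambiguous.
    if addr.version == 6 and (brackets or port is not None):
        text = f"[{text}]"
    return text if port is None else f"{text}:{port}"


# Usage: both input styles parse to the same address object.
addr, _ = split_host_port("[2001:dead:beef::1]")
same, _ = split_host_port("2001:dead:beef::1")
bare = format_endpoint(addr, brackets=False)
with_port = format_endpoint(addr, 443)
```

Keeping addresses as `ipaddress` objects internally and formatting them only when writing a config or calling another service avoids scattering bracket handling throughout the code.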
Conclusions
- The IPv6 protocol is a much better fit for larger infrastructures and scales better.
- Many tools, services, and libraries support IPv6 only partially or not at all.
- IPv6 allows us to define and control address space more granularly than IPv4.
- IPv6 has better performance, even though its packet header is larger than IPv4's: no fragmentation by routers, no header checksum, no NAT.
- Lack of IPv6 is a bug, not just a missing feature.
- We fell in love with IPv6.