Onebip recently moved most of its server base from the Milan data centers to multiple availability zones on Amazon Web Services. While it's easy to build a new service on the cloud, Onebip had been up and running for years before this migration, and our system administrators had to work within the usual constraints of not losing data and not interrupting the service. Since Onebip is a payment system, every minute of downtime translates directly into money lost.
Here's an account of the experience of replacing physical hardware with virtual hardware located in another country, while the system kept running every day. I observed this from a developer's point of view, so please pester the Twitter accounts linked above if you want to know more. :)
The architecture of Onebip consists of (almost) shared-nothing servers running PHP applications. The applications work on several databases (MySQL and MongoDB), which hold the state of purchases, subscriptions and customers' balances.
The first step of the move was thus to spin up multiple web servers in the new Virtual Private Cloud, and make them work on the same databases. The connection between the data centers was provided by a couple of VPN channels.
This task alone occupied the system administrators for weeks, as they had not only to guarantee a safe channel between the data centers, but also to make sure all traffic exited from the same IP addresses as before. We work with several hundred mobile phone operators, which have whitelisted our data center addresses; communicating with them directly from the new VPC was not going to work, and updates to their whitelists are still in progress.
For example, one carrier inserts encrypted headers into HTTP (clearly not HTTPS) requests to identify the phone number. However, it only does so for certain domains, such as onebip.com, that have a commercial agreement with it. Furthermore, it keeps a list of IP addresses that takes months to update: if the servers behind that domain change, you won't get that header anymore.
Moreover, when the application was first designed to run in a single data center, there was unfortunately no constraint on how it talked to the databases (and no shardability). The chatty web servers ran a lot of SELECT and INSERT queries, generating back-and-forth traffic between the two data centers even to serve a single request (our payment page). When you put web servers and databases in separate countries, you can expect that.
So the flow for loading a page from AWS went like this:
Client -> Milan load balancers --VPN-> AWS in Ireland -*> Milan databases
The * denotes the high number of database requests, in contrast to the other channels, where each request passed only once. For a plain PHP page served from AWS that did not connect to the databases, we were talking about 200ms after the optimizations described below; for one that merely connected to them, we were talking about 800 to 1200ms of loading time.
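To make the cost of that starred hop concrete, here is a rough back-of-the-envelope sketch in Python. It assumes that every database round trip pays one full trip across the VPN; the 40ms round-trip time and the round-trip counts are hypothetical values picked for illustration, not measurements from our system. Only the 200ms base comes from the figures above.

```python
def estimated_load_ms(base_ms, round_trips, rtt_ms):
    """Estimate page time when every database round trip crosses the VPN.

    base_ms:     time for a page that never touches the databases
    round_trips: number of sequential database round trips (assumed)
    rtt_ms:      round-trip time between the two data centers (assumed)
    """
    return base_ms + round_trips * rtt_ms


if __name__ == "__main__":
    # 200 ms is the base figure from the text; 40 ms RTT is a hypothetical value.
    for round_trips in (0, 15, 25):
        print(round_trips, "round trips:", estimated_load_ms(200, round_trips, 40), "ms")
```

The point of the model is that the total grows linearly with the number of sequential round trips, so removing queries (or answering them locally) matters far more than speeding up the PHP code itself.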
2 weeks of optimization follow
Our payment page, when accessed through the AWS web servers, was unusable due to the high load time, which went from less than 5 seconds in the worst case to more than 10 seconds. Not only did this latency render the page unusable, it also made us worry about how much bandwidth would be consumed on the VPN connections.
So we set out to optimize the chatter between the web servers and the databases, with several strategies in mind:
- caching: installing Memcache on each web server and storing cacheable data there. Many kinds of data have a TTL of hours or days, such as merchant secret keys, subscription service configuration parameters, or responses from lookup APIs such as TeraWurfl.
- use of secondary servers: you can have more than one secondary both on MySQL and MongoDB, as asynchronous replicas. Data with a TTL of at least minutes such as reports can easily target a replica which is kept in the same data center as the web servers. While with MySQL you can explicitly query a secondary server at the application level, in MongoDB
- Upgrade to PHP 5.4 (incidentally) to speed up the computation, even if that was not the bottleneck.
- Stateless APIs: for some of the URLs hit the hardest by clients, such as our phone number detection service, we moved from storing intermediate information in a MongoDB collection to transmitting it as an encrypted query string on the next redirect.
- Explicit redirects to the Milan data center: we set up an italy.onebip.com virtual host, to which we directed clients whose IP addresses belonged to the problematic operators. The client stayed there just long enough for the encrypted carrier headers to be read, and then returned to onebip.com, which could be served by any web server and any IP address.
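The caching strategy from the list above follows the usual cache-aside pattern: look in the local cache first, and cross the VPN only on a miss or after the TTL has expired. Below is a minimal sketch in Python rather than our PHP code. The `TTLCache` class is an in-process stand-in for a local Memcache instance, and the key name and the `load_merchant_key` loader are hypothetical.

```python
import time


class TTLCache:
    """In-process stand-in for a local Memcache instance (an assumption)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}  # key -> (expiry timestamp, value)

    def get_or_load(self, key, ttl_seconds, loader):
        """Return the cached value, calling loader() only on a miss or expiry."""
        now = self._clock()
        entry = self._items.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()  # the expensive query to the remote database
        self._items[key] = (now + ttl_seconds, value)
        return value


remote_calls = []


def load_merchant_key():
    """Hypothetical loader standing in for a query to the remote data center."""
    remote_calls.append("merchant:42")
    return "secret-for-merchant-42"


cache = TTLCache()
first = cache.get_or_load("merchant:42:key", 3600, load_merchant_key)
second = cache.get_or_load("merchant:42:key", 3600, load_merchant_key)
```

Here the second lookup is answered locally, so the remote database is queried once per hour per web server instead of once per request. The trade-off is that a changed value can stay stale for up to one TTL, which is why this only suits data like the configuration parameters mentioned above.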
After these rounds of optimization, we were running with almost acceptable load times. But high loading times directly affect conversion rates, so it was still unacceptable to use web servers on Amazon...
To be continued
We were left with web servers chatting with databases in another European country, highly optimized but still much slower than in the original data center; outgoing traffic was still passing through Milan, while incoming traffic could (fortunately) go directly to the AWS load balancers in the majority of cases. In part 2, we will see how our sysadmins prepared the big move of the primary databases and the web servers at the same time.