We were ready to switch web traffic from the servers in Milan to multiple Amazon Web Services availability zones, and to switch the master-master configuration of MySQL servers from writing to the Milan server to writing to the one on AWS.
However, rehearsals of the migration had shown that you cannot atomically switch all running processes on multiple machines from one master to another, nor is it easy to make sure all connections are terminated. The best solution in terms of cost was to switch everything off, reconfigure, and switch it on again.
The planned downtime
Our system administrators prepared maintenance pages and warned merchants of a planned maintenance at 2 AM that Wednesday night. For the first time in the last few years, Onebip was going to be powered off. This time was chosen because it is nighttime in Europe, our biggest market.
This was the moment when every member of the team realized they couldn't miss the big migration and wanted to be present. I personally went to bed very early, setting my alarm clock for 2 AM; some of us started terminating jobs in the hour before so that we could be ready to switch off during the maintenance window. Eight of us in total stayed up that night, while other colleagues were ready to take over in the morning when we would be exhausted.
Then, at 2 AM, cron jobs were terminated and both DNS and Apache virtual hosts were reconfigured to show the maintenance page. The CPU utilization graphs went down.
Terminating a MySQL master
Looking at the process list, we saw some aggregation jobs still running even after the removal of the cron configurations; they had simply been started before the window. Since these jobs were idempotent and could be rerun at a later time, they could simply be killed.
However, we wanted to take no chances that some misconfiguration or DNS cache would let code still connect to the old master when we switched mysqlprimary.internal.onebip.com (fictitious host name) to the new one. So our administrators went nuclear: they removed the MySQL user the applications were logging in with, and terminated all connections. Only our consoles were able to reconnect, with the root credentials.
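Removing the user and killing its connections are two separate steps, because in MySQL a DROP USER does not close sessions that are already open. A minimal sketch of how the statements could be generated from the process list follows. The user name `onebip_app`, the host pattern, and the function name are hypothetical, and this is not necessarily how our administrators did it.

```python
def lockout_statements(processlist, app_user, app_host="%"):
    """Build the SQL needed to lock an application user out of a MySQL server.

    `processlist` is a list of dicts shaped like rows of SHOW PROCESSLIST
    (only the "Id" and "User" columns are used here).
    """
    # Dropping the user stops new logins, but leaves open sessions alive.
    statements = [f"DROP USER '{app_user}'@'{app_host}';"]
    # So every connection still owned by that user is killed explicitly.
    for row in processlist:
        if row["User"] == app_user:
            statements.append(f"KILL {int(row['Id'])};")
    return statements


# Hypothetical process list: two application connections and one root console.
sample = [
    {"Id": 101, "User": "onebip_app", "Command": "Query"},
    {"Id": 102, "User": "root", "Command": "Sleep"},
    {"Id": 103, "User": "onebip_app", "Command": "Sleep"},
]

for statement in lockout_statements(sample, "onebip_app"):
    print(statement)
```

The root connection is left untouched, which matches the situation where only the consoles could reconnect.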
Give me a go
After the infrastructure was taken care of, we started testing in parallel the different functions of Onebip that could potentially break: mobile billing via PIN flows or by sending SMS to the application, one-click flows where the user is detected through the 3G network, and the fragile ISP billing where the source of funds is a DSL line invoice.
After several manual tests and the activation of jobs, everyone gave the migration a go; we were ready to switch on traffic before 3 AM. The best testers are always the end users. We switched our attention to application and Apache logs to watch for errors, while the system administrators monitored the load on the new machines.
One of these machines was the MongoDB primary server, and it looked like it might be struggling after the traffic surge that followed the reconfiguration of the DNS.
A new MongoDB
If a server has the potential to suffer during the night, imagine what could happen during the evening traffic peak or when we start a large queue of subscription renewals during the day. To take no chances, our administrators spun up a larger instance in 10 minutes, doubling the processing power (and its cost).
While MySQL was a pain to switch, MongoDB is probably better suited to these migrations. It has a centralized configuration where we could attach the new server as a secondary, wait for the replica to be in sync, and fail over to the newly chosen primary without touching the clients. In fact, even when you connect via the MongoDB console from the command line, every server knows whether it is currently the primary and tells you so in the prompt, as well as rejecting your writes if it is not.
At 4 AM half of the crew went to bed, to avoid a difficult next day. A team of 4 went on, changing the project configurations in the source code repositories to match what was now running in the production environment. This let us try, and succeed with, a deployment of every Onebip project to make sure the pipelines were green and our colleagues could work normally the next day.
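The "wait for the replica to be in sync" step can be sketched as a small readiness check. This is only an illustration: it assumes a status document shaped like the output of MongoDB's replSetGetStatus command (members with `name`, `stateStr` and `optimeDate`), and the host names, timestamps, and the one-second threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def replication_lag(status, member_name):
    """Return how far a member's last applied operation is behind the primary's."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    member = next(m for m in members if m["name"] == member_name)
    return primary["optimeDate"] - member["optimeDate"]


def ready_for_failover(status, member_name, max_lag=timedelta(seconds=1)):
    """A member is a reasonable failover target once it is a caught-up secondary."""
    member = next(m for m in status["members"] if m["name"] == member_name)
    if member["stateStr"] != "SECONDARY":
        return False  # still in initial sync, recovering, or already primary
    return replication_lag(status, member_name) <= max_lag


# Hypothetical snapshot: the new, larger instance is still 6 seconds behind.
status = {
    "members": [
        {"name": "mongo-old.example:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2020, 1, 1, 3, 0, 10)},
        {"name": "mongo-new.example:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2020, 1, 1, 3, 0, 4)},
    ]
}

print(ready_for_failover(status, "mongo-new.example:27017"))
```

Only once such a check passes would it make sense to ask the old primary to step down, so that the new member can be elected.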
At 6:20 AM, after deploying and monitoring, the last group went to bed. The next stop was the Volemose Bene restaurant for lunch (seen as breakfast by some of us). It had been no heavier than a long night at a disco; we were ready to toast to the new Onebip on AWS!