The danger of large releases: Trenord case study
In recent days (Italian link) the majority of trains in the Milan area of Italy have run from 15 to 60 minutes late, while a significant number of them have been cancelled, lowering the frequency of the service from every half hour to every hour or more.
The causes of the disaster are multiple: mainly the introduction of a new timetable and of a new system for calculating work shifts (previously done by hand), plus, in the north-west zone, the opening of a new line, the S9.
The transition to the new system impacted 700,000 commuters, who now call themselves Trenord victims (Trenord is the region- and state-controlled firm that runs the trains); people have been stranded in stations because of the lack of capacity on the remaining trains. I am a victim myself, and after three days of trouble I think we can learn something about migrations and upgrades.
The software itself
Trenord bought the GoalRail software through its system integrator, NordCom, 9 months ago. GoalRail itself is time-tested software: it runs the schedules of the French high-speed TGVs and of the whole public transport system of Bogotá, Colombia's capital city.
The package is used throughout the world, and is presumably unit tested, automatically tested, and stress-tested by a QA department. But a large system like Trenord's is composed not only of software, but also of hardware (not very relevant in this case, as it seems to have worked), of configuration and data, and ultimately of its users. This combination is called an information system.
How not to perform a migration
Here's how to greatly increase the risk of catastrophic failure in your information system while migrating or upgrading it:
- Select a date for the NewBigPlatform™ to be released.
- Configure the new system, entering tons of rows into tables and .ini files.
- Toggle the switch on the selected date, and hope for the best.
Reducing the risk of migrations means performing smaller migrations rather than bigger ones; this is not a new concept, and it goes under the names of Continuous Integration and Continuous Delivery/Deployment.
All the unit tests in the world won't save you
The GOOS book proposes the concepts of internal quality and external quality: the larger the scope of the tests you run on a software system, the more customer-oriented they are. For example, end-to-end tests that drive a browser ensure that your web application is providing its intended services; unit tests are very fast and necessary from an engineering point of view, but in the short term they don't matter to the customer (and rightly so).
So the further you move towards the end of your delivery pipeline, the larger and more all-encompassing the tests you should run. The first stage in Continuous Integration is a unit test suite, while the last one is monitoring how many green widgets per hour are sold through your website, and deciding whether to roll back the last release.
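That last pipeline stage, watching a business metric and deciding on a rollback, can be sketched as a simple threshold check; the function name and the 20% tolerance are illustrative, not from any real monitoring tool:

```python
def should_rollback(baseline_per_hour: float, current_per_hour: float,
                    tolerance: float = 0.2) -> bool:
    """Flag the release for rollback when the business metric (widgets sold
    per hour) drops more than `tolerance` below the pre-release baseline."""
    return current_per_hour < baseline_per_hour * (1 - tolerance)

# Sold 100 widgets/hour before the release, 70 after: roll back.
print(should_rollback(100, 70))  # True
print(should_rollback(100, 95))  # False
```

The point is not the arithmetic but the placement: this check runs on production data, after deployment, where no unit test can reach.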
Returning to the train disaster: even if the software is sold as a standard package, that doesn't mean its configuration and database need no tests. Wherever there's risk, there's a case for investing money and time in tests to avoid a larger loss (like thousands of people stranded in stations, experiencing hours of delays to get home after work).
For example, the M5 underground line in Milan is currently undergoing pre-exercise, a trial period of 45 days; during pre-exercise, the underground trains run on a segment of 7 stops, back and forth all day, without a single real user being allowed on board.
What happened with Trenord, instead, is that a new segment of the S9 line was opened without pre-exercise, and the new planning system went online on the same day the timetables changed. The system integrator had 9 months to set up the software, but decided not to invest in monitoring or in breaking down the migration into safer, smaller steps.
Mitigating the risk with smaller releases
Here are some strategies to mitigate the risk of new releases in your information systems.
Canary releasing, a concept introduced by the book Continuous Delivery, consists of releasing software on just one or a few servers instead of the whole farm. When the load is easily parallelizable like this, canary releasing leads to one of two outcomes:
- A good case: you monitor the production servers handling real traffic, and seeing no NullPointerExceptions, fatal errors, or garbage data, you gain the confidence to push the release to all users.
- A bad case: only a small percentage of your servers run the new, broken release, and you suffer only 10% (or whatever the percentage is) of the damage you would have caused with a large new release, celebrated with inauguration scissors.
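A minimal sketch of how a canary split might be wired, assuming a load balancer that can choose a server pool per request; the pool names and the 10% fraction are illustrative:

```python
import random

CANARY_FRACTION = 0.10                     # share of traffic sent to the canary
canary_pool = ["app-canary-1"]             # runs the new release
stable_pool = ["app-1", "app-2", "app-3"]  # still runs the old release

def pick_server(rng=random.random):
    """Route roughly 10% of requests to the canary; the rest stay on stable."""
    pool = canary_pool if rng() < CANARY_FRACTION else stable_pool
    return random.choice(pool)
```

If the canary's error rate and data stay clean, the release is promoted to the whole farm; if not, only a tenth of the traffic ever saw the breakage.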
Feature flags make a feature available and integrated on production servers, but not accessible to end users, or at least not to the general public. Code and database schema changes can be integrated frequently, but thanks to the flags they do not influence the user experience. For instance, instead of going live on a single day, you can pre-insert columns and rows, and monitor the new schema beforehand to ensure it doesn't break other indexes, views, or queries. If you really want to make a schema change or introduce indexes on the very day of a release, raise your hand!
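A feature flag can be as simple as a configuration lookup; here is a minimal sketch, with a hypothetical flag name and group whitelist:

```python
# Flags live in configuration; the code path ships to production either way.
FLAGS = {
    "new_timetable": {"enabled": True, "allowed_groups": {"qa-team", "staff"}},
}

def is_enabled(flag: str, user_group: str) -> bool:
    """The feature is deployed, but visible only to whitelisted groups."""
    conf = FLAGS.get(flag, {})
    return bool(conf.get("enabled")) and user_group in conf.get("allowed_groups", set())
```

Widening `allowed_groups` step by step turns a big-bang launch into a gradual rollout, and flipping `enabled` back to False reverts it without a deployment.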
Back to the train examples: here are some strategies Trenord could have adopted to break down the BigRelease™.
About launching the extension of the S9 line:
- Make a single train run empty on the line; collect feedback on the state of the tracks and electrical wires; fix the problems that will inevitably come up. I hope at least this was done on the S9 line.
- Make a pair of trains run on the line in opposite directions, to test their intersections; feedback and corrections follow.
- Run a whole day of testing with all the trains in circulation.
- Start normal operation, but without opening the trains to the public.
What happened instead? The first trips on the new part of the line broke the electrical wiring, and the segment was closed for half a day (it's now closed for good).
About transitioning to the new shifts:
- Keep calculating shifts by hand, but make a single line run with shifts produced by the new system.
- When the generated shifts can guarantee a complete crew for the test trains, switch to the new system for that line.
- Repeat with another line. It is usually possible to isolate lines because the same trains run on them all day, going back and forth; trains that need to interact with other lines start to have an impact on them, an impact that only grows with time. I'm no domain expert, but it would be such an expert's job to find the lines along which to partition the migration.
By no means should the shift calculation change on the same day the timetables change too. Someone must have seen the occasion as a way to save money by performing a single transition!
But far more money has been lost in what actually happened, due to recalling workers outside their shifts, reimbursements, fines, and overtime to fix the mess.
Speaking of other domains: at Onebip we try to break the migrations of our payment solution into little pieces while introducing new components in place of legacy ones. Some examples:
- Generating diffs between the outputs of the two systems to ensure they are equal, before switching.
- Migrating the front end one country at a time, not all the traffic in a single shot. This gives our business colleagues time to check that the payment page and process comply with the laws and norms of each country, avoiding fines and disgruntled users and merchants.
- Inserting new users into the new system, while keeping the old ones in the legacy one. This temporary duplication of components allows migrating the bulk of the data only after the new system is proven to work with real transactions. And since its database is initially empty, you have the option to test and monitor the new system while its traffic is light and legible.
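The first strategy above, diffing the outputs of the two systems, can be sketched as running both on the same inputs and collecting mismatches; the two compute functions are illustrative stand-ins, not Onebip's actual code:

```python
def legacy_compute(payment_id: str) -> dict:
    # Stand-in for the legacy system's output for a given payment.
    return {"id": payment_id, "amount": 100, "currency": "EUR"}

def new_compute(payment_id: str) -> dict:
    # Stand-in for the new system's output for the same payment.
    return {"id": payment_id, "amount": 100, "currency": "EUR"}

def diff_outputs(ids):
    """Run both systems on the same inputs and collect mismatches;
    switch over only when this list stays empty on real traffic."""
    return [(i, legacy_compute(i), new_compute(i))
            for i in ids
            if legacy_compute(i) != new_compute(i)]

print(diff_outputs(["p1", "p2"]))  # [] -> the systems agree, safe to switch
```

An empty diff over a representative window of real traffic is far stronger evidence than any fixture-based test, because the inputs are ones nobody thought to write down.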
There's no way to predict every interaction of modern software components with people, servers, real data, and with each other. The only way to fully validate a line of code is to put it online and observe its beneficial and adverse effects; for a user story to be done, it must be deployed and hit by real users.
Testing in production doesn't mean testing on real users, or at least not on all of them; the concept of small migrations spans from the Continuous Integration of new commits to the frequent deployment of the system on production servers, and their monitoring in search of regressions.
There are costs involved in routinely testing and monitoring (hopefully automating a big part of these activities), but it's part of being a professional: a large system can provoke large losses. Your job and your reputation are on the line.