Let’s get this part out of the way. I am a big SimCity fan! SimCity 1 was the first boxed game I ever bought for a computer. Prior to this I had purchased some games at the local Radio Shack that were packaged in freezer bags. My friends and I would huddle around the computer in the school’s math classroom, at the end of a school day, and play as a group. It was intense. OK, not just at the end of the day. I would ride to school with my father on his way to work in the morning, so I could get to school an hour earlier than the school bus to play.
I was very happy to see SimCity 5 come out this week. I anxiously awaited it, pre-ordered it, and had it digitally downloaded to my computer (no boxed products for me!). Of course I only had 30 minutes before I had to leave on a trip, so I didn’t get to play it until I got home later that night.
For those that aren’t in the know, SimCity has you play a mayor in your own city, determining how the city grows. You lay down roads, highways, and trains. You define zoning, taxes, and policies. It is an amazing simulation. They added, in this version, the ability to have regions where the cities buy and sell services and goods from one another. So one city might be on the coast and focus on tourism, while another might focus on power generation, selling power to the first. This is great. Even better is the ability to store your cities ‘in the cloud’ so you can play them from any computer. You can also have friends (or complete strangers) run other cities in the region. This creates a true global market for your cities to live in. I think they have taken SC to the next level of the simulation space, connecting different cities together.
BUT, like any big release, SC had some growing pains this week. I wasn’t affected too much because I was actually working when I wanted to be playing, but others couldn’t log in, couldn’t play, or found the system too slow. An effectively run backend is critical to games that rely completely on hosted services. Just to be clear, if the SC servers went away tomorrow, you would not be able to play at ALL! There is no concept of single player or offline mode with SC. This bothers me a little, since I don’t like humans, and so I don’t like to play games with them. I also tend to play a game all the way through, and then move on. I don’t play a game over and over for years and years. It’s because there is always a new game to play, something new to explore.
Back to the launch hiccups. EA hasn’t come out and described the issues beyond the standard “We’re sorry, and we’re working on it.” type announcements, so everything after this point is just conjecture. And I am really writing more about how you launch YOUR app than SC specifically. I just think in this case they make a great example to study.
Here are some ways thoughtful use of the real cloud could have helped. What do I mean by real? I mean a real cloud, not a data center you built yourself and dubbed a cloud. The Internet is not the cloud. A cloud has to have the core characteristics of a cloud, or it’s just a bunch of well placed sand.
1. A cloud lets you focus on the business- EA and Maxis are not data center, IT, or cloud companies. They are in the business of building awesome games. I have gladly given them plenty of my money over the past 30 years to prove this out. To that end, they should not be spending a lot of time and money building an IT muscle that isn’t core to their passion. Is the data center important to the game? YES! Important, but it is not a strategic differentiator. Their game IP and mechanics are their differentiator. They should leverage someone else’s cloud, architect it properly, manage it effectively, and save tons of energy and money for what is more important to them. Building your own cloud, especially in this scenario, is a waste of capital. Clearly in this case, they didn’t have enough hardware in the proper global locations (I play on an Eastern European server because the North American servers are more than ‘busy’), and they don’t have the flexibility to easily adjust that investment IN HOURS!
2. All things decay- Let’s face it, SC is the hotness right now, but in a month they will see a decline in player hours from the peak. This will level off to probably 60% of launch day peak as a daily average. As they launch the game in new markets (it wasn’t a global launch) they will see new peaks, but these will also slowly erode. A year from now, the daily average will be less and less, until years from now they will have so few players they will take the servers offline.
That’s OK, all things decay. But as their peak server load decays, they will be left with hardware that is aging, and that they have paid for. In a cloud environment, they have elasticity. They can scale up and down as their gamers demand, and only pay for the servers they are using. You should leverage this strength in the cloud. We can finally move away from these sorts of resource constraints, and say ‘we will have the right amount of hardware when we need it, and no longer.’
And if you think this makes this super easy, it doesn’t. You still have to know how your app lives and breathes, how it scales, and when to scale. You have to understand the concept of scale units, automatic scale rules, and distribution.
The cloud don’t fix stupid.
The cloud would also give them the ability to scale by the hour. So when their peak play times come online, which likely ripple across the globe with the setting sun, they can spin servers up or down in each geo, saving money during low periods and always having the right amount of hardware at the high times.
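To make the hourly scaling idea concrete, here is a minimal sketch of a scale rule for one region. Everything in it is an assumption for illustration: the capacity of 500 players per server, the floor of 2 servers, and the forecast numbers are made up, and `servers_needed` and `hourly_plan` are hypothetical helpers, not anything EA or a cloud provider ships.

```python
import math

# Assumed values, for illustration only.
PLAYERS_PER_SERVER = 500   # hypothetical capacity of one server
MIN_SERVERS = 2            # hypothetical floor so a region never goes dark


def servers_needed(active_players, players_per_server=PLAYERS_PER_SERVER,
                   min_servers=MIN_SERVERS):
    """Return how many servers a region needs for a given player count."""
    return max(min_servers, math.ceil(active_players / players_per_server))


def hourly_plan(forecast):
    """Map each hour's forecast player count to a server count."""
    return [servers_needed(players) for players in forecast]


# Made-up six-hour forecast as the evening peak arrives and fades.
forecast = [200, 1200, 4800, 5000, 2600, 400]
plan = hourly_plan(forecast)

elastic_server_hours = sum(plan)              # pay only for what each hour needs
fixed_server_hours = max(plan) * len(plan)    # size for the peak, pay all day

print(plan)
print(elastic_server_hours, fixed_server_hours)
```

With these invented numbers the elastic plan uses 33 server-hours where a fleet sized for the peak would use 60. The real savings depend entirely on how spiky your own load is.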
3. You didn’t test this?- I know that they had been beta testing the game with external players for quite some time. Not only is this a common testing/game balancing strategy to make the game better, but it is becoming a great support and marketing strategy as well.
On day one, during these issues, they had a group of experienced game players in the forums, on their own time, contributing to the tech support issues. Things like ‘will this run on my machine?’, ‘how do I draw more tourists?’, and ‘why are people getting sick when I put the water tower next to a chemical spewing factory?’
These beta testers also generated buzz and content about the game before it came out.
Their failure here was that they only tested the local game engine. That answered questions on hardware compatibility, network latency performance, and game balance. I can guarantee they did not have enough real players in their beta group to put a real world load on the servers. How do I know? Their servers were down most of the time on launch day (or at least that is what gamers perceive, and perception is reality.)
Another issue with testing in a traditional environment (and this probably applies more to you than someone like EA) is that the hardware available in the QA server farm is last year’s production hardware (at best). It isn’t identical, it doesn’t meet the latest requirements, and it probably hasn’t been configured to the production spec. Sometimes this can be called server drift, where the configuration of the servers starts to evolve away from the data center’s intentions through inattention.
In the cloud, all of your servers are the same. So it is easy to spin up a test group that you KNOW will be identical to the production group. Also, they could have used the cloud to simulate players, perhaps even replaying recorded human commands. Multiply this by 100,000 by scaling your test bed in the cloud, and point that at your servers.
Now, you don’t start with 100,000 simulated players. You start with some reasonable unit, against one server. Perhaps 100 players against a server. You dial up the player count until that one single server dies. You mark that as your basic unit of scale (not just for compute, but also for storage, connections, etc.) And slowly crank it up from there by scaling out horizontally, perhaps to four or five servers. This will help you test how the system behaves under scale. The physics involved with an app on one server are very different from an app running across ten servers. You need to test this. And you need to learn where the bottlenecks are.
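The 'dial it up until it dies' step can be sketched in a few lines. This is only a sketch under loud assumptions: `simulated_latency_ms` is a made-up stand-in for a real measurement against a real server, and the 400 player capacity, the 250 ms latency budget, and the step size are all invented. In a real load simulation you would swap that function for one that drives simulated players at your server and reports what it observed.

```python
def simulated_latency_ms(players, capacity=400):
    """Hypothetical stand-in for a real measurement.

    Latency climbs sharply as the player count nears an assumed capacity.
    """
    if players >= capacity:
        return float("inf")  # the server has fallen over
    return 50.0 / (1.0 - players / capacity)


def find_scale_unit(measure, start=100, step=50, max_latency_ms=250.0,
                    limit=100_000):
    """Dial up the player count until latency breaks the budget.

    Returns the last player count that stayed within budget, or 0 if
    even the starting load was too much.
    """
    last_good = 0
    players = start
    while players <= limit:
        if measure(players) > max_latency_ms:
            break
        last_good = players
        players += step
    return last_good


unit = find_scale_unit(simulated_latency_ms)
print(unit)
```

Whatever number comes out for a single server is only the starting point. The same ramp then gets repeated against four or five servers to see whether capacity actually grows in step with the server count.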
This pressure will also teach you what the vital signs of the system are, and what their break points are. Were they monitoring simple things like CPU and memory? What about ‘request queue depth’, response latency, and the number of simultaneous users? You don’t really KNOW what to care about until you have done these tests. And these tests are so far past what people think of as testing, perhaps they should be called load simulations.
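Once a load simulation has told you where things break, checking the vital signs is the easy part. Here is a small sketch, assuming invented break points and a hypothetical `VitalSigns` record; the real thresholds have to come out of your own tests.

```python
from dataclasses import dataclass


@dataclass
class VitalSigns:
    """One snapshot of a server's health (hypothetical field set)."""
    cpu_percent: float
    request_queue_depth: int
    p95_latency_ms: float
    simultaneous_users: int


# Assumed break points, for illustration only.
BREAK_POINTS = {
    "cpu_percent": 85.0,
    "request_queue_depth": 100,
    "p95_latency_ms": 250.0,
    "simultaneous_users": 300,
}


def breached(signs, limits=BREAK_POINTS):
    """Return the names of the vital signs that exceed their break point."""
    return [name for name, limit in limits.items()
            if getattr(signs, name) > limit]


# CPU looks healthy here, yet the queue and latency say the server is drowning.
sample = VitalSigns(cpu_percent=40.0, request_queue_depth=180,
                    p95_latency_ms=900.0, simultaneous_users=280)
print(breached(sample))
```

The made-up sample is the case that bites people: if you only watched CPU and memory, this server would look fine.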
I could go on and on about how the cloud could have saved SimCity, but it’s too late for them, and I have a coastline to develop with high density residences. Even if the system you are working on isn’t going to be a globally played simulator, but just a simple little app, please make sure you are testing it under load. Don’t just do happy path testing, test beyond what you expect. There is some serious engineering that goes on with high load systems, and without these tests you won’t KNOW what you need to do until it is too late.
With some back of the napkin calculations, they could have performed some serious scale simulations for around $10,000. That might seem like a lot, but compared to their marketing budget, their production budget, and the damage to their brand, it isn’t much at all.
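If you want to run your own napkin math, it is just multiplication. The inputs below are invented for illustration; they are not my actual figures and not any provider’s real prices, so plug in your own instance counts, durations, and hourly rates.

```python
def simulation_cost(instances, hours, hourly_rate):
    """Rough cost of renting `instances` machines for `hours` at a flat rate."""
    return instances * hours * hourly_rate


# Hypothetical inputs: 1,000 rented machines for one day at an assumed
# $0.40 per machine-hour.
estimate = simulation_cost(instances=1000, hours=24, hourly_rate=0.40)
print(f"${estimate:,.0f}")
```

The point is not the exact number but that the test bed only costs money for the hours it is running.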