At RailsConf 2012, we sat down with five of our customers (including Jesse Proudman, panel moderator and CEO of the Blue Box Group) and asked them about the ups and downs of scaling massive websites. In the session, they discussed how to manage millions of unique visitors, unexpected traffic bursts and more.
NR: What are the three most actionable items you should pay attention to as your application grows?
JW: As your application grows, you must keep your eye on (at least) the following:
- Slow web transactions: Focus first on those with the highest total time consumed, as this will ease the load on your data layer. But be mindful of the overall slowest requests, as these are terrible experiences for your users. When you’re early in your growth, those first users are critical ambassadors. Don’t make them wait!
- Set up a realistic Apdex on your Real User Monitoring (RUM): Users don’t care how long your app layer took if your total page render time sucks. Use an aggressive Apdex on RUM and work on improving that number every step of the way. Use a CDN and a professional DNS network (Dyn Enterprise, AWS Route53, etc.) immediately.
- Consolidate Monitoring: Put all the data you can on one table. Use APM’s Server Monitoring on all your hosts. Monitor and measure your background jobs along with your app requests. Use custom metrics where appropriate to give you a big-picture view of everything in one place.
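To make the first two points concrete, here is a minimal sketch in Python. It is not New Relic's implementation: the transaction names and timings are made up, and the functions `rank_by_total_time` and `apdex` are hypothetical helpers. The Apdex calculation follows the standard definition, where requests at or under the threshold T count as satisfied, requests at or under 4T count as tolerating, and the score is (satisfied + tolerating / 2) / total.

```python
from collections import defaultdict


def rank_by_total_time(samples):
    """Group (transaction, seconds) samples and sort by total time consumed."""
    totals = defaultdict(float)
    for name, seconds in samples:
        totals[name] += seconds
    # Highest total time first: these put the most load on the data layer.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def apdex(response_times, threshold):
    """Standard Apdex score for a list of response times (in seconds)."""
    if not response_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical samples: (transaction name, response time in seconds).
samples = [
    ("items#index", 0.3),
    ("items#index", 0.4),
    ("items#index", 0.5),
    ("items#index", 0.4),
    ("reports#export", 1.5),
]

ranking = rank_by_total_time(samples)          # busiest transaction first
slowest = max(samples, key=lambda s: s[1])     # single worst request
score = apdex([seconds for _, seconds in samples], threshold=0.5)
```

In this made-up data the frequently hit `items#index` tops the total-time ranking even though the single slowest request belongs to `reports#export`, which is why both views are worth watching.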
NR: What two items caught your team by surprise?
JW: There are many, but two that come to mind immediately are how easy it is to let queries get out of hand, and how non-DB requests can sneak up and slow you down.
ORMs such as ActiveRecord are blessings for productivity and, used wisely, have minimal performance impact overall. But it’s fairly easy for even an experienced developer to let suboptimal usage slip in. Classic cases are N+1 queries that should use eager loading, or queries not hitting useful indexes. Proper use of APM will highlight these immediately when given a representative data set. Hopefully that’s in a staging environment, but even if it happens in production, it’s better to catch it quickly than to have no idea what’s slow.
With respect to non-DB requests, we’ve been bitten by libraries accessing Memcache thousands of times per request because of errors in our usage. This “mystery time” went unexplained until we added metric coverage for Memcache calls. At the very least, EVERYTHING that talks to the network needs to be covered by its own metrics.
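The N+1 case mentioned above is easy to reproduce outside of any ORM. The following sketch uses Python's built-in `sqlite3` module as a stand-in for the application database; the `authors` and `posts` tables, the `run` helper, and the query counter are all assumptions made for illustration. The second function fetches all child rows in one `IN (...)` query, which is roughly what eager loading does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors (id, name) VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO posts (author_id, title)
        VALUES (1, 'a1'), (1, 'a2'), (2, 'b1'), (3, 'c1');
""")

query_count = 0


def run(sql, params=()):
    """Execute a query and count it, as a crude per-call metric."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def titles_n_plus_one():
    """One query for the authors, then one more query per author."""
    result = {}
    for author_id, name in run("SELECT id, name FROM authors ORDER BY id"):
        rows = run(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,),
        )
        result[name] = [title for (title,) in rows]
    return result


def titles_eager():
    """Two queries in total, regardless of how many authors there are."""
    authors = run("SELECT id, name FROM authors ORDER BY id")
    ids = [author_id for author_id, _ in authors]
    placeholders = ",".join("?" * len(ids))
    rows = run(
        f"SELECT author_id, title FROM posts "
        f"WHERE author_id IN ({placeholders}) ORDER BY id",
        ids,
    )
    by_author = {author_id: [] for author_id in ids}
    for author_id, title in rows:
        by_author[author_id].append(title)
    return {name: by_author[author_id] for author_id, name in authors}


query_count = 0
lazy = titles_n_plus_one()
lazy_queries = query_count

query_count = 0
eager = titles_eager()
eager_queries = query_count
```

Both functions return the same data, but the first issues one query per author plus one, while the second stays at two. Counting calls this way is also a small-scale version of giving everything that talks to the network its own metric.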
NR: Walk us through your capacity planning process.
JW: Our capacity planning process is very basic and organic. We have deployed fully into the AWS cloud and use automation tools such as Chef to make adding servers (and configuring them for New Relic) as painless as possible. We look at our Server Monitoring to see if we’ve crossed a given threshold of activity, say 65% CPU, and then increase the size of our cluster by 25-50%. The numbers may be subject to whim, as we don’t have hard and fast policies here. We’ve been stable on the same size cluster for some time now, thanks to continual performance improvements.
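The rule of thumb described above can be written down in a few lines. This is only a sketch of the idea, not the team's actual tooling: the function name `plan_cluster_size`, its parameters, and the choice to round up to whole hosts are assumptions. The 65% threshold and the 25-50% growth range come from the answer.

```python
import math


def plan_cluster_size(current_hosts, avg_cpu_percent,
                      cpu_threshold=65.0, growth=0.25):
    """Return the target host count for a simple threshold-based policy.

    growth is the fractional increase (0.25 to 0.50 in the interview);
    the result is rounded up so that growing always adds a whole host.
    """
    if avg_cpu_percent <= cpu_threshold:
        # Threshold not crossed: keep the cluster as it is.
        return current_hosts
    return current_hosts + math.ceil(current_hosts * growth)


# Hypothetical readings for an 8-host cluster.
unchanged = plan_cluster_size(8, avg_cpu_percent=50.0)
grown_25 = plan_cluster_size(8, avg_cpu_percent=70.0, growth=0.25)
grown_50 = plan_cluster_size(8, avg_cpu_percent=70.0, growth=0.50)
```

In practice the output of a function like this would feed whatever provisions the servers (Chef, in the setup described), and the threshold would be tuned by judgment instead of being fixed policy.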
NR: How many people are accessing your site on mobile devices and how do you optimize for that?
JW: Currently about 10% of the traffic to our primary web property (Coolspotters.com) is mobile, and the entire user base of our native iOS apps is mobile. I don’t have specific numbers handy for a quick reply.
NR: How do you solve the data challenge and what do you do with the data you collect?
JW: We use a variety of tools to collect and derive value from data. Our primary storage for core application data is Percona Server (a MySQL distribution). We use MongoDB for secondary services and for fire-and-forget analytics collection. For reporting we use unsophisticated SQL jobs run periodically to roll up data into aggregation tables for visualization on a dashboard or exporting. We have plans to add a real-time aspect to this, and to use more of MongoDB’s map-reduce capabilities, but what we have today is sufficient.
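A periodic roll-up job of the kind described can be as simple as one `INSERT ... SELECT` with a `GROUP BY`. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the MySQL-compatible server mentioned in the answer; the `events` and `daily_event_counts` tables and their columns are hypothetical, and `INSERT OR REPLACE` is SQLite syntax chosen so the job can be re-run safely.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (occurred_on TEXT, kind TEXT);
    CREATE TABLE daily_event_counts (
        day TEXT,
        kind TEXT,
        total INTEGER,
        PRIMARY KEY (day, kind)
    );
    INSERT INTO events VALUES
        ('2012-04-23', 'view'), ('2012-04-23', 'view'),
        ('2012-04-23', 'share'), ('2012-04-24', 'view');
""")


def roll_up(connection):
    """Recompute per-day, per-kind counts into the aggregation table."""
    with connection:  # commits on success, rolls back on error
        connection.execute("""
            INSERT OR REPLACE INTO daily_event_counts (day, kind, total)
            SELECT occurred_on, kind, COUNT(*)
            FROM events
            GROUP BY occurred_on, kind
        """)


roll_up(conn)
roll_up(conn)  # running the job twice leaves the same totals

summary = conn.execute(
    "SELECT day, kind, total FROM daily_event_counts ORDER BY day, kind"
).fetchall()
```

A dashboard or export would then read from `daily_event_counts` instead of scanning the raw events. A scheduler such as cron would be one way to run the job periodically.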