In the aftermath of Super Storm Sandy, this panel of CTOs from AppNexus, adMarketplace, Tapad, x+1 and Aerospike discussed issues and best practices in architecting and operating Real-time Big Data systems at ad:tech New York, 2012. The transcript of this event is broken out into four topics outlined below. This is the third of the series on secrets to scaling systems and thoughts on Hadoop:
- Impact of Real-time Big Data on the Business
- Super Storm Sandy and 100% Uptime
- CTO Secrets to Scaling Systems & the Scoop on Hadoop
- CTOs tips – Developing a Scalable system, Problem Solving at Scale
Audience member: Can some of you comment on what lessons have been learned from scaling systems, something that’s not obvious?
[21:51] Mike N: That’s a broad question. I’ll give you two answers to that. Specifically, two lessons learned. I think for us, we found anything that’s not simple will fail. And now we’re rare in terms of throughput and volume, because we’re just running on so many servers, running so many ads every single second. The simpler the architecture, the fewer points of failure, hands down is the best. Which is actually why we threw out all our load balancers, because at some point we found that load balancers start dying at 600,000 qps. We found that we had more outages due to load balancer issues than we had due to application or software issues.
We ended up effectively embedding load balancing into our direct applications, and magically things got a lot better. I think simplicity is just, hands down, hands down, where you actually want to be.
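As an illustration only (the panel did not describe its implementation), embedding load balancing in the application can be sketched as a client-side round-robin picker that skips backends recently marked as failed. The class name `ClientSideBalancer`, the `cooldown_seconds` parameter, and the backend names are all hypothetical.

```python
import itertools
import time


class ClientSideBalancer:
    """Round-robin over backends, skipping any that failed recently.

    Hypothetical sketch: not the system described by the speaker.
    """

    def __init__(self, backends, cooldown_seconds=5.0):
        self._backends = list(backends)
        self._cycle = itertools.cycle(self._backends)
        self._cooldown = cooldown_seconds
        self._down_until = {}  # backend -> time after which it may be retried

    def mark_failed(self, backend, now=None):
        """Take a backend out of rotation for the cooldown period."""
        now = time.monotonic() if now is None else now
        self._down_until[backend] = now + self._cooldown

    def pick(self, now=None):
        """Return the next backend in rotation that is not cooling down."""
        now = time.monotonic() if now is None else now
        for _ in range(len(self._backends)):
            backend = next(self._cycle)
            if self._down_until.get(backend, 0.0) <= now:
                return backend
        raise RuntimeError("no healthy backends")
```

The calling code would invoke `pick()` before each request and `mark_failed()` when a request to that backend errors out, so there is no separate load balancer tier that can fail on its own.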
And then the second one is full, end-to-end automation of everything you do, like your example about manual error, right? People fail. People naturally fail. They’ll fail 80% of the time on the average day. This is not a problem; this is not a bad person. Your good engineer will make mistakes all the time. Now, at two in the morning, you make mistakes most of the time.
Anything that’s not automated will have production issues. It’s really that simple. If you can’t point-and-click deploy what you have, if you can’t point-and-click roll back, you can’t point-and-click debug, you’ll simply have production issues because at two in the morning someone fat-fingers it and types the wrong thing, and “Oops. Shoot. I did something wrong.”
And so, you want to keep it incredibly simple, and then automate the snot out of it, is what our head of tech ops actually says. You look at where we have production issues, we’re not perfect, and the only place where we have serious issues is where we do not have full automation in place. Where we can’t just pull up a new server, or fail over a data center, or anything like this.
And our stack, it’s almost all home grown. We use Aerospike for key value stores, we use Vertica for columnar databases for reporting, and then the rest of this, I don’t know if you call Hadoop a license, because you have to do so much engineering around Hadoop to make it work. And then everything else is home grown software for us.
[27:35] Mike Yudin: So I could probably add to this. Mike is saying, keep your systems simple, and that’s key to this. The way you keep it simple is you have to be very smart in dividing your intelligence into online and offline. You do all the heavy lifting of predictive modeling, all the crazy algorithms, offline. What you program into your real-time system is just really quick lookups.
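A minimal sketch of that offline/online split, under the assumption that the "model" is nothing more than a click-through rate per (segment, ad) pair. The function names `build_score_table` and `choose_ad` and the sample data layout are hypothetical, not taken from the panel.

```python
def build_score_table(training_rows):
    """Offline step: aggregate click-through rate per (segment, ad) pair.

    Each row is (segment, ad_id, clicked). This stands in for whatever
    heavy modeling would actually run in a batch job.
    """
    counts = {}
    for segment, ad_id, clicked in training_rows:
        shown, clicks = counts.get((segment, ad_id), (0, 0))
        counts[(segment, ad_id)] = (shown + 1, clicks + int(clicked))
    return {key: clicks / shown for key, (shown, clicks) in counts.items()}


def choose_ad(score_table, segment, candidate_ads, default_score=0.0):
    """Online step: only dictionary lookups happen on the request path."""
    return max(
        candidate_ads,
        key=lambda ad: score_table.get((segment, ad), default_score),
    )
```

The table would be rebuilt offline on whatever schedule the data allows and then loaded into the serving process or a key value store, so request latency does not depend on model complexity.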
Another piece of advice is to keep your system asynchronous. Because as soon as you have components depending on other components, depending on other components, and everything waits for the other thing to respond, everything works just fine until one little thing fails. Then there’s going to be a cascading effect, everything is going to come to a crawl, and you just have an avalanche of degradation to the system.
So you have to have a graceful degradation policy, and you have to have asynchronicity as much as possible. We just had a discussion right before we started, about blocking threads, and these kinds of things. That’s the key to this. As far as technology stack, and we all were just discussing this, pretty much anything works. These days, hardware is powerful. Use a proven platform, whatever it is, C, Java, .Net, they all work. Probably not a good idea to program your real-time ad server on Ruby on Rails, but other than that it hasn’t been a problem.
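One way to picture asynchronous calls with graceful degradation is a wrapper that gives each dependency a time budget and falls back to a default when the budget is exceeded. This is a sketch using Python's `asyncio`; the helper `call_with_fallback`, the two simulated lookups, and the 50 ms budget are assumptions for illustration.

```python
import asyncio


async def call_with_fallback(make_call, timeout_seconds, fallback):
    """Await make_call(), but return fallback if it is too slow or fails."""
    try:
        return await asyncio.wait_for(make_call(), timeout=timeout_seconds)
    except (asyncio.TimeoutError, ConnectionError):
        return fallback


async def slow_profile_lookup():
    await asyncio.sleep(1.0)  # simulates a dependency that has stalled
    return {"segment": "sports"}


async def fast_profile_lookup():
    return {"segment": "news"}


async def handle_request():
    """Run both lookups concurrently; neither can hold up the response."""
    default_profile = {"segment": "unknown"}
    return await asyncio.gather(
        call_with_fallback(slow_profile_lookup, 0.05, default_profile),
        call_with_fallback(fast_profile_lookup, 0.05, default_profile),
    )
```

Running `asyncio.run(handle_request())` returns the default profile for the stalled dependency and the real answer for the healthy one, so one slow component degrades the result instead of blocking the whole request.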
Audience member: What’s beyond Hadoop?
[29:20] Mike Yudin: What’s beyond Hadoop? I’m not going to tell you, because we don’t use Hadoop. [laughter] We don’t use Hadoop that much. We found it’s too slow for us. The processing cycle for data in Hadoop is just…
Mike Yudin: Well, the main principle of Hadoop is a distributed, kind of grid computing system. That’s not going to go anywhere. You have to do that. People are trying all kinds of things. We ended up writing our own proprietary system. Whether that’s going to become the Internet standard or not, I don’t know. But, probably like some of my colleagues, we found that very, very few commercial or even open source solutions support this key element, so we ended up programming a lot of this ourselves.
It’s kind of tragic, but that’s what it is.
[30:10] Dag: I would say, in terms of lessons learned, metrics. Keep metrics of everything, because scale kind of creeps up on you. You start seeing your latencies jitter, and you want to correlate it. You should try to figure out what’s going on. If you’re building an adtech system, you have a lot of moving parts. You have a lot of endpoints that get hit. These have different impacts on your system, and if you don’t have metrics, you’re pretty much blind.
Scale, web traffic shifts, grows from month to month. We’re hooked into AppNexus and they get more traffic all the time and then one day all of a sudden we realized, “Oh, crap, we’re over.” Things would slide just a little and then we’re over 100K qps, for instance, and we see that it’s this kind of traffic that causes this kind of ripple.
Any sort of debugging, if you have any sort of performance regressions, looking at the metrics is always the number one thing we use for debugging. You can’t live debug and step through code in production, and sometimes the only way you can debug some things is actually hitting them with real traffic, unless you have unlimited resources. But metrics is what I would say.
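As a rough illustration of per-endpoint metrics (not the panelists' tooling), the sketch below records latency samples per endpoint and reports nearest-rank percentiles. The class `LatencyMetrics` and the endpoint names are hypothetical; a production system would more likely use bounded histograms than raw sample lists.

```python
import math
import time
from contextlib import contextmanager


class LatencyMetrics:
    """Keeps raw latency samples per endpoint (hypothetical helper)."""

    def __init__(self):
        self._samples = {}

    def record(self, endpoint, value):
        self._samples.setdefault(endpoint, []).append(value)

    @contextmanager
    def timed(self, endpoint):
        """Time the enclosed block and record it, even if it raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(endpoint, time.perf_counter() - start)

    def count(self, endpoint):
        return len(self._samples.get(endpoint, []))

    def percentile(self, endpoint, pct):
        """Nearest-rank percentile of the samples for one endpoint."""
        ordered = sorted(self._samples.get(endpoint, []))
        if not ordered:
            raise ValueError(f"no samples for {endpoint!r}")
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]
```

Wrapping each handler in `with metrics.timed("bid"):` and watching the 99th percentile per endpoint is one way to correlate latency jitter with the kind of traffic that causes it.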
And beyond Hadoop, [laughs] a more efficient file storage on the [??-31:35], I think. That’ll do a lot. Having flat files for everything is…
I hope that Brian and Srini don’t try to turn Aerospike into a multi-tool, and that they understand what they’re really, really good at, which is having key value, NoSQL based data at incredibly low latency and incredibly fast. So I think that’s the best answer to your question, that Hadoop is a tool that’s really good at some things. There are new tools coming out that are really, really, really exciting, and Aerospike is one of them. I’m also very excited about stream-based processing, which I think we’re going to start seeing more of. Which could be, I don’t know if you guys are talking about some of this, new products or things like that. That’s what I think is going to get really, really exciting.
Because then people’s heads explode. “Where’s my budget? I didn’t ask for enough budget. I have to justify more.” Stuff like that. So think about the impact of your business model with your scalability.
Audience member: From the standpoint that Hadoop is not the answer, I already know that, what’s the future?
Dag: I don’t think — Hadoop is a lot of things…
Srini/Moderator: I think we’re getting a little bit off the topic. This is about real-time big data. Hadoop is not real-time in any way, shape or form, as far as I understand.
Audience member: I understand, a lot of people try to write plug-ins, and…
Srini/Moderator: Yeah, it is a way to — for example, how Memcache speeds up MySQL is the same philosophy there. But then things like Aerospike, and a whole bunch of other NoSQL databases, they try to actually do a database at the speed of a cache. So beyond Hadoop, you know? If you ask me, I’m biased. I’d say Aerospike. [laughter]
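The "cache in front of a database" philosophy mentioned here is commonly implemented as a cache-aside read path. The sketch below uses an in-process dictionary as a stand-in for a cache server and a plain function as a stand-in for the database query; `CacheAside` and `load_from_database` are hypothetical names, and no real Memcached or MySQL client API is shown.

```python
class CacheAside:
    """Read-through helper: check the cache first, then the slow store."""

    def __init__(self, load_from_database):
        self._load = load_from_database  # stand-in for a slow database query
        self._cache = {}                 # stand-in for a cache server

    def get(self, key):
        if key in self._cache:
            return self._cache[key]      # cache hit: no database round trip
        value = self._load(key)          # cache miss: pay the slow path once
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a key after the underlying row changes."""
        self._cache.pop(key, None)
```

The limitation of this pattern is that the first read of each key and every read after an invalidation still pay the full database latency, which is the gap the moderator says databases running "at the speed of a cache" try to close.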
[32:30] Mike N: I’d love to make two points. One, you only have a certain number of tools in your toolbox, right? Hadoop is one of those tools. And if you try to screw a screw with a hammer, it doesn’t work very well. Just like if you try to use a screwdriver to hammer in a nail, it doesn’t work very well. Hadoop is the right tool for some things, Aerospike is a fantastic tool for some things, and Vertica’s a fantastic tool for other things. The mistake people make with all of these, including key value stores, is that they try to smoosh in too much functionality, make it too general purpose, and try to make this super-fancy multi-tool that does everything.
And it turns out to be pretty mediocre at everything. When I get pitches from vendors, a lot of times they sell me too much. There are a lot of these people working on distributed real-time MySQL systems. One of these guys came in, and pitched to us that “We can be your key value store.” I actually rejected it outright, simply because I don’t want a key value store that does SQL and joins. It means it’s a multi-tool, which you just know is going to have some kind of complicated performance issues. It’s just too complex for what I want.
[34:30] Mike Y: Well, I would like to add to this a little bit. There are more and more companies you see at tradeshows like this that are always on the cutting edge, and they have the most sophisticated algorithms, the most amazing model ever. Hadoop is just not good enough for them. Anything is not good enough for them, because they process so much data, they have to do the next coolest thing. The truth of the matter is that there are very few actual working models and very little intelligence in this advertising world. A few things really work.
If you’re trying to solve a really complex problem, beyond capabilities of the standard stacks and proven technologies, I’m going to bet you a hundred bucks you’re probably not going to solve the right problem.
[35:17] Mike N: Can I add, and then I’m going to totally let you take over. I think what you just said is exactly true. What happens is, doing something at low scale — like what you hear from all the CTOs, we’re saying, scale, scale, scale, scale, scale kills — because it’s very easy to build an online advertising product at low scale. It’s very easy to build a super snazzy, dynamic, creative, it’s interactive, it talks to you, it uses your web cam, you can get real-time reporting on the back end — if you’re only serving a couple thousand ads a day, no problem. I’ve got an engineer who could turn that around for you in a month. The problem is when you start doing it a million times a day, and a billion times a day, and ten billion times a day, and 40 billion times a day. That’s when all those features and functionalities that you have in that really cool product break.
And one of the problems with innovation in all advertising is that people don’t think about scale. People raise VC money, build a prototype that does a lot of really cool stuff, they say they do it at scale but it really doesn’t do it at scale, then they hit scale, and then the shit breaks, and then you have to rebuild everything. And there just aren’t enough commercial tools out there to make these problems go away. So you suddenly end up needing 40 engineers to make all this actually work.
[36:26] Pat: Use the right tool for the right job, as Mike got to, and be willing to iterate and explore before you get into production. We have a lot of complexity, I would say, in our system, but we’ve approached it as splitting that problem up into as many discrete parts as possible.
Again, echoing the sentiment, it’s got to be testable. It’s got to be modular. If you want to stream it, you want to communicate, use something like ZeroMQ. There’s a lot of queuing out there. There are ways to communicate among these different components, and that way you can test them. We are very, very diligent about metrics, as Dag mentioned. If you write code in our system or for our system, that thing better log and let the world know what the hell is going on with that component, pretty much at every step of the way, if it’s interrogated.
Because if we can’t — again, echoing the same sentiment — if we can’t look and see exactly what’s happening inside any of these components, we’re screwed. And you’re flying blind. And then we can start testing them, we can run through any regressions we need to, and we can see where the bottleneck is going to be. You’ll never catch all of them; sometimes you don’t catch any of them. Hey, you know what? We never thought someone was going to do twenty-five placements on one page with 50 models. Sorry.
Okay, we didn’t. Bad on us. But you can catch a lot of that stuff, or at least you can pinpoint it. Otherwise, again, I don’t see how anyone would even want to go to work in the morning without knowing how everything’s operating.
[38:09] Brian: So I’m going to add to your question, you said something that’s non-standard about scaling. One thing I’m happy we did at Aerospike early on, is that most cluster databases — for example, Oracle RAC and many other of the support-based systems — charge you per node. And what happens then is every single time then, the ops guys, who have a feeling for the number of nodes they want in order to have the resiliency they want, they get crowded out by business guys and it ends up being some long, involved conversation about, “Well, do I really need a four-node license,” “Why do I have to buy a six-node license,” all that stuff.
So one thing we did to make all these guys more successful, and the whole product more robust, was our business model. Plenty of technology, and all of the logging and stuff like that, but I wanted to say, “Hey look, let’s have your ops guys really figure out what they want in terms of reliability, the number of copies of the daily load, we’ll decouple that from the license terms, and as you start iterating, seeing your load go up, you need to add more servers, great. You’re not calling us.” We want you to do that, and feel comfortable with the amount of hardware you have without having to start thinking about license terms.