The Impact of Real-Time Big Data on Business
In the aftermath of Super Storm Sandy, this panel of CTOs from AppNexus, adMarketplace, Tapad, [x+1], and Aerospike discussed issues and best practices in architecting and operating real-time big data systems at ad:tech New York, 2012. The transcript of this event is broken out into the four topics outlined below, followed by introductions from each speaker and how real-time big data has made an impact on their business:
- Impact of Real-time Big Data on the Business
- Super Storm Sandy and 100% Uptime
- CTO Secrets to Scaling Systems & the Scoop on Hadoop
- CTOs tips – Developing a Scalable system, Problem Solving at Scale
Srini Srinivasan/Moderator: We have here a distinguished panel, from several real-time advertising companies who are key technical leaders in this area. The panel members really need no introduction, but I’m going to request them to give a brief introduction for themselves, starting with Mike from AppNexus.
Mike Nolet: I’m Mike Nolet, CTO and co-founder at AppNexus. What we do is sell technology to companies that helps them run real-time businesses. We work with real-time sellers and real-time buyers: ad exchanges, SSPs, DSPs, ad networks, all the various kinds of entities in the space. Our major customers include Microsoft Ad Exchange, for example, on the sell side, and Orange, now Interactive Media, which is Orange Telecom in Germany; on the buy side we have major companies like Collective, and also eBay, that use us to buy on a real-time basis.
Mike Yudin: Hello. My name is Mike Yudin, and I’m the CTO at adMarketplace. We’re an advertising technology company based right here in New York, and we operate the largest search network outside Google and Yahoo. We work with many of the largest internet brands, delivering performance pay-per-click traffic to advertisers worldwide, and we’re the eighth fastest-growing private company in New York. We have a lot of traffic, just like everyone on this panel, and we solve complex advertising problems in real time using the data that comes our way.
Dag Liodden: I’m Dag Liodden, I’m the CTO of Tapad. Tapad is a fairly young technology company, we help advertisers reach audiences across their multiple screens. If you’re a user in this day and age, you probably have tablets, you have iPhones, you have laptops, and what we try to do is we help advertisers target, and measure performance across multiple devices.
Pat DeAngelis: I’m Pat DeAngelis, the Chief Technology Officer for [x+1] Solutions. We’re a digital marketing hub, much more on the advertiser’s side. We enable cross-channel analytics and optimization across multiple touch points, typically for enterprise clients. Typically we do site optimization, and we have a real-time bidding DSP, if you will, using AppNexus’ sell side. We are actually a partner of AppNexus. Our clients include J.P. Morgan Chase, Capital One, Fingerhut, FedEx, Delta, and some of the largest brands on the Internet.
Brian Bulkowski: I’m Brian Bulkowski, from Aerospike, and I’m filling in for the gentleman from BlueKai. I’m a co-founder, along with Srini Srinivasan, and one of the inventors of our technology and database.
Srini Srinivasan/Moderator: Thank you. So let’s get started. Real-time big data has been used in several critical aspects of the advertising business. This panel is an attempt by us to bring together some of the foremost experts in the area so that other people can learn about this evolution and participate in this session. These companies in real-time advertising use big data in very interesting ways. For example, they use it at the edge server level, where they deal with millisecond SLAs day in and day out. They also deal with analyzing the data and then feeding in a new model on a periodic basis, maybe every hour, every day, and so on.
What I will start with, the initial question to the panelists, the question essentially is: What are one or two instances where real-time data processing has had a tremendous impact in a positive way on your business over the last year or two?
I’ll start it off addressing this to Mike from AppNexus.
Mike Nolet: On our side, I don’t know if it’s a positive impact so much as it’s something without which we can’t operate our business. When we do real-time buying, which is obviously something we do, we listen to every single real-time ad that’s available. We power a fair number of them, and that’s a whole bunch: on our peak day we saw, I think, 39.5 billion ads in one individual day. That’s about 600,000-odd requests per second. Every second, and I think right now is about peak time, we are bidding on 600,000 ads.
And the reality is, a lot of that buying is being driven by behavior, so we must have cookie data server-side. For us it’s not a choice. We have to have server-side data storage. We have to be able to do 600,000 read requests a second. We deliver about 150,000 to 170,000 ads a second, so every time we win an auction we write an update to our cookie store. For us, we don’t have a choice, right?
For us, we actually work with Aerospike, and we were – were we the first or second customer? First. So back when it was Brian and Srini in a coffee shop in San Francisco, and they were two guys and they said, “Yeah, we can do this for you.” We didn’t really have a choice, because we were working with another vendor that was just truly terrible, who will remain unnamed. And so we actually trusted them with this, and we’ve had a fantastic ride over the last three and a half years as we started at, I think, 10,000 qps and climbing to 600,000 qps.
We probably found every bug for them the first time around, so it’s all good for all of you if you want to work with it. For us, basically, a real-time key-value store has enabled our entire real-time buying business. It’s also a platform for the ecosystem. What’s really exciting is, and I don’t know if you guys know this, on AppNexus you can effectively use our key-value store, you can use our infrastructure and our data centers, and actually put your own data in there and use that. It’s really enabled us to provide fantastic technologies and offerings to our customers.
Mike Yudin: adMarketplace is a search syndication network. What that means is we look at each request for ads, of which we get about a billion a day, or 50,000 per second, in three dimensions. We look at the traffic source where an ad is going to be displayed, we look at the user who’s going to see the ad, and we look at targeting information such as the keyword that the user typed in the search box.
When a request gets into our system, we look at these three dimensions. We have to pull data in one or two milliseconds, on all of them, we have to know as much as possible of the user, as much as possible about the traffic source, and match this based on the keyword with all the ads that our system bid on. And then make a prediction, essentially, as to what would be a fair price per click for each ad that matched this request.
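The three-dimension lookup described above can be sketched as a toy function. Everything here, including the data, the lookup tables, and the scoring formula, is invented for illustration; the real system would predict a fair price per click with a trained model, not a one-line multiplication.

```python
# Toy sketch of the three-dimension lookup: user, traffic source,
# and keyword are each resolved, then matched ads are priced.

users = {"u1": {"recent_searches": ["test drive"]}}
sources = {"carblog.example": {"quality": 0.9}}
ads = {"car": [{"ad_id": "ad1", "max_cpc": 2.00}]}  # keyword -> ads bidding on it

def price_request(user_id, source, keyword):
    """Pull all three dimensions and predict a fair price per click."""
    user = users.get(user_id, {})
    src = sources.get(source, {"quality": 0.5})
    matched = ads.get(keyword, [])
    # Toy pricing: discount the advertiser's max CPC by source quality.
    return [(ad["ad_id"], round(ad["max_cpc"] * src["quality"], 2))
            for ad in matched]

print(price_request("u1", "carblog.example", "car"))  # [('ad1', 1.8)]
```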
And we return ads, for all 50,000 requests per second. Just like Mike said, there’s no question; the real-time data store isn’t some supplementary value, it’s a necessity. Without it, we wouldn’t be where we are. All the competitive advantage for a company like ours, for everyone on this panel, is in data and how you use that data. The more data you have, and the better access you have to that data in real time, the more intelligent decisions you can make and the more sophisticated offline processing you can do.
But the modeling all happens after the fact. At its core, this is a very simple business: you get a request, you look at the data, you return ads within ten milliseconds, and then you see what happens. That’s a constant, never-ending cycle.
[10:06] Well, a success story. Sure, very simple. One of our advertisers, Volvo cars, started an ad campaign with us. Their goal was to actually get people into a Volvo dealership to drive Volvos. So we started a very broad ad campaign. We’d never had Volvos advertised in our system before. So how do you know if a person who is about to see this ad will have any interest in driving a Volvo?
So what you do is look at the past history of these users. You see, has this person searched for things like “new car prices” or “test drive”? If you see that type of search in the user’s history in the last week or so, you can be fairly sure they’re in the market for a car, especially if the request comes from a relevant source, like a car blog or something like that. It doubles your chances.
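The intent check just described can be mocked up in a few lines. The phrase list, the base scores, and the function name are all hypothetical; only the shape, match against recent searches and double the estimate for a relevant source, comes from the description above.

```python
# Toy intent scoring: look for car-intent phrases in recent searches,
# then double the estimated chance when the traffic source is also
# car-related ("it doubles your chances").

INTENT_PHRASES = {"new car prices", "test drive"}

def car_intent_score(recent_searches, source_is_car_related):
    base = 0.1  # default chance with no intent signal (made-up number)
    if INTENT_PHRASES & set(recent_searches):
        base = 0.4  # user has shown car-shopping intent
    if source_is_car_related:
        base *= 2
    return base

print(car_intent_score(["test drive", "weather"], True))  # 0.8
```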
So we used all this data and successfully executed on the campaign, and I think Volvo reported that the ROI was actually higher than their search engine buy on Google. You can find this on our website. There are many stories like this, and none of them would be possible without data. Otherwise, you’d just be spraying and praying, as they say. So we don’t do that.
[11:40] Dag: So, our business is similar to what Mike described. We also do real-time buying. What makes our setup a little different is that we’re not just looking at individual devices; we’re also looking at how they’re connected to other devices. We want to use that for targeting, but also for attribution reporting: after something happens, say someone actually goes and buys something, we want to see which devices were involved in the chain that led up to that purchase.
What Aerospike has enabled us to do is to not do all these things after the fact. Traditionally in this space, you often do log shipping, and then you go through the logs afterwards; you sift through them, try to find patterns, and do offline batch processing. What Aerospike has enabled is that we can keep this entire data set, which we call the device graph, which has data about all the devices we see and also the connections between them, and we can actually query that data in real time.
If someone buys or signs up for Netflix (they’re not a customer of ours, by the way), if someone buys a service, we can go into our graph immediately, start with that device, and see which other devices are related to it. We can pull that in real time, instead of having to ship those logs through some big distributed file system like Hadoop and then run a heavy job that maybe comes up with a result 24 hours later.
We can do all these things in real time. We can call up a partner and say, a second ago someone signed up, and the cross-device impression history for this subset of the graph looks like this. So basically we have access to our entire data set, and any subset of it, with real-time response times at all times.
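The device-graph query Liodden describes, starting from one device and pulling everything connected to it, is essentially a graph traversal. Here is a minimal sketch, assuming the graph is stored as device-to-neighbors mappings; the data and the function name are invented, and a real deployment would hold the adjacency sets in a key-value store rather than a Python dict.

```python
# Toy device graph: each device maps to the set of devices it is
# connected to. A breadth-first traversal pulls the cross-device
# subset tied to one starting device.
from collections import deque

device_graph = {
    "phone1": {"tablet1", "laptop1"},
    "tablet1": {"phone1"},
    "laptop1": {"phone1", "tv1"},
    "tv1": {"laptop1"},
}

def related_devices(start):
    """Return every device reachable from `start` in the graph."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in device_graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}

print(sorted(related_devices("phone1")))  # ['laptop1', 'tablet1', 'tv1']
```

Because each hop is a single key lookup, the same traversal works against a networked store, which is what makes the "query the subset a second after signup" scenario feasible without a batch job.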
Pat: At [x+1] we also have a demand-side platform that does real-time bidding, and with a lot of the stories I’m hearing from my colleagues here, I can echo the same sentiment: if you can’t make a decision based on as much data as possible, in a few milliseconds, you’re pretty much toast in that business. Where [x+1] is a little different is that we also do a lot of onsite optimization. As an example: you go to the FedEx home page and you’re going to get a bunch of offers, so we’re pretty far down in the marketing funnel. You go to FedEx’s website home page, and the offers they provide are powered by [x+1], 100%.
So these are pretty high-value transactions. The way we do that is typically through a predictive model. We build predictive models continuously; we have processes that run and tweak these models, and these models execute in real time.
So one side of the equation for us is to make sure that we can execute those models, as quickly as possible. And the other side is to make sure we have that vector of data on that user so we can provide the best offer. Otherwise, we can’t optimize their experience, and we don’t get paid. So I would say what’s really helped, and the success story for Aerospike is, we can now onboard anywhere from 5 to 10,000 attributes for a user, put that up in our data store, and slide through those vectors with our model all day long.
We can chain models together, meaning we can execute a model, and if that results in fetching some more data and executing another model, we can certainly do that now as well. Data can come from offline: we can get a file from somewhere like Acxiom with 500 attributes, we load it up, and within the next few seconds, if that person goes anywhere on, let’s say, Chase’s home page, their mobile app, what have you, we have that data. We can execute the model and optimize. That just wasn’t possible at the same scale before we went to Aerospike.
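The model-chaining idea, run one model, and let its result trigger a fetch of more attributes and a second model, can be sketched like this. Both models, their formulas, and the attribute names are made up; only the control flow (score, conditionally fetch, re-score) follows the description above.

```python
# Toy model chaining: model_a scores a user and decides whether a
# deeper pass is worthwhile; if so, extra attributes are fetched and
# model_b re-scores with the richer vector.

def model_a(attrs):
    """First-pass score plus a flag asking for a deeper model pass."""
    score = attrs.get("visits", 0) * 0.1
    return score, score > 0.5

def model_b(attrs, extra):
    """Second-pass score using additional (e.g. offline-loaded) attributes."""
    return attrs.get("visits", 0) * 0.1 + extra.get("income_band", 0) * 0.05

def score_user(attrs, fetch_extra):
    score, need_more = model_a(attrs)
    if need_more:
        # Chain: pull the additional attributes, then run the next model.
        score = model_b(attrs, fetch_extra())
    return round(score, 2)

print(score_user({"visits": 8}, lambda: {"income_band": 4}))  # 1.0
```

Passing `fetch_extra` as a callable keeps the expensive second lookup lazy: it only fires when the first model asks for it, which matters when every request has a millisecond budget.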
Check back in the Big Data Zone tomorrow for "Super Storm Sandy and 100% Uptime."
Published at DZone with permission of Claire Umeda . See the original article here.