Flickr's decisions about the underlying architecture centered on four considerations; their push notifications had to have:
- Minimal impact on normal page serving times.
- Near-real-time delivery.
- Throughput of thousands of notifications per second.
- Highly available underlying services.
Event Generation and Targeting
The team found that their existing infrastructure was insufficient for push; the devs realized, though, that they could use Redis for event generation and targeting:
The event generation phase happens while processing the response to a user request. As such, we wanted to ensure that there was little to no impact on the response times as a result. To ensure this was the case, all we do here is a lightweight write into a global Redis queue... Everything after this initial Redis action is processed out of band by our deferred task system and has no impact on site performance.
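The pattern described — one cheap write on the request path, with everything else handled out of band — might look roughly like the sketch below (Python, with a plain in-memory list standing in for the global Redis queue; the event fields and function names are illustrative, not Flickr's):

```python
import json
import time

# In-memory stand-in for a global Redis list. In production this would be
# an LPUSH during the request, with deferred workers consuming via BRPOP.
event_queue = []

def enqueue_push_event(actor_id, event_type, subject_id):
    """Called while serving a user request: one lightweight write, nothing else."""
    event = {
        "actor": actor_id,
        "type": event_type,
        "subject": subject_id,
        "ts": time.time(),
    }
    event_queue.append(json.dumps(event))  # analogous to LPUSH onto the Redis queue

def drain_events():
    """Runs in the deferred task system, off the request path."""
    while event_queue:
        yield json.loads(event_queue.pop(0))  # analogous to BRPOP

# The request handler pays only for the single queue write:
enqueue_push_event(actor_id=42, event_type="photo_upload", subject_id=1001)
events = list(drain_events())
```

The point of the shape is that the request handler's cost is constant regardless of how many subscribers the event eventually fans out to.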
They built their message delivery using Node.js:
Flickr’s web-serving stack is PHP, and, up until now, everything described has been processed by PHP. Unfortunately, one area where PHP does not excel is long-lived processes or network connections, both of which make delivering push notifications in real time much easier. Because of this we decided to build the final phase, message delivery, as a separate endpoint in Node.js. So, the question arose: how do we get messages pending delivery from these PHP workers over to the Node.js endpoints that will actually deliver them? For this, we again turned to Redis, this time using its built in pub/sub functionality.
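The PHP-to-Node.js handoff via Redis pub/sub can be sketched as follows (Python, with a minimal in-memory broker standing in for Redis; the channel name and payload are hypothetical):

```python
from collections import defaultdict

class FakePubSub:
    """Tiny in-memory stand-in for Redis PUBLISH/SUBSCRIBE semantics."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        # Redis pub/sub is fire-and-forget: deliver to current subscribers only.
        for cb in self.subscribers[channel]:
            cb(message)

broker = FakePubSub()
delivered = []

# A Node.js delivery process would SUBSCRIBE and hold the long-lived
# connections to push endpoints; a callback stands in for it here.
broker.subscribe("push:deliver", lambda msg: delivered.append(msg))

# A PHP deferred-task worker would PUBLISH each pending message:
broker.publish("push:deliver", '{"user": 42, "note": "new photo"}')
```

Pub/sub suits this handoff because the PHP side never needs to know which Node.js process is listening; it only needs the channel name.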
Dividing the workload across individual Node.js processes, they published each message to a sharded Redis pub/sub channel.
Whenever we need to add more processing power to the cluster, we can just add more servers and more shards. This also makes it easy to pull hosts out of rotation for maintenance without impacting message delivery: we simply reconfigure the remaining processes to subscribe to additional channels to pick up the slack.
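One simple way to realize the sharding and rebalancing described above is to hash each recipient onto a fixed set of shard channels and spread those shards across the available hosts (a hypothetical sketch; Flickr's actual shard count and channel naming are not given in the source):

```python
NUM_SHARDS = 8

def shard_channel(user_id, num_shards=NUM_SHARDS):
    """Map a recipient onto one of the sharded pub/sub channels."""
    return f"push:deliver:{user_id % num_shards}"

def assign_shards(hosts, num_shards=NUM_SHARDS):
    """Spread shard subscriptions across the available delivery hosts.

    Removing a host from the list reassigns its shards to the rest,
    which is how the remaining processes 'pick up the slack'."""
    assignment = {h: [] for h in hosts}
    for shard in range(num_shards):
        assignment[hosts[shard % len(hosts)]].append(shard)
    return assignment

# With three hosts, the eight shards are spread 3/3/2:
full = assign_shards(["node1", "node2", "node3"])
# Pulling node2 for maintenance just redistributes its shards:
reduced = assign_shards(["node1", "node3"])
```

Because publishers compute the channel from the recipient alone, delivery hosts can be added or removed purely by changing which channels each process subscribes to.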
The team wanted to design their push notification system for high availability, which was particularly challenging because they wanted to be able to lose processes, servers, or even data centers without requiring a human operator to intervene.
We liked the idea of having one host serve as a backup for another, but we didn’t like having to coordinate the interaction between so many moving pieces. To solve this issue we went with a convention based approach. Instead of each host having to maintain a list of its partners, we just use Redis to maintain a global lock.
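A convention-based failover lock of this kind is commonly built on Redis's atomic SET with the NX flag and an expiry, so a dead primary's lock simply lapses. The sketch below simulates that semantic in memory (the key name, host names, and TTL are illustrative assumptions, not from the source):

```python
import time

class FakeRedisLock:
    """In-memory simulation of Redis `SET key value NX EX ttl` used as a lock."""
    def __init__(self):
        self.store = {}  # key -> (owner, expires_at)

    def acquire(self, key, owner, ttl):
        """Succeed only if the key is unset or its TTL has lapsed (NX + EX)."""
        now = time.time()
        current = self.store.get(key)
        if current is None or current[1] <= now:
            self.store[key] = (owner, now + ttl)
            return True
        return False

locks = FakeRedisLock()

# The primary host grabs the lock; the backup's attempt fails...
assert locks.acquire("push:leader:shard0", "hostA", ttl=5)
assert not locks.acquire("push:leader:shard0", "hostB", ttl=5)

# ...until the primary dies and stops refreshing, letting the lock expire,
# at which point the backup takes over with no human coordination.
locks.store["push:leader:shard0"] = ("hostA", time.time() - 1)
assert locks.acquire("push:leader:shard0", "hostB", ttl=5)
```

The convention does the coordination: every host runs the same acquire loop against the same key, so no host needs a configured list of its partners.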
Read more on Flickr's blog, including code excerpts, here.