Case Study - How Last.fm Uses HornetQ for Their Streaming Infrastructure
This case study describes how Last.fm uses HornetQ to improve the performance and availability of its streaming infrastructure.
What Is Last.fm?
Last.fm is an online music service that tracks the music people listen to and generates recommendations and custom online radio stations based on this information. It allows users to track the songs they listen to (by scrobbling them) from Internet radio stations, music players, or portable devices. The scrobbles are transferred to Last.fm and displayed on the user’s profile page.
Every scrobbled track tells Last.fm something about what the user likes. Last.fm can then connect the user to other people who like the same song, artist, or genre, and recommend other songs from their music collections.
More than 40 million unique users visit Last.fm each month and view 500 million pages.
Since the launch of the service, there have been more than 40 billion scrobbles. Today, there are 40 million scrobbles a day (up to 800 per second at peak).
Last.fm’s infrastructure uses JMS, the Java standard for messaging services, to control its streaming service.
Last.fm recently migrated to HornetQ to improve its messaging service and ensure that it meets the quality of service expected by its users.
What Is HornetQ?
HornetQ is a community project from JBoss, the middleware division of Red Hat. HornetQ is the default messaging service for JBoss Application Server 6 and it can also be used as a standalone JMS server or embedded in any Java application. HornetQ is an Open Source (Apache-licensed) project to build a multi-protocol, embeddable, clustered messaging system with very high performance and availability.
Having run into issues with its existing messaging server, Last.fm was interested in integrating HornetQ into its infrastructure to improve performance and availability while keeping hardware resources under control.
Last.fm backends are written in both C++ and Java, the latter being used for its streaming infrastructure. The Java streaming backends are composed of a web application deployed on Tomcat (the streamer manager) and standalone Java applications (the streamers).
The streamers are designed to be simple, with as few dependencies as possible and no container at all. Architecturally, they are independent of the rest of the Last.fm infrastructure. They use JMS to communicate among themselves and with the other parts of the infrastructure in a loosely coupled fashion.
For its streaming infrastructure, Last.fm uses JMS in three different use cases.
Streamer control messages are used when the streamers are upgraded. The streamers send control messages on the /queue/ControlQueue JMS queue when they are started up or shut down.
The streamer manager consumes these messages and starts or stops sending traffic to the streamers accordingly.
This generates a low volume of messages (a few messages a month).
Every time a user finishes listening to a song, a radio control message is sent from the streamer to the /queue/radioListenQueue JMS queue. On receiving these messages, the streamer manager updates the database. This makes it possible to limit users without a subscription to a given number of free listens.
This generates a high volume of messages (one every time a user finishes listening to a song).
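The free-listen limit described above can be pictured with a small sketch. The case study does not state the actual limit, the database schema, or the message format, so the class name `FreeListenTracker`, its in-memory map, and the limit passed to the constructor are all hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the streamer manager's database update.
// The real system persists this in a database; a map is used here for brevity.
class FreeListenTracker {
    private final int freeListenLimit; // assumed value, not given in the case study
    private final Map<String, Integer> listens = new HashMap<>();

    FreeListenTracker(int freeListenLimit) {
        this.freeListenLimit = freeListenLimit;
    }

    // Called once for each radio control message consumed from the queue.
    void onRadioListen(String userId) {
        listens.merge(userId, 1, Integer::sum);
    }

    // Subscribers are never limited; other users may listen until the limit is reached.
    boolean mayListen(String userId, boolean hasSubscription) {
        return hasSubscription || listens.getOrDefault(userId, 0) < freeListenLimit;
    }
}

class FreeListenDemo {
    public static void main(String[] args) {
        FreeListenTracker tracker = new FreeListenTracker(2);
        tracker.onRadioListen("david");
        tracker.onRadioListen("david");
        System.out.println(tracker.mayListen("david", false)); // prints "false"
        System.out.println(tracker.mayListen("david", true));  // prints "true"
    }
}
```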
Each streamer keeps a list of the users connected to it. When a user starts a stream, the streamer stores the user’s id and sends a connection control message to the /topic/connectionControl JMS topic, to which all the other streamers are subscribed, so every streamer is informed of the new connection. If the user then attempts to open a second stream, the second streamer puts the id on the topic; the first streamer picks it up, sees that the user is already connected to it, and kicks them off. This effectively limits users to one stream at a time.
The streamers use a publish/subscribe topology so that each streamer receives connection control messages sent by all of them.
In the diagram above, David is already connected to streamer #2 and then tries to connect to another streamer.
- Streamer #1 gets the new connection request from David
- Streamer #1 notifies the /topic/connectionControl JMS topic
- All streamers (which are subscribed to the topic) are informed that David requested a new connection
- Streamer #2 notices that it was already serving David and drops the connection.
At the same time, streamer #1 starts serving David.
Like the radio scenario, this generates a high volume of messages.
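The connection control flow above can be sketched without a JMS server. This is only an illustration: `ConnectionControlTopic` is a hypothetical in-memory stand-in for the /topic/connectionControl JMS topic, and the `Streamer` class and its method names are assumptions, not Last.fm’s code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for the /topic/connectionControl JMS topic.
class ConnectionControlTopic {
    private final List<Streamer> subscribers = new ArrayList<>();

    void subscribe(Streamer streamer) {
        subscribers.add(streamer);
    }

    // Publish/subscribe: every subscribed streamer receives the message.
    void publish(String userId, Streamer origin) {
        for (Streamer streamer : subscribers) {
            streamer.onConnectionControl(userId, origin);
        }
    }
}

class Streamer {
    private final Set<String> connectedUsers = new HashSet<>();
    private final ConnectionControlTopic topic;

    Streamer(ConnectionControlTopic topic) {
        this.topic = topic;
        topic.subscribe(this);
    }

    // A user opens a stream here: announce it on the topic, then serve it.
    void connect(String userId) {
        topic.publish(userId, this);
        connectedUsers.add(userId);
    }

    // Another streamer announced a connection for this user: drop ours.
    void onConnectionControl(String userId, Streamer origin) {
        if (origin != this) {
            connectedUsers.remove(userId);
        }
    }

    boolean isServing(String userId) {
        return connectedUsers.contains(userId);
    }
}

class SingleStreamDemo {
    public static void main(String[] args) {
        ConnectionControlTopic topic = new ConnectionControlTopic();
        Streamer streamer1 = new Streamer(topic);
        Streamer streamer2 = new Streamer(topic);
        streamer2.connect("david"); // David is already on streamer #2
        streamer1.connect("david"); // David opens a second stream on streamer #1
        System.out.println(streamer1.isServing("david")); // prints "true"
        System.out.println(streamer2.isServing("david")); // prints "false"
    }
}
```

In this sketch, a lost message would simply leave both streamers serving the same user, which matches the preference for availability over strict enforcement.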
The number of messages is directly proportional to the amount of listening traffic, as each new connection request results in a message to the topic. As Last.fm increases the number of physical streamer boxes, it also increases the number of publishers and subscribers on the topic (each streamer acting in both roles). The JMS message volume follows the daily, weekly, and monthly streaming traffic peaks and will also increase over time as Last.fm expands its streaming capacity. JMS scalability is therefore a key requirement for Last.fm. By switching to HornetQ, they expect its performance and scalability to meet their needs as their streaming capacity grows.
Last.fm does not want to interrupt a normal user’s streaming experience under any circumstances, and would rather have messages stop and temporarily allow multiple streams for one user than have a JMS problem interfere with streaming in any way.
Last.fm Requirements and Settings
Last.fm’s main requirements for their streaming infrastructure are maximizing uptime and keeping hardware resources (memory, CPU, and disk usage) under control. In the worst case, they would rather lose messages than have anything crash or fail to serve user requests.
To increase uptime and keep resource usage stable, they disabled message persistence in the HornetQ server configuration (the persistence-enabled setting). This ensures that HornetQ does not write any messages to disk.
Non-persistent messages will not survive a crash of the HornetQ server or its clients. In Last.fm’s case, this is acceptable as their control messages are transitory by nature. For example, a connection control message for a user supersedes any previous message for the same user, so it is not necessary to persist this kind of control message.
They also use a DROP policy: if the total size of the messages held for a given JMS destination exceeds a configured threshold, HornetQ simply drops new messages sent to this address rather than congesting the server.
With Last.fm’s setting, HornetQ drops messages sent by JMS producers once the total size of all messages in memory exceeds 100 MiB for any address (matched by the wildcard #).
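To make the DROP behaviour concrete, here is a toy model of a size-bounded address. It is not HornetQ code, and HornetQ’s exact memory accounting may differ; the class `DroppingAddress` and its methods are hypothetical and only track message sizes. The 100 MiB threshold from the text corresponds to 104,857,600 bytes.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of an address with a DROP full-policy (not HornetQ code).
// Only message sizes are tracked; no payload is stored.
class DroppingAddress {
    private final long maxSizeBytes;
    private long currentSizeBytes = 0;
    private long droppedCount = 0;
    private final Queue<Integer> messageSizes = new ArrayDeque<>();

    DroppingAddress(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    // Returns true if the message is accepted, false if it is silently dropped.
    boolean send(int messageSizeBytes) {
        if (currentSizeBytes + messageSizeBytes > maxSizeBytes) {
            droppedCount++;
            return false;
        }
        messageSizes.add(messageSizeBytes);
        currentSizeBytes += messageSizeBytes;
        return true;
    }

    // Consuming a message frees memory, so later sends are accepted again.
    // Returns the size of the consumed message, or -1 if the address is empty.
    int receive() {
        Integer size = messageSizes.poll();
        if (size == null) {
            return -1;
        }
        currentSizeBytes -= size;
        return size;
    }

    long droppedCount() {
        return droppedCount;
    }

    long sizeBytes() {
        return currentSizeBytes;
    }
}

class DropPolicyDemo {
    public static void main(String[] args) {
        DroppingAddress address = new DroppingAddress(100L * 1024 * 1024); // 100 MiB
        int sixtyMiB = 60 * 1024 * 1024;
        boolean first = address.send(sixtyMiB);  // accepted: 60 MiB in memory
        boolean second = address.send(sixtyMiB); // dropped: would reach 120 MiB
        address.receive();                       // a consumer frees 60 MiB
        boolean third = address.send(sixtyMiB);  // accepted again
        System.out.println(first + " " + second + " " + third); // prints "true false true"
    }
}
```

The producer is never blocked and never sees an error in this model, which is the trade-off the DROP policy makes: bounded memory in exchange for lost messages.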
Other policies supported by HornetQ are:
- PAGE - after a configurable threshold, HornetQ writes messages to disk (in page files) instead of filling up memory and eventually running out of it.
- BLOCK - HornetQ producer clients are blocked until there is enough memory on the server to handle and deliver their messages.
Another optimization was to use pre-acknowledgement to increase performance. Pre-acknowledged messages are acknowledged by the HornetQ server before they are sent to the client consumer.
Pre-acknowledgement prevents reliable delivery (as the server does not know whether the consumers actually received the messages) but increases performance, as it avoids an additional network round trip from the client to the HornetQ server to send the acknowledgement.
HornetQ’s pre-acknowledgement mode is available in addition to the modes provided by JMS (auto-ack, dups-ok-ack, and client-ack).
Last.fm Architecture Refactoring
Before using HornetQ, Last.fm ran another open source JMS messaging system, but had threading issues with it.
At first, Last.fm was using a thread pool. When a stream request came in, a thread was taken from this pool to service the request. The thread performed authentication and other actions, located the file to stream, and so on. At the end, the thread sent JMS messages. With this architecture, Last.fm had hundreds or thousands of threads all sending messages.
When the previous messaging system misbehaved, as it did from time to time, these threads blocked and brought the whole system to its knees, as they were never meant to block.
Last.fm decided to switch to HornetQ to ensure that their resources would never be blocked. In parallel, they refactored and simplified their architecture. Request threads now put messages on an internal, bounded, in-memory queue, and a single dedicated JMS thread pulls messages off that queue and sends them to the JMS server. In the unlikely case that sending a JMS message blocks, only that JMS thread blocks. As the in-memory queue is bounded, new messages are dropped instead of taking up more RAM and making the problem worse.
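A minimal sketch of this design, assuming a `java.util.concurrent.ArrayBlockingQueue` as the bounded queue, could look like the following. The class name `BoundedSender`, the `transport` callback (which stands in for the real JMS producer), and the thread name are hypothetical; Last.fm’s actual implementation is not shown in the case study.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Request threads call enqueue(); one dedicated thread drains the queue.
class BoundedSender {
    private final BlockingQueue<String> queue;
    private final Thread senderThread;

    // transport is a placeholder for the code that sends to the JMS server.
    BoundedSender(int capacity, Consumer<String> transport) {
        BlockingQueue<String> q = new ArrayBlockingQueue<>(capacity);
        this.queue = q;
        this.senderThread = new Thread(() -> {
            try {
                while (true) {
                    // If the transport blocks, only this thread is affected.
                    transport.accept(q.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "jms-sender");
        this.senderThread.setDaemon(true);
    }

    void start() {
        senderThread.start();
    }

    // Never blocks: returns false (message dropped) when the queue is full.
    boolean enqueue(String message) {
        return queue.offer(message);
    }

    int pending() {
        return queue.size();
    }
}

class BoundedSenderDemo {
    public static void main(String[] args) throws InterruptedException {
        BoundedSender sender = new BoundedSender(2, m -> System.out.println("sent " + m));
        boolean a = sender.enqueue("a");
        boolean b = sender.enqueue("b");
        boolean c = sender.enqueue("c"); // queue is full, so this one is dropped
        System.out.println(a + " " + b + " " + c); // prints "true true false"
        sender.start();
        Thread.sleep(200); // give the sender thread time to drain the queue
    }
}
```

The key choice is `offer` rather than `put`: `put` would make request threads wait when the queue is full, which is exactly the blocking behaviour the refactoring was meant to remove.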
While switching to HornetQ, Last.fm helped the HornetQ team identify bugs that were causing blocking issues. These bugs were fixed prior to the HornetQ 2.0.0 release in January 2010. Since the switch to HornetQ 2.0.0 and the changes to their architecture, Last.fm’s messaging infrastructure has been up for months without any performance issues.
Last.fm’s main concern in switching to a new messaging system was uptime. Performance was important, but it was essential that the messaging server did not crash or run out of memory or disk space. Last.fm was willing to throw away messages in order to stay up: availability was more important than reliable delivery.
Last.fm’s HornetQ server is hosted on a 2GHz dual-core AMD machine with 8GB of RAM and runs Debian Linux. Other services are hosted on this server, but overall memory usage is flat at around 1GB of RAM, and CPU usage follows a daily trough/peak cycle from about 4% to 6%.
Both memory and CPU usage have been stable over the last month without any variation. Their usage is predictable and the CPU peaks match Last.fm usage peaks.
The number of streaming-related messages handled by HornetQ varies depending on the day of week, time of month and other factors. At peak, HornetQ handles 20000 messages per minute (around 330-350 per second).
After facing stability and downtime issues with their previous messaging system, Last.fm decided to switch to HornetQ to make better use of, and gain more control over, the JMS resources of their streaming infrastructure.
Since this switch, and the refactoring of their own code to clean up its use of JMS resources, hardware resources have been kept under control and uptime has been maximized, while performance and availability continue to follow Last.fm’s usage. With HornetQ, Last.fm is now prepared to scale its messaging system to follow its streaming capacity and, ultimately, its business growth.