Real-Time Presence Platform System Design
This article explores the software architecture of real-time user online status indicators. The heartbeat signal checks the client's status in real time.
The system design of the Presence Platform depends on the design of the Real-Time Platform. I highly recommend reading the related article to improve your system design skills.
What Is the Real-Time Presence Platform?
The presence status is a key feature that makes the real-time platform engaging and interactive for the users (clients). In layman’s terms, the presence status shows whether a particular client is currently online or offline. The presence status is popular on real-time messaging applications and social networking platforms such as LinkedIn, Facebook, and Slack. It represents the availability of the client for communication on a chat application or a social network.
Figure 1: Online presence status; Offline presence status
Usually, a green circle is shown adjacent to the profile image of the client to indicate that the client is online. The presence status can also show the last active timestamp of the client. The presence status feature offers enormous value on multiple platforms by supporting the following use cases:
- Enabling accurate virtual waiting rooms for efficient staffing and scheduling in telemedicine
- Logging and viewing real-time activity in a logistics application
- Identifying the online users in a chat application or a multi-player game
- Enabling monitoring of Internet of Things (IoT) devices
The following terminology might be helpful for you:
- Node: a server that provides functionality to other services
- Data replication: a technique of storing multiple copies of the same data on different nodes to improve the availability and durability of the system
- High availability: the ability of a service to remain reachable and not lose data even when a failure occurs
- Connections: list of friends or contacts of a particular client
How Does the Real-Time Presence Platform Work?
The real-time presence platform leverages the heartbeat signal to check the status of the client in real-time. The presence status is broadcast to the clients using the persistent server-sent events (SSE) connections on the real-time platform.
Questions to Ask the Interviewer
- What are the primary use cases of the system? Clients can view the presence status of their friends (connections) in real time.
- Are the clients distributed across the globe?
- What is the total count of clients on the platform? 700 million.
- What is the average number of concurrent online clients? 100 million.
- How many times does the presence status of a client change on average during the day?
- What is the anticipated read:write ratio of a presence status change? 10:1.
- Should the client be able to see the list of all online connections? Yes, the connections should be grouped into lists, and the online connections should be displayed at the top of the list.
Functional Requirements
- Display the real-time presence status of a client
- Display the last active timestamp of an offline client
- The connections should be able to see the presence status of the client
- The client should be able to view the list of online clients (connections)
Non-Functional Requirements
- High availability
- Low latency
Real-Time Presence Platform Data Storage
The timestamp of the latest heartbeat signal received must be stored in the presence database to identify the last active timestamp of the client. A relational database with support for transactions and atomicity, consistency, isolation, and durability (ACID) compliance can be overkill for keeping presence status data. A NoSQL database such as Apache Cassandra offers high write throughput at the expense of slower read operations due to the usage of an LSM-based storage engine. Hence, Cassandra is a poor fit for storing presence status data, which must be served with both high read and high write throughput.
Figure 2: Data schema for user presence status
A distributed key-value store that can support both extremely high read and extremely high write throughput must be used for the real-time presence database. Redis is a fast, open-source, in-memory key-value data store that offers high-throughput read and write operations, so Redis can be provisioned as the presence database. The hash data type in Redis will efficiently store the presence status of a client. The hash key will be the user ID, and the value will be the last active timestamp.
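The layout described above can be sketched in memory before committing to a Redis deployment. This is a minimal illustration, not the author's implementation: the class name `PresenceStore`, the hash name `presence`, and the method names are assumptions made for this example, and the comments only point to the Redis commands that play a comparable role.

```python
import time
from typing import Dict, Optional


class PresenceStore:
    """In-memory stand-in for a Redis hash: field = user ID, value = last active timestamp."""

    def __init__(self) -> None:
        self._last_active: Dict[str, float] = {}

    def record_heartbeat(self, user_id: str, timestamp: Optional[float] = None) -> None:
        # Comparable to `HSET presence <user_id> <timestamp>` in Redis.
        self._last_active[user_id] = time.time() if timestamp is None else timestamp

    def last_active(self, user_id: str) -> Optional[float]:
        # Comparable to `HGET presence <user_id>`; None means no heartbeat was ever recorded.
        return self._last_active.get(user_id)
```

Each heartbeat overwrites the previous value, so the store holds exactly one timestamp per client regardless of how many heartbeats arrive.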
Real-Time Presence Platform High-Level Design
A trivial approach to implementing the presence platform is to take advantage of clickstream events in the system. The presence service can track the client status through clickstream events and change the presence status to offline when the server has not received any clickstream events from the client for a defined time threshold. The downside of this approach is that clickstream events might not be available on every system. Besides, the change in the client’s presence status will not be accurate due to the dependency on clickstream events.
Prototyping the Presence Platform With Redis Sets
The sets data type in Redis is an unordered collection of unique members with no duplicates. The sets data type can be used to store the presence status of the clients at the expense of not showing the last active timestamp of the client. The user IDs of the connections of a particular client can be stored in a set named connections, and the user IDs of every online user on the platform can be stored in a set named online.
The sets data type in Redis supports intersection operation between multiple sets. The intersection operation between the set online and set connections can be performed to identify the list of connections of a particular client who is currently online.
The set operations, such as adding, removing, or checking whether an item is a set member, take constant time, O(1). The time complexity of the set intersection is O(n*m), where n is the cardinality of the smallest set, and m is the number of sets. Alternatively, a Bloom filter or cuckoo filter can reduce memory usage at the expense of approximate results.
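The intersection idea can be illustrated with plain Python sets, which behave like the Redis sets described here for this purpose. The function names are hypothetical, and the Redis commands in the comments are mentioned only as the closest counterparts.

```python
from typing import Set


def online_connections(online: Set[str], connections: Set[str]) -> Set[str]:
    # Comparable to `SINTER online connections` in Redis:
    # the connections of a client who are currently online.
    return online & connections


def is_online(online: Set[str], user_id: str) -> bool:
    # Comparable to `SISMEMBER online <user_id>`, a constant-time membership check.
    return user_id in online
```

For example, with `online = {"alice", "bob", "carol"}` and `connections = {"bob", "dave"}`, only `bob` is an online connection.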
Figure 3: Key expiration pattern with sliding window
The client-side failures or jittery client connections can be handled through the key expiration pattern. A sliding window of sets with time-scoped keys can be used to implement the key expiration pattern. In layman’s terms, a new set is created periodically to keep track of online clients. In addition, two sets named current and next with distinct expiry times are kept simultaneously in the Redis server.
When a client changes the status to online, the user ID of the particular client is added to both the current set and the next set. The presence status of the client is identified by querying only the current set. The current set is eventually removed on expiry as time elapses. The primary benefit of the sliding window key expiration is that the system is trivial to implement. The limitation of the current prototype is that the status of a client who gets disconnected abruptly is not reflected in real time because the change in presence status depends on the sliding window length.
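A rough sketch of the sliding window follows, assuming each window is identified by an integer index derived from the clock and that a client is re-added on every online signal. The class name, the 60-second default, and the explicit `now` argument are illustrative choices; in Redis the sets would be separate keys with their own expiry times.

```python
from typing import Dict, Set


class SlidingWindowPresence:
    """Keeps one set of online user IDs per time window (time-scoped keys)."""

    def __init__(self, window_seconds: int = 60) -> None:
        self.window_seconds = window_seconds
        self._sets: Dict[int, Set[str]] = {}

    def _window(self, now: float) -> int:
        return int(now // self.window_seconds)

    def _expire(self, current: int) -> None:
        # Stand-in for key expiry: drop every set older than the current window.
        for window in [w for w in self._sets if w < current]:
            del self._sets[window]

    def mark_online(self, user_id: str, now: float) -> None:
        current = self._window(now)
        # Add the client to both the current set and the next set.
        for window in (current, current + 1):
            self._sets.setdefault(window, set()).add(user_id)
        self._expire(current)

    def is_online(self, user_id: str, now: float) -> bool:
        current = self._window(now)
        self._expire(current)
        # Only the current set is queried.
        return user_id in self._sets.get(current, set())
```

A client marked online at second 10 stays online through the next window and appears offline from second 120 onward, which shows why detection latency is tied to the window length.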
Figure 4: Presence platform with Redis sets
The Redis server can use keyspace notifications to notify the clients (subscribers) connected to the real-time platform when the presence status changes. A server can subscribe to data change events in Redis in near real time through keyspace notifications. However, key expiration in Redis might not occur in real time because Redis removes expired keys either lazily, when a key is accessed, or through a periodic background cleanup process. The expired notification is triggered only when Redis actually removes the key, not at the moment the time to live reaches zero. The limitations of keyspace notifications for detecting changes in presence status are the following:
- Redis keyspace notifications consume CPU power
- key expiration by Redis is not real-time
- subscribing to keyspace notifications on the Redis cluster is relatively complex
The heartbeat signal updates the expiry time of a key in the Redis set. The real-time platform can broadcast the change in the status of a particular client (publisher) to subscribers over SSE. In conclusion, do not use the Redis sets approach for implementing the presence platform.
Presence Platform With Pub-Sub Server
The publisher (client) can broadcast the presence status to multiple subscribers through a publish-subscribe (pub-sub) server. The subscriber who was disconnected during the broadcast operation should not see the status history of a publisher when the subscriber reconnects later to the platform.
Figure 5: Presence platform with pub-sub server
The message bus in the pub-sub server should be configured in fire-and-forget (ephemeral) mode to ensure that the presence status history is not stored to reduce storage needs. There is a risk with the fire-and-forget mode that some subscribers might not receive the changes in client status. Redis pub-sub or Apache Kafka can be configured as the message bus. The limitations of using the pub-sub server in the ephemeral mode are the following:
- No guaranteed at-least-once message delivery
- Degraded latency when consumers use a pull-based model, as in Apache Kafka
- The operational complexity of message bus such as Apache Kafka is relatively high
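The fire-and-forget behavior can be made concrete with a tiny in-memory bus. This is only a sketch under the assumption that messages are delivered to whoever is subscribed at publish time and then discarded; `EphemeralBus` and its methods are hypothetical names, not the API of Redis pub-sub or Kafka.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict


class EphemeralBus:
    """Delivers each message to current subscribers only; nothing is stored."""

    def __init__(self) -> None:
        self._subs: DefaultDict[str, Dict[str, Callable[[str], None]]] = defaultdict(dict)

    def subscribe(self, channel: str, subscriber_id: str, handler: Callable[[str], None]) -> None:
        self._subs[channel][subscriber_id] = handler

    def unsubscribe(self, channel: str, subscriber_id: str) -> None:
        self._subs[channel].pop(subscriber_id, None)

    def publish(self, channel: str, message: str) -> int:
        # Copy the handlers so a handler may unsubscribe while being called.
        handlers = list(self._subs[channel].values())
        for handler in handlers:
            handler(message)
        # The number of receivers is returned; the message itself is forgotten.
        return len(handlers)
```

A subscriber that was disconnected during a publish never sees that message after reconnecting, which matches the requirement of not replaying status history and also illustrates the delivery risk listed above.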
In summary, do not use the pub-sub approach for implementing the presence platform.
An Abstract Presence Platform
The real-time platform is a critical component for the implementation of the presence feature. Both the publisher and the subscriber maintain a persistent SSE connection with the real-time platform. The bandwidth usage to fan out the client’s presence status can be reduced by reusing the existing SSE connection.
Simply put, the real-time platform is a publish-subscribe service for streaming the client’s presence status to the subscribers over the persistent SSE connection. The presence platform should track the following events to identify any change in the status of the client:
- Online: published when a client connects to the platform
- Offline: published when a client disconnects from the platform
- Timeout: published when a client is disconnected from the platform for over a minute
Figure 6: Presence platform; High-level design
The presence status of a client connected to the real-time platform must be shown as online. The client should also subscribe to the real-time platform for notifications on the status of the client’s connections (friends). At a very high level, the following operations are executed by the presence platform:
- The subscriber (client) queries the presence service to fetch the status of a publisher over the HTTP GET method
- The presence service queries the presence database to identify the presence status
- The client subscribes to the status of a publisher through the real-time platform and creates an SSE connection
- The publisher comes online and makes an SSE connection with the real-time platform
- The real-time platform sends a heartbeat signal to the presence service over UDP
- The presence service queries the presence database to check if the publisher just came online
- The presence service publishes an online event to the real-time platform over the HTTP PUT method
- The real-time platform broadcasts the change in the presence status of the publisher to subscribers over SSE
The presence service should return the last active timestamp of an offline publisher by querying the presence database. In synopsis, the current architecture can be used to implement a real-time presence platform.
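The read path, where a subscriber asks for the status of a publisher, might look like the following sketch. It assumes each database record keeps the last active timestamp together with an expiry time and that an expired record is still readable for its timestamp; `PresenceRecord`, `get_presence`, and the response fields are names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PresenceRecord:
    last_active: float  # timestamp of the latest heartbeat
    expires_at: float   # moment after which the publisher counts as offline


def get_presence(db: Dict[str, PresenceRecord], user_id: str, now: float) -> dict:
    """Build the body a presence service could return for an HTTP GET status query."""
    record: Optional[PresenceRecord] = db.get(user_id)
    if record is None:
        # No heartbeat was ever recorded for this publisher.
        return {"user_id": user_id, "status": "offline", "last_active": None}
    status = "online" if now < record.expires_at else "offline"
    return {"user_id": user_id, "status": status, "last_active": record.last_active}
```

An offline publisher still carries a `last_active` value, which is what lets the client display the last active timestamp.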
Design Deep Dive
How Does the Presence Platform Identify Whether a User Is Online?
The real-time platform can be leveraged by the presence platform for streaming the change in status of a particular client to the subscribers in real time. The subscriber establishes an SSE connection with the real-time platform and also subscribes to any change in the status of the connections (clients). The heartbeat signal is used by the presence platform to detect the current status of a client (publisher). The presence platform publishes an online event to the real-time platform for notifying the subscribers when the client status changes to online. The client who just came online can query the presence platform through the Representational State Transfer (REST) API to check the presence status of a particular client.
Figure 7: Presence platform checking whether a user is online
The following operations are executed by the presence platform for notifying the subscribers when a client changes the status to online:
- The publisher (client) creates an SSE connection with the real-time platform
- The real-time platform sends a heartbeat signal to the presence service over UDP
- The presence service queries the presence database to check whether an unexpired record for the publisher exists in the database.
- The presence service infers that the publisher just changed the status to online if there is no database record or if the previous record has expired.
- The presence platform publishes an online event to the real-time platform over the HTTP PUT method.
- The real-time platform broadcasts the change in the presence status to subscribers over SSE.
- The presence service subsequently inserts a record in the presence database with an expiry time slightly later than the expected arrival time of the next heartbeat.
Figure 8: Flowchart; Presence platform processing a heartbeat signal
The presence service only updates the last active timestamp of the publisher in the presence database when an unexpired record already exists in the presence database because there was no change in the status of the publisher.
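The heartbeat handling described in this section can be sketched as a single function. The 30-second heartbeat interval, the 5-second slack, the tuple layout of a record, and the `publish` callback are all assumptions made for illustration; the article does not prescribe these values.

```python
from typing import Callable, Dict, Tuple

HEARTBEAT_INTERVAL = 30.0  # assumed seconds between heartbeats
EXPIRY_SLACK = 5.0         # assumed slack so the record outlives the next heartbeat slightly

# user ID -> (last active timestamp, expiry timestamp)
Records = Dict[str, Tuple[float, float]]


def process_heartbeat(
    db: Records,
    user_id: str,
    heartbeat_ts: float,
    publish: Callable[[dict], None],
) -> bool:
    """Handle one heartbeat; return True if the publisher just came online."""
    record = db.get(user_id)
    # No record, or an expired one, means the status changed to online.
    came_online = record is None or record[1] <= heartbeat_ts
    if came_online:
        publish({"event": "online", "user_id": user_id, "timestamp": heartbeat_ts})
    # In both branches the last active timestamp and the expiry are refreshed.
    db[user_id] = (heartbeat_ts, heartbeat_ts + HEARTBEAT_INTERVAL + EXPIRY_SLACK)
    return came_online
```

Only the first heartbeat after a gap produces an online event; later heartbeats within the expiry window just refresh the record.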
How Does the Presence Platform Identify When a User Goes Offline?
When the publisher doesn’t reconnect to the real-time platform within a defined time interval, the presence platform should detect the absence of the heartbeat signals. The presence platform will subsequently publish an offline event over HTTP to the real-time platform for broadcasting the change in presence status to all the subscribers. The offline event must include the last active timestamp of the publisher.
Figure 9: Presence platform checking whether a user is offline
The web browser can trigger an unload event to change the presence status when the publisher closes the application. A delayed trigger can be configured on the presence service to identify the absence of a heartbeat signal. The delayed trigger improves the accuracy of detecting status changes because it does not depend on the client sending a final message. The delayed trigger must schedule a timer that gets executed when the time interval for the successive heartbeat elapses. When the delayed trigger is executed, it should query the presence database to check whether the database record for a specific publisher has expired. The following operations are executed by the presence platform for notifying the subscribers when a client changes the status to offline:
- The delayed trigger queries the presence database to check whether the database record of the publisher has expired
- The presence service publishes an offline event to the real-time platform over HTTP when the database record has expired.
- The real-time platform broadcasts the change in status along with the last active timestamp to the subscribers over SSE.
Figure 10: Flowchart; Presence platform using a delayed trigger
The presence service creates a delayed trigger if the trigger doesn’t already exist when the heartbeat is processed. The delayed trigger should be reset in case the trigger already exists.
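One way to picture the create-or-reset behavior is a table of deadlines that is checked as time advances. The sketch below passes the clock in explicitly so it stays deterministic; `DelayedTriggers`, `run_due`, and the callback signature are hypothetical, and a production service would more likely rely on real timers or a scheduler.

```python
from typing import Callable, Dict, List


class DelayedTriggers:
    """One pending offline trigger per publisher, reset on every heartbeat."""

    def __init__(self, timeout: float, on_offline: Callable[[str, float], None]) -> None:
        self.timeout = timeout
        self._on_offline = on_offline
        self._deadlines: Dict[str, float] = {}
        self._last_active: Dict[str, float] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        # Creates the trigger if it is missing, otherwise pushes its deadline back.
        self._last_active[user_id] = now
        self._deadlines[user_id] = now + self.timeout

    def run_due(self, now: float) -> List[str]:
        """Fire every trigger whose deadline has passed; return the affected user IDs."""
        due = [user_id for user_id, deadline in self._deadlines.items() if deadline <= now]
        for user_id in due:
            del self._deadlines[user_id]
            # The offline event carries the last active timestamp.
            self._on_offline(user_id, self._last_active.pop(user_id))
        return due
```

With a 35-second timeout, heartbeats at seconds 0 and 30 move the deadline to second 65, so nothing fires at second 40 and a single offline event with last active time 30 fires at second 65.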
Figure 11: Actor model in the presence platform
The actor model can be used to implement the presence service for improved performance. An actor is an extremely lightweight object that can receive messages and take actions to handle the messages. A thread will be assigned to an actor when a message must be processed. The thread is released once the message is processed, and the thread is subsequently assigned to the next actor. The total count of actors in the presence platform will be equal to the total count of online users. The lifecycle of an actor depends on the online status of the corresponding client. The following operations are executed when the presence service receives a heartbeat signal:
- Create an actor in the presence service if an actor doesn’t already exist for the particular client.
- Set a delayed trigger on the actor for publishing an offline event when the timeout interval elapses.
- The actor publishes an offline event when the delayed trigger gets executed.
Every delayed trigger should be drained before decommissioning the presence service for improved reliability of the real-time presence platform.
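The actor lifecycle above can be approximated with one small object per online client and a single loop that lends itself to each actor in turn. This is a simplified, single-threaded sketch: `PresenceActor`, `ActorSystem`, the `tick` message, and the explicit clock are assumptions for illustration and do not correspond to any particular actor framework.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

Message = Tuple[str, float]  # (kind, timestamp), kind is "heartbeat" or "tick"


class PresenceActor:
    def __init__(self, user_id: str, timeout: float, publish: Callable[[tuple], None]) -> None:
        self.user_id = user_id
        self.timeout = timeout
        self.publish = publish
        self.mailbox: Deque[Message] = deque()
        self.last_active: Optional[float] = None
        self.deadline: Optional[float] = None
        self.stopped = False

    def handle(self, message: Message) -> None:
        kind, now = message
        if kind == "heartbeat":
            # Set or reset the delayed trigger.
            self.last_active = now
            self.deadline = now + self.timeout
        elif kind == "tick" and self.deadline is not None and now >= self.deadline:
            # The delayed trigger fired: publish offline and end the actor.
            self.publish(("offline", self.user_id, self.last_active))
            self.stopped = True


class ActorSystem:
    def __init__(self, timeout: float, publish: Callable[[tuple], None]) -> None:
        self.timeout = timeout
        self.publish = publish
        self.actors: Dict[str, PresenceActor] = {}

    def heartbeat(self, user_id: str, now: float) -> None:
        # Create the actor on first heartbeat, then enqueue the message.
        actor = self.actors.get(user_id)
        if actor is None:
            actor = PresenceActor(user_id, self.timeout, self.publish)
            self.actors[user_id] = actor
        actor.mailbox.append(("heartbeat", now))

    def tick(self, now: float) -> None:
        for actor in self.actors.values():
            actor.mailbox.append(("tick", now))

    def run(self) -> None:
        # The loop plays the role of a thread assigned to one actor per message.
        for user_id in list(self.actors):
            actor = self.actors[user_id]
            while actor.mailbox and not actor.stopped:
                actor.handle(actor.mailbox.popleft())
            if actor.stopped:
                del self.actors[user_id]
```

Because an actor is removed once it publishes its offline event, the number of live actors tracks the number of online clients.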
How to Handle Jittery Connections of the Client
The client signing off or timing out will likely have the same status on a chat application. Therefore, the offline and timeout actions of a client can be indicated by the offline event. For IoT devices at transportation companies, a longer time interval must be set for the timeout to prevent excessive offline events from being published because the region of IoT operation might have poor network connectivity. On the contrary, the IoT devices in a home security system need a very short timeout interval so that alerts are raised quickly when the monitoring service is down. The offline event can be published by the presence platform for the following reasons:
- The client lost internet connectivity.
- The client left the platform abruptly.
The clients connected to the real-time platform through mobile devices are often on unpredictable networks. The client might disconnect and reconnect to the platform randomly. The presence platform should be able to handle jittery client connections gracefully to prevent constant fluctuations in the client’s presence status, which might result in a poor user experience and unnecessary bandwidth usage.
Figure 12: Presence platform; Heartbeat signal
The real-time platform sends periodic heartbeat signals to the presence platform with the user ID of the connected publisher and a timestamp of the heartbeat in the payload. The presence platform will show the status of the client as online while periodic heartbeats are received. The presence status can be kept online even if the client gets disconnected from the network, as long as the successive heartbeat is received by the presence platform within the defined timeout interval.
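The effect of the timeout interval on a jittery connection can be shown with a small pure function that turns a stream of heartbeat timestamps for one publisher into status transitions. The function name and the convention that an offline event is stamped at the moment the timeout elapses are assumptions for this sketch.

```python
from typing import List, Optional, Tuple


def presence_events(heartbeats: List[float], timeout: float) -> List[Tuple[str, float]]:
    """Return the online/offline transitions implied by heartbeat gaps."""
    events: List[Tuple[str, float]] = []
    previous: Optional[float] = None
    for ts in sorted(heartbeats):
        if previous is None:
            # First heartbeat: the publisher comes online.
            events.append(("online", ts))
        elif ts - previous > timeout:
            # The gap exceeded the timeout: offline was due, then online again.
            events.append(("offline", previous + timeout))
            events.append(("online", ts))
        # Gaps within the timeout produce no event, which absorbs brief disconnects.
        previous = ts
    return events
```

With heartbeats at seconds 0, 30, 50, and 200, a 60-second timeout yields online at 0, offline at 110, and online at 200, while a 200-second timeout yields only the initial online event. A longer timeout hides more jitter at the cost of slower offline detection.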
Serverless functions can be used to implement the presence service for scalability and reduced operational complexity. The REST API endpoints of the platform can also be implemented using serverless functions for easy horizontal scaling.
Scaling the Presence Platform
The presence platform should be replicated across data centers for scalability, low latency, and high availability. The presence database can make use of conflict-free replicated data type (CRDT) for active-active geo-distribution.
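To give a feel for why a conflict-free replicated data type suits presence data, the sketch below merges the last active timestamps held by two replicas by keeping the larger value per user. This is a generic state-based merge written for illustration, with a hypothetical function name; it is not a description of how any specific Redis product implements CRDTs.

```python
from typing import Dict


def merge_last_active(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Merge two replicas by keeping the most recent timestamp for every user."""
    merged = dict(a)
    for user_id, ts in b.items():
        if ts > merged.get(user_id, float("-inf")):
            merged[user_id] = ts
    return merged
```

Because taking the maximum is commutative, associative, and idempotent, replicas can exchange state in any order and any number of times and still converge on the same result.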
The presence database (Redis) should not lose the current status of the clients on a node failure. The following methods can be used to persist Redis data on persistent storage such as solid-state disk (SSD):
- Redis Database (RDB) persistence performs point-in-time snapshots of the dataset at periodic intervals
- Append Only File (AOF) persistence logs every write operation on the server for fault-tolerance
The RDB method is optimal for disaster recovery. However, there is a risk of data loss on unpredictable node failure because the snapshots are taken periodically. The AOF method is relatively more durable through an append-only log at the expense of larger storage needs. The general rule of thumb for improved reliability with Redis is to use both RDB and AOF persistence methods simultaneously.
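Enabling both persistence methods could look like the following `redis.conf` fragment. The snapshot thresholds and the fsync policy are illustrative values chosen for this example, not recommendations from the article, and should be tuned to the acceptable data-loss window.

```conf
# RDB: snapshot if at least 1 key changed in 900 s, or at least 10 keys changed in 300 s
save 900 1
save 300 10

# AOF: log every write operation and fsync the log once per second
appendonly yes
appendfsync everysec
```

With `appendfsync everysec`, roughly one second of writes is at risk on a crash, which is usually tolerable for presence data that is refreshed by the next heartbeat.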
The network hops in the presence platform are very few because the client SSE connections on the real-time platform are reused for the implementation of the presence feature. On top of that, the pipelining feature in Redis can be used to batch the query operations on the presence database to reduce the round-trip time (RTT).
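Batching lookups with a pipeline might be sketched as below, assuming a redis-py compatible client is passed in and that the timestamps live in a hash named `presence` keyed by user ID. Both the hash name and the function name are assumptions for this example.

```python
from typing import Any, Dict, Iterable, Optional


def fetch_last_active(client: Any, user_ids: Iterable[str]) -> Dict[str, Optional[Any]]:
    """Fetch the last active timestamp of many users in one round trip."""
    ids = list(user_ids)
    # transaction=False requests plain pipelining without MULTI/EXEC.
    pipe = client.pipeline(transaction=False)
    for user_id in ids:
        # Commands are buffered locally until execute() is called.
        pipe.hget("presence", user_id)
    values = pipe.execute()  # replies come back in the order the commands were queued
    return dict(zip(ids, values))
```

Looking up the status of a client's whole connection list then costs a single network round trip instead of one per connection.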
The real-time presence platform might seem conceptually trivial. However, orchestrating the real-time presence platform at scale and maintaining accuracy and reliability can be challenging.
Published at DZone with permission of N K. See the original article here.