This is a guest post by Shuxiang Zhao, Head of Technology, and Haoyang Yu, Backend Platform Engineer at Habby, in partnership with AWS.
Habby is a game studio that creates interactive entertainment to connect players worldwide. Our name combines “Hobby” and “Happy,” reflecting our mission to bring joy through gaming experiences. Player satisfaction drives everything we do—we believe in creating happiness through the games we develop and promote. We strive to deliver engaging experiences that foster meaningful connections among our global player community.
Our gaming studio offers robust chat functionality to enhance player interactions, including unicasting (one-to-one communication), broadcasting (one-to-many communication), and multicasting (group communication). Our system uses the Amazon ElastiCache for Redis OSS publish/subscribe (Pub/Sub) functionality to send chat messages. However, we faced challenges with connection stability during infrastructure changes, such as instance scaling, Redis OSS version upgrades, and hardware failures. These issues would force client reconnections, resulting in lost messages and diminished player experience.
We adopted Valkey GLIDE, a client library for Amazon ElastiCache for Valkey and Redis OSS, to address our system challenges. Valkey GLIDE is an AWS-backed open source project designed for reliability, optimized performance, and high availability for Valkey and Redis OSS based applications. It is a multi-language client pre-configured with best practices learned from over a decade of operating Redis OSS-compatible services used by hundreds of thousands of AWS customers. During failover testing, the solution maintained near-zero disruptions while handling 500,000 concurrent players and processing 100,000 queries per second (QPS). Implementation required only two weeks of development effort and significantly improved system reliability. The solution leverages Amazon ElastiCache’s architecture, which supports up to 500 nodes and includes sharded publish/subscribe functionality, allowing us to scale both player capacity and query processing. This infrastructure serves as a foundation for our distributed network middleware, enabling seamless communication across our distributed service architecture.
This post describes our messaging system architecture and explains how we improved system reliability by using Valkey GLIDE as the client communicating with Amazon ElastiCache.
Message delivery system architecture
The message delivery system interacts with both game players and the Gaming Instant Messaging (IM) service. For game players, it enables login and inter-player communication. For the IM service, it manages bidirectional message flow, allowing both receipt and transmission of messages as required by the game logic.
The message delivery system is implemented using a layered architecture:
- WebSocket servers operate in Amazon Elastic Kubernetes Service (Amazon EKS) behind an Application Load Balancer (ALB). These servers manage player interactions, including login authentication, connection management, and message handling.
- The IM service handles incoming requests from WebSocket servers and manages two primary functions. First, it processes player requests for group joining and message sending. Then, after executing its business logic, it invokes the REST API Server to deliver the corresponding messages.
- REST API servers in Amazon EKS support three core functions: player login to WebSocket servers, group membership management, and message transmission.
- The Amazon ElastiCache cluster manages message delivery between WebSocket and REST API servers through two core functions. First, as a metadata storage system, it maintains mappings among players, WebSocket servers, and chat groups, tracking relationships such as player-to-channel, player-to-group, and group-to-channel. Second, it handles message exchange using Pub/Sub functionality, where WebSocket servers subscribe to specific channels and REST API servers publish messages. When REST API servers receive a message request, they parse the message, query the metadata store for target channels, and publish the message to the appropriate channels.
The system has three phases in message delivery: pre-registration, message processing, and message delivery.
Phase 1: Pre-registration
This phase establishes initial connections and channel mappings:
- Game players register through the ALB to connect with a WebSocket server.
- Each WebSocket server maps to a unique channel and sends a subscription request to the Amazon ElastiCache cluster.
- The WebSocket server sends a “player login WebSocket server” request to the REST API server.
- The REST API server stores player-to-channel mappings in the Amazon ElastiCache cluster.
Phase 2: Message processing and distribution types
Message processing in the system operates in three distinct ways: unicast, broadcast, and multicast distribution.
In unicast distribution, when Player A sends a message to Player B, the message flows through a specific path. It starts from the player, moves to the WebSocket server, continues through IM, and reaches the REST API server. The REST API server then queries the Amazon ElastiCache cluster to identify Player B’s channel. Once identified, it publishes the message to the Amazon ElastiCache cluster, where Player B’s WebSocket server receives it through the Pub/Sub mechanism.
Broadcast distribution occurs when Player A needs to send a message to all players. The message follows the same initial path: from player to WebSocket server, through IM, to the REST API server. The REST API server retrieves all available channels from the Amazon ElastiCache cluster and publishes the message to every channel in the cluster. This makes sure all WebSocket servers receive the message for distribution.
Multicast distribution, used for group messaging, operates in two phases. In the first phase, when a player joins a group, the join request travels from the WebSocket server through IM to the REST API server. The REST API server then updates the Amazon ElastiCache cluster with two crucial mappings: player-to-group and group-to-channel. In the second phase, when a player sends a group message, it follows the standard path through the WebSocket server, IM, and REST API server. The REST API server queries the Amazon ElastiCache cluster for the channels mapped to the group and publishes the message to these channels. Only the WebSocket servers serving players within the specified group receive and distribute the message.
Phase 3: Message delivery to players
The WebSocket server manages final message delivery through a subscription-based mechanism. The delivery process varies based on the message type:
- Unicast – For unicast messages, the process is straightforward. The WebSocket server receives the message from the Amazon ElastiCache cluster through its channel subscription and delivers it directly to Player B, the intended recipient.
- Broadcast – In broadcast scenarios, after receiving the message from the Amazon ElastiCache cluster, the WebSocket server distributes it to all connected players within its network. This provides system-wide message propagation.
- Multicast – For multicast messages targeting specific groups, the WebSocket server first queries the Amazon ElastiCache cluster to retrieve the complete player roster for the target group. It then delivers the message exclusively to those players identified as group members.
This subscription-based architecture provides precise message routing and efficient delivery while maintaining system scalability. By using the Pub/Sub capabilities of Amazon ElastiCache for Redis OSS, the system minimizes latency and optimizes resource utilization during message distribution.
The challenge
The message delivery system demands enterprise-level performance and reliability to serve players worldwide. The system operates on Amazon EKS and integrates with Amazon ElastiCache for message handling. Both WebSocket servers and REST API servers must automatically detect Amazon ElastiCache topology changes, adapt seamlessly to node failures, and efficiently handle operating system (OS) patches and software updates. Additionally, the REST API server must maintain precise tracking of subscriber counts and provide accurate message delivery verification.
To support these requirements, the Redis OSS/Valkey client must provide real-time topology change detection, subscriber count tracking, scalable Pub/Sub implementation, and automated failover management. These capabilities are fundamental to maintaining our system’s reliability and performance standards.
We evaluated several Node.js libraries for Redis OSS/Valkey, including ioredis and node-redis, but identified critical reliability limitations in these solutions. Key shortcomings included their inability to automatically recover from topology failures and the lack of message delivery confirmation for publishers. Though developing a custom Redis OSS/Valkey client seemed like a potential alternative, this approach would have necessitated building sophisticated reconnection logic, topology change management, and comprehensive failover handling mechanisms. The substantial development complexity and ongoing maintenance requirements made this option impractical for our production environment.
Valkey GLIDE to the rescue
Fortunately, we discovered Valkey GLIDE, an AWS-supported open source project written in Rust that provides native support for Java, Python, and Node.js. We selected Valkey GLIDE for the following key benefits:
- Robust failover system – The system maintains dedicated connection pools for subscriptions and marks original connections with specific identifiers. During failures, it automatically establishes new connections, and we have verified connection stability during Amazon ElastiCache version updates.
- Primary node subscription – Valkey GLIDE enables direct subscription to primary nodes, allowing publishers to confirm message delivery status, thereby providing reliable message tracking and verification.
- Customizable retry configuration – The system offers configurable retry logic with adjustable intervals. You can customize retry timeframes, typically between 5–10 seconds, to match your application requirements.
- Independent subscription client – The subscription client operates as a standalone component with unified responsibility for subscription management. This separation of concerns results in cleaner, more maintainable code architecture.
Valkey GLIDE provides a unified interface for Redis OSS/Valkey Pub/Sub support, seamlessly handling sharded, cluster, and standalone setups. One of its key features is real-time topology management, which maintains continuous subscriptions even during connectivity issues. This capability enables Valkey GLIDE to implement automatic reconnection and resubscription mechanisms. Valkey GLIDE maintains subscription health through periodic checks of both connections and topology. It also employs an adaptive mechanism to update topology information in response to server errors. These proactive measures allow Valkey GLIDE to continuously maintain an up-to-date topology map, facilitating swift resubscriptions in the event of connection errors or topology changes. This robust design makes sure that Valkey GLIDE offers a resilient and adaptable solution for Redis OSS/Valkey Pub/Sub implementation across various configurations, providing enhanced reliability even in dynamic network environments.
Use Valkey GLIDE for Pub/Sub functionality in a Node.js application
In this section, we explore how to implement a sample player messaging system using Valkey GLIDE and Amazon ElastiCache. We demonstrate how to use Redis OSS/Valkey’s Pub/Sub capabilities in a clustered environment, with a focus on best practices and reliable implementation patterns.
Let’s break down the key components of our messaging system.
Message interface
First, we define our message structure:
Each message contains sender information, message content, and timestamp. When a player sends a message, the message is packaged with sender details and timestamp. It’s sent through encrypted channels to Amazon ElastiCache. Amazon ElastiCache distributes the message to intended recipients. Recipients receive and display the message in real time.
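In TypeScript, a minimal version of this structure might look like the following sketch (field names are illustrative, not our exact production schema):

```typescript
// Illustrative message structure used throughout the examples in this post.
interface Message {
  sender: string;     // ID of the player who sent the message
  content: string;    // The chat text itself
  timestamp: number;  // Unix epoch milliseconds at send time
}
```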
Player class implementation
The Player class serves as our main component and handles individual player connections and messaging between players. Each instance of the Player class represents a player with a unique ID and manages its connection to an ElastiCache cluster through GlideClusterClient. See the following code:
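A simplified sketch of the class structure, assuming the @valkey/valkey-glide Node.js client, might look like the following; the method bodies and the configuration helpers (getBaseConfig, getPublisherConfig, getSubscriberConfig) are sketched in the sections that follow:

```typescript
import { GlideClusterClient, PubSubMsg } from "@valkey/valkey-glide";

// Simplified Player skeleton; method bodies are filled in over the next sections.
class Player {
  // Connection state is tracked through the client reference itself.
  private client: GlideClusterClient | null = null;

  constructor(private readonly playerId: string) {}

  // Establishes a TLS connection and the channel subscriptions for this player.
  async connect(): Promise<void> {}

  // Closes the connection and releases the client reference.
  async disconnect(): Promise<void> {}

  // Publishes a direct message to another player's private channel.
  async sendTo(targetPlayerId: string, content: string): Promise<void> {}

  // Publishes a broadcast message to the global channel.
  async sendGlobal(content: string): Promise<void> {}

  // Callback invoked by the client for every message on a subscribed channel.
  private handleMessage(msg: PubSubMsg): void {}
}
```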
Player connection management
The Player class manages connections through the connect() and disconnect() methods. The connect() method establishes a secure TLS connection to the ElastiCache cluster using GlideClusterClient, with a subscriber configuration that includes both player-specific and global channel subscriptions; if the connection fails, error handling performs proper cleanup through disconnect(). The disconnect() method safely closes the client connection and nullifies the client reference. Both methods log connection status and errors to the console.
When a player joins the game, the system automatically:
- Creates encrypted connections to Amazon ElastiCache, using TLS for security
- Sets up message channels, giving each player two communication channels:
  - A private channel for direct messages
  - A global channel for broadcast messages
- Handles connection errors
See the following code:
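Inside the Player class, a simplified connect() might look like the following sketch (the subscriber configuration helper is shown in the next sections; the log messages are illustrative):

```typescript
// Sketch of Player.connect(); getSubscriberConfig() is defined in the
// "Subscriber configuration" section that follows.
async connect(): Promise<void> {
  try {
    // Create a cluster client whose subscriber configuration registers the
    // player-specific channel and the global broadcast channel over TLS.
    this.client = await GlideClusterClient.createClient(this.getSubscriberConfig());
    console.log(`Player ${this.playerId} connected`);
  } catch (error) {
    console.error(`Player ${this.playerId} failed to connect:`, error);
    // Release any partially created resources before surfacing the error.
    await this.disconnect();
    throw error;
  }
}
```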
Connection configuration
The connection configuration is structured in three layers through the getBaseConfig(), getPublisherConfig(), and getSubscriberConfig() methods, where the base configuration establishes core connection parameters, including the Amazon ElastiCache endpoint and TLS security settings. The subscriber configuration extends the base config by adding Pub/Sub channel subscriptions for both player-specific (player:${playerId}) and global channels, along with read preferences set to preferReplica and a message handling callback. The configuration system provides secure and efficient ElastiCache cluster connectivity while maintaining separation of concerns between publishing and subscription functionalities.
A secure connection setup with Amazon ElastiCache achieves the following:
- Establishes core connection parameters
- Enforces TLS for security
- Uses cluster configuration endpoint for high availability
See the following code:
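A sketch of the base and publisher helpers inside the Player class might look like the following (the endpoint shown is a placeholder for your cluster configuration endpoint; the configuration field names follow the @valkey/valkey-glide client):

```typescript
// Core connection parameters shared by publisher and subscriber clients.
// Replace the host with your ElastiCache cluster configuration endpoint.
private getBaseConfig(): GlideClusterClientConfiguration {
  return {
    addresses: [
      { host: "my-chat-cluster.xxxxxx.clustercfg.use1.cache.amazonaws.com", port: 6379 },
    ],
    useTLS: true, // enforce in-transit encryption
  };
}

// Publishers only need the base connection; they do not register subscriptions.
private getPublisherConfig(): GlideClusterClientConfiguration {
  return { ...this.getBaseConfig() };
}
```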
Subscriber configuration
The subscriber configuration, implemented in getSubscriberConfig(), extends the base configuration by adding specific Pub/Sub channel subscriptions and read preferences for the ElastiCache cluster connection. It sets up exact channel matching for both player-specific channels (player:${playerId}) and a global channel ('global'), while configuring the system to prefer replica nodes for read operations through the preferReplica setting. The configuration binds the handleMessage callback to process incoming messages and inherits the secure TLS settings from the base configuration, creating a complete subscription setup for real-time message handling.
Channel subscription setup for message routing achieves the following:
- Implements read replica preference for load distribution
- Uses channel pattern matching for message routing
- Configures two subscription channels:
  - Player-specific channel (player:${playerId})
  - Global broadcast channel ('global')
- Binds message handler to maintain correct context
See the following code:
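Building on the base configuration, a getSubscriberConfig() sketch for the Player class might look like the following (the pubsubSubscriptions shape and the PubSubChannelModes enum follow the @valkey/valkey-glide cluster configuration; channel names match the conventions above):

```typescript
import { GlideClusterClientConfiguration } from "@valkey/valkey-glide";

// Extends the base configuration with Pub/Sub subscriptions and read preferences.
private getSubscriberConfig(): GlideClusterClientConfiguration {
  return {
    ...this.getBaseConfig(),
    readFrom: "preferReplica", // route reads to replica nodes to spread load
    pubsubSubscriptions: {
      channelsAndPatterns: {
        // Exact-match subscriptions: this player's private channel plus the global channel.
        [GlideClusterClientConfiguration.PubSubChannelModes.Exact]: new Set([
          `player:${this.playerId}`,
          "global",
        ]),
      },
      // Bind the handler so incoming messages keep the correct `this` context.
      callback: this.handleMessage.bind(this),
    },
  };
}
```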
Connection management
The connection management system establishes and maintains secure player connections through the GLIDE cluster client, using TLS encryption and Amazon ElastiCache clustering capabilities, where each connection is configured with both private (player:${playerId}) and global broadcast channels. Upon connection initialization, the system creates a client instance with subscriber configurations that define channel subscriptions and message handling callbacks, while implementing comprehensive error handling that includes resource cleanup through the disconnect() method. The connection state is actively monitored through the client instance (private client: GlideClusterClient | null = null), maintaining proper resource management and graceful disconnection handling, with connection events being logged for system monitoring and troubleshooting purposes.
When a player initiates a connection, the system:
- Creates a GLIDE cluster client
- Establishes subscription channels for each player
- Handles connection errors
- Logs connection status
See the following code:
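The following sketch illustrates this connection lifecycle from the caller's side (the onPlayerJoin and onPlayerLeave helpers are illustrative and not part of our production code):

```typescript
// Illustrative connection lifecycle for a single player.
async function onPlayerJoin(playerId: string): Promise<Player> {
  const player = new Player(playerId);

  // connect() creates the GLIDE cluster client and registers the
  // player-specific and global channel subscriptions.
  await player.connect();
  return player;
}

async function onPlayerLeave(player: Player): Promise<void> {
  // disconnect() closes the client and releases its resources.
  await player.disconnect();
}
```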
The asynchronous disconnect() function is responsible for safely disconnecting a player from the system. It first checks if a client connection exists, then closes the connection and sets the client reference to null to release resources. A log message confirms the disconnection, and errors encountered during the process are caught and logged for debugging. This provides graceful handling of player disconnections while maintaining system stability.
See the following code:
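A matching disconnect() sketch for the Player class might look like the following (error handling and log text are illustrative):

```typescript
// Safely tears down the player's connection and releases the client reference.
async disconnect(): Promise<void> {
  try {
    if (this.client) {
      this.client.close();      // close the underlying connection
      this.client = null;       // release the reference to free resources
      console.log(`Player ${this.playerId} disconnected`);
    }
  } catch (error) {
    console.error(`Error disconnecting player ${this.playerId}:`, error);
  }
}
```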
Sending and receiving messages
Messages are sent through two primary methods: direct player-to-player communication using sendTo(targetPlayerId, content) or broadcast messages using sendGlobal(content), where each message is packaged with sender information, content, and a timestamp before being published through Pub/Sub channels. On the receiving end, messages arrive through subscribed channels (either player-specific or global), where the handleMessage function processes them by converting from binary format, validating the content, and delivering to the appropriate player’s interface with proper formatting and timing information. This bidirectional flow is managed through secure, encrypted connections, with automatic error handling and delivery confirmation, providing reliable real-time communication between players while maintaining system stability and performance.
When a player sends a message, the system:
- Validates the connection state
- Constructs the message with sender details, message content, and timestamp
- Performs JSON serialization of the message object and publishes the message to a channel
- Asynchronously delivers the message to subscribers of the channel
- Recipients receive and display the message in real time
- Performs error handling for disconnected states
The following are examples of the two types of messages:
- Global messaging or broadcast to a global channel:
  - Player sends a global message
  - System broadcasts to all connected players
  - Everyone receives the announcement
The following is an example of a broadcast message to all players:
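A sendGlobal() sketch for the Player class might look like the following (for brevity, the same client that subscribes also publishes; a separate publisher client built from getPublisherConfig() works the same way):

```typescript
// Broadcast a message to every connected player via the shared "global" channel.
async sendGlobal(content: string): Promise<void> {
  if (!this.client) {
    throw new Error(`Player ${this.playerId} is not connected`);
  }
  const message: Message = {
    sender: this.playerId,
    content,
    timestamp: Date.now(),
  };
  // publish() resolves to the number of subscribers that received the message.
  const receivers = await this.client.publish(JSON.stringify(message), "global");
  console.log(`Global message delivered to ${receivers} subscriber(s)`);
}
```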
- Direct messaging to a specific player based on a PlayerID:
  - Player1 writes a message
  - System packages it with sender information, timestamp, and message content
  - Message gets delivered to Player2
The following is an example of a player sending a direct message:
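A corresponding sendTo() sketch, following the player:${playerId} channel naming convention described above:

```typescript
// Send a direct message to a single player's private channel.
async sendTo(targetPlayerId: string, content: string): Promise<void> {
  if (!this.client) {
    throw new Error(`Player ${this.playerId} is not connected`);
  }
  const message: Message = {
    sender: this.playerId,
    content,
    timestamp: Date.now(),
  };
  // Publish to the recipient's player-specific channel.
  const receivers = await this.client.publish(
    JSON.stringify(message),
    `player:${targetPlayerId}`,
  );
  console.log(`Direct message to ${targetPlayerId} reached ${receivers} subscriber(s)`);
}
```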
Processing incoming messages
When a message arrives, the handleMessage function processes it by converting the raw message from a binary format to a string, then parses it into a strongly typed Message object containing sender information, content, and timestamp data. The system then determines the message channel type (either global for broadcasts or player-specific for direct messages) and routes it accordingly, while maintaining secure delivery channels and implementing error handling at each step. The processed message is then delivered to the intended recipients with proper formatting and timing information, all managed through Redis OSS/Valkey’s Pub/Sub capabilities, providing reliable and scalable real-time communication.
See the following code:
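Inside the Player class, a handleMessage() sketch might look like the following (PubSubMsg is the Pub/Sub message type exposed by @valkey/valkey-glide; the log formatting is illustrative):

```typescript
// Callback invoked by the GLIDE client for every message on a subscribed channel.
private handleMessage(msg: PubSubMsg): void {
  try {
    const channel = msg.channel.toString();
    // Payloads may arrive as binary data, so normalize to a string before parsing.
    const parsed: Message = JSON.parse(msg.message.toString());
    const sentAt = new Date(parsed.timestamp).toISOString();

    if (channel === "global") {
      console.log(`[broadcast] ${parsed.sender} at ${sentAt}: ${parsed.content}`);
    } else {
      console.log(`[direct] ${parsed.sender} to ${this.playerId} at ${sentAt}: ${parsed.content}`);
    }
  } catch (error) {
    console.error("Failed to process incoming message:", error);
  }
}
```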
Complete message flow example
The complete message flow consists of the following stages:
- Initialization – Player instances are created and connections are established for each player.
- Communication – Players send direct or broadcast messages. For example:
  - Direct message: Alice to Bob
  - Global broadcast: Bob to all players
- Message processing – Messages get processed and delivered to appropriate channels. Subscribers receive all messages.
- Cleanup – Connections are gracefully terminated and resources are cleaned up.
See the following code:
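A condensed sketch of this flow, using the Player class from the previous sections (player names and messages are illustrative):

```typescript
async function main(): Promise<void> {
  // Initialization: create and connect two players.
  const alice = new Player("alice");
  const bob = new Player("bob");
  await alice.connect();
  await bob.connect();

  // Communication: a direct message and a global broadcast.
  await alice.sendTo("bob", "Hi Bob, ready for the raid?");
  await bob.sendGlobal("The raid starts in five minutes!");

  // Give the Pub/Sub callbacks a moment to fire before shutting down.
  await new Promise((resolve) => setTimeout(resolve, 1000));

  // Cleanup: gracefully terminate connections and release resources.
  await alice.disconnect();
  await bob.disconnect();
}

main().catch(console.error);
```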
You can either poll for messages using the non-blocking tryGetPubSubMessage(), or wait for a message using the asynchronous getPubSubMessage() method. For more details, refer to PubSub Support.
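For example, a pull-based consumer might look like the following sketch (this assumes the client was configured with subscriptions but without a callback, because callback delivery and polling are alternative consumption modes):

```typescript
// Sketch of pull-based message consumption with a subscribed GlideClusterClient.
async function drainMessages(client: GlideClusterClient): Promise<void> {
  // Non-blocking: returns null immediately if no message is queued.
  const pending = client.tryGetPubSubMessage();
  if (pending !== null) {
    console.log(`Queued message on ${pending.channel}: ${pending.message}`);
  }

  // Blocking: resolves only when the next message arrives.
  const next = await client.getPubSubMessage();
  console.log(`Next message on ${next.channel}: ${next.message}`);
}
```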
Results
The Valkey GLIDE migration proved efficient, with minimal disruptions during failover testing. The system currently handles 500,000 concurrent players, each sending a message every 5 seconds, which produces a load of roughly 100,000 QPS. The 500-node capacity and sharded Pub/Sub architecture of Amazon ElastiCache provide scalability for future expansion.
The migration process, including code refactoring and testing, was completed in two weeks and delivered significant reliability improvements. Our comprehensive failover testing included engine upgrades and confirmed system stability with predictable minimal application disruptions. The Valkey GLIDE client allows configuration of retry intervals, providing flexibility to balance between unavailability windows and performance. The current message delivery system efficiently handles 500,000 concurrent players, each sending one message every 5 seconds. This results in approximately 100,000 Amazon ElastiCache queries per second.
Looking ahead, the highly scalable architecture of Amazon ElastiCache, paired with its sharded Pub/Sub design, strategically positions our message delivery system for robust growth and high performance. This infrastructure not only supports our existing player base but also provides the foundation for expanding to additional systems, providing long-term scalability and reliability.
Conclusion
The implementation of Valkey GLIDE with Amazon ElastiCache has significantly improved our system’s reliability. The fault-tolerant architecture provides automatic reconnection capabilities and real-time message delivery monitoring, substantially reducing service disruptions. Valkey GLIDE includes Availability Zone (AZ) affinity and supports multiple programming languages including Java, Python, and Node.js. Based on these improvements in system reliability and performance, we recommend Valkey GLIDE for similar high-availability requirements.
About the Authors
Shuxiang Zhao is the Head of Technology at Habby, with over 15 years of experience in software development and system architecture. He specializes in designing and managing large-scale game backend platforms, with deep expertise in AWS services, including Amazon EKS, DynamoDB, Aurora, and ElastiCache. He is skilled in building high-availability, high-concurrency systems.
Haoyang Yu is a Backend Platform Engineer at Habby, with 7 years of professional experience in software development. He has previously worked as a Game Development Engineer, where he gained valuable experience in designing and implementing efficient systems for interactive and dynamic applications. Currently, Haoyang focuses on Habby’s backend architecture, particularly contributing to the integration of the IM gateway layer.
Lili Ma is a Senior Database Solutions Architect at AWS with over 10 years of specialized experience in database technologies. Her background spans multiple database paradigms, including NoSQL systems (Hadoop/Hive R&D), enterprise databases (IBM DB2), distributed data warehousing (Greenplum and Apache HAWQ), and cloud-native databases (Amazon Aurora, Amazon ElastiCache, and Amazon MemoryDB). She leverages this diverse expertise to design and implement optimized database solutions for clients.
Xin Zhang is a Senior Solutions Architect at AWS, with over 15 years of experience in the gaming industry and 5 years in the VoIP communications sector. Before joining AWS, Xin worked on the frontline of PC MMORPG and mobile game development, including in-house game engine R&D. With 9 years of AWS experience, Xin is currently focused on leveraging Generative AI and Generative Development to enhance Unreal Engine workflows and boost game developer productivity.
Siva Karuturi is a Sr. Specialist Solutions Architect for In-Memory Databases. Siva specializes in various database technologies (both Relational & NoSQL) and has been helping customers implement complex architectures and providing leadership for In-Memory Database & analytics solutions including cloud computing, governance, security, architecture, high availability, disaster recovery and performance enhancements. Off work, he likes traveling and tasting various cuisines Anthony Bourdain style!