Amazon ElastiCache is a Valkey- and Redis OSS-compatible, in-memory cache that powers real-time applications with microsecond latency. ElastiCache is often used for demanding, latency-sensitive workloads such as caching, gaming leaderboards, media streaming, and session stores, where every microsecond counts.
Modern applications are built as a group of microservices, and the latency for one component can impact the performance of the entire system. Monitoring latency is critical for maintaining optimal performance, enhancing user experience, and maintaining system reliability. In this post, we explore ways to monitor latency, detect anomalies, and troubleshoot high-latency issues effectively for your self-designed (node-based) ElastiCache clusters.
Monitoring latency in Amazon ElastiCache
To achieve consistent performance, you should monitor end-to-end client-side latency, measuring the round-trip latency between Valkey clients and the ElastiCache engine. This can help identify bottlenecks across the stack. If you observe high end-to-end latency on your cluster, the latency could originate from server-side operations, client-side operations, or increased network latency. The following diagram illustrates the difference between the round-trip latency measured from the client and the server-side latency.
To observe the server-side latency, we introduced the SuccessfulReadRequestLatency and SuccessfulWriteRequestLatency CloudWatch metrics, which provide a precise measure of the time that the ElastiCache for Valkey engine takes to respond to a successfully executed request, in microseconds. These new metrics are available from ElastiCache version 7.2 for Valkey or newer for self-designed clusters. These metrics are computed and published once per minute for each node. You can aggregate the metric data points over specified periods of time using various CloudWatch statistics such as Average, Sum, Min, Max, SampleCount, and any percentile between p0 and p100. The sample count includes only the commands that were successfully executed by the ElastiCache server. Additionally, the ErrorCount metric is the aggregated count of commands that failed to execute.
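As a quick illustration, the following is a minimal sketch (using boto3 in Python) of retrieving the per-minute p99 of the SuccessfulWriteRequestLatency metric for one node. The Region and the CacheClusterId value my-valkey-cluster-0001-001 are placeholder assumptions; substitute your own node identifier.

```python
# Minimal sketch: pull the per-minute p99 of SuccessfulWriteRequestLatency
# for one node over the last hour. Region and node ID are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

end = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="SuccessfulWriteRequestLatency",
    Dimensions=[{"Name": "CacheClusterId", "Value": "my-valkey-cluster-0001-001"}],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
    Period=60,  # the metric is published once per minute per node
    ExtendedStatistics=["p99"],
    Unit="Microseconds",
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["ExtendedStatistics"]["p99"], "microseconds")
```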
The SuccessfulReadRequestLatency and SuccessfulWriteRequestLatency metrics measure latency across the various stages of command processing, including preprocessing, command execution, and postprocessing, within the ElastiCache for Valkey engine. The ElastiCache engine uses enhanced I/O threads to handle network I/O from concurrent client connections. Commands are then queued up to be executed sequentially by the main thread. Responses are sent back to the clients through I/O write threads, as illustrated in the following diagram. The request latency metrics capture the time taken to process a request completely through these stages. The existing command-specific latency metrics, such as GetTypeCmdsLatency and SetTypeCmdsLatency, measure only the CPU time taken to execute the core command logic.
The SuccessfulReadRequestLatency and SuccessfulWriteRequestLatency metrics are useful when troubleshooting performance issues with an application that uses an ElastiCache cluster. In this section, we present practical examples of how to put these metrics to use.
Let’s say you observe high write latency while your application writes data to an ElastiCache cluster. We recommend inspecting the SuccessfulWriteRequestLatency metric, which provides the time taken to process all successfully executed write requests in 1-minute intervals. If you observe elevated latency on the client side but no corresponding increase in the SuccessfulWriteRequestLatency metric, it’s a good indicator that the ElastiCache for Valkey engine is unlikely to be the primary cause of the high latency. In this scenario, inspect client-side resources such as memory, CPU, and network utilization to diagnose the cause. If there is an increase in the SuccessfulWriteRequestLatency metric, we recommend following the steps in the next section to troubleshoot the issue.
For most use cases, we recommend that you monitor the p50 statistic of the SuccessfulReadRequestLatency and SuccessfulWriteRequestLatency metrics. If your application is sensitive to tail latencies, monitor the p99 or p99.99 statistic.
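For example, the following is a minimal sketch (boto3) of a CloudWatch alarm on the p99 statistic. The node ID, the 5,000-microsecond threshold, and the SNS topic ARN are placeholder assumptions to tune for your workload.

```python
# Minimal sketch: alarm when the p99 of SuccessfulWriteRequestLatency breaches
# a threshold in 3 of the last 5 one-minute periods. All identifiers are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="valkey-write-latency-p99",
    Namespace="AWS/ElastiCache",
    MetricName="SuccessfulWriteRequestLatency",
    Dimensions=[{"Name": "CacheClusterId", "Value": "my-valkey-cluster-0001-001"}],
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=5,
    DatapointsToAlarm=3,
    Threshold=5000,  # microseconds; placeholder value
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:latency-alerts"],
)
```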
The following screenshot compares the Maximum statistic of the SetTypeCmdsLatency metric and the p99 statistic of the SuccessfulWriteRequestLatency metric, in microseconds, for an ElastiCache for Valkey cluster over 1-minute periods. Notice that the time taken to process a burst of write requests, indicated by the SuccessfulWriteRequestLatency metric, increases between 15:41 and 15:43 UTC because the requests are queued to be executed sequentially by the main thread, while the CPU time taken to execute the commands, represented by the SetTypeCmdsLatency metric, remains steady.
Troubleshooting high latency in ElastiCache
To troubleshoot high read or write latencies in your self-designed (node-based) ElastiCache for Valkey clusters, inspect the following aspects of your cluster.
Long-running commands
Open-source Valkey and the ElastiCache for Valkey engine run commands in a single thread. If your application runs expensive commands on large data structures, such as HGETALL or SUNION, slow execution of these commands can cause subsequent requests from other clients to wait, thereby increasing application latency.
The existing command-specific latency metrics, such as GetTypeCmdsLatency and SetTypeCmdsLatency, measure only the CPU time taken to execute the core command logic and don’t reflect the time a command spends waiting to be scheduled. The SuccessfulWriteRequestLatency and SuccessfulReadRequestLatency metrics, however, include the overall server-side latency: the time taken to read from the socket, queueing time, command processing, and the time taken to write the response back to the socket. If their p99 or p100 statistic is high (for example, relative to GetTypeCmdsLatency or SetTypeCmdsLatency), you can run the Valkey SLOWLOG command, or investigate the slow log for your cluster streamed to your ElastiCache log delivery destination, to determine which commands took longer to complete. The Valkey SLOWLOG contains details on the queries that exceed a specified runtime, and this runtime includes only the command processing time. If there are no anomalies in the SLOWLOG, it indicates that the spike in server-side latency was introduced by factors other than command processing.
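The following is a minimal sketch of pulling recent slow log entries programmatically, assuming the valkey-py client (the redis-py API is the same); the endpoint shown is a placeholder.

```python
# Minimal sketch: read the 10 most recent slow log entries from a node.
# The endpoint is a placeholder; "import redis" works the same way.
import valkey

client = valkey.Valkey(
    host="my-valkey-cluster.xxxxxx.cache.amazonaws.com",  # placeholder endpoint
    port=6379,
    ssl=True,
)

# Each entry reports the command and its execution time in microseconds;
# this duration covers command processing only, not queueing or network time.
for entry in client.slowlog_get(10):
    print(entry["id"], entry["duration"], entry["command"])
```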
Queueing time
ElastiCache for Valkey actively manages customer traffic to maintain optimal performance and replication reliability. Traffic is throttled when more commands are sent to the node than the ElastiCache engine can process, which is indicated by the TrafficManagementActive metric. If the TrafficManagementActive metric remains active and the latency metrics remain high for an extended period of time, evaluate the cluster to decide whether scaling up or scaling out is necessary. A temporary burst in traffic can also cause high queueing time, resulting in high tail latency.
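If you want to be notified of sustained throttling, the following is a minimal sketch (boto3) of an alarm that fires when TrafficManagementActive stays at 1 for 10 consecutive minutes; the node ID and evaluation window are placeholder assumptions.

```python
# Minimal sketch: alarm on sustained traffic management (10 consecutive minutes).
# Node ID and window are placeholders.
import boto3

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="valkey-traffic-management-sustained",
    Namespace="AWS/ElastiCache",
    MetricName="TrafficManagementActive",
    Dimensions=[{"Name": "CacheClusterId", "Value": "my-valkey-cluster-0001-001"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=10,
    Threshold=0,  # the metric is 0 or 1, so any value above 0 means throttling
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
)
```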
Memory utilization
An ElastiCache for Valkey node under memory pressure might use more memory than is available on the instance. In this situation, Valkey swaps data from memory to disk to free up space for incoming write operations. To determine whether a node is under memory pressure, check whether the FreeableMemory metric is low or the SwapUsage metric is greater than FreeableMemory. High swap activity on a node results in high request latency. If a node is swapping because of memory pressure, scale up to a larger node type or scale out by adding shards. You should also make sure that sufficient memory is available on your cluster to accommodate traffic bursts.
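As one way to automate this check, the following is a minimal sketch (boto3) that pulls SwapUsage and FreeableMemory for a node over the last hour and flags possible memory pressure; the node ID is a placeholder assumption.

```python
# Minimal sketch: flag a node whose swap usage exceeded freeable memory
# at any point in the last hour. The node ID is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

NODE_ID = "my-valkey-cluster-0001-001"
cloudwatch = boto3.client("cloudwatch")

def metric_query(query_id, metric_name):
    # Build one per-minute Average query for the given ElastiCache metric.
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ElastiCache",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "CacheClusterId", "Value": NODE_ID}],
            },
            "Period": 60,
            "Stat": "Average",
        },
    }

end = datetime.now(timezone.utc)
result = cloudwatch.get_metric_data(
    MetricDataQueries=[
        metric_query("swap", "SwapUsage"),
        metric_query("freeable", "FreeableMemory"),
    ],
    StartTime=end - timedelta(hours=1),
    EndTime=end,
)

values = {r["Id"]: r["Values"] for r in result["MetricDataResults"]}
if any(s > f for s, f in zip(values["swap"], values["freeable"])):
    print(f"{NODE_ID}: possible memory pressure, consider scaling up or out")
```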
Data tiering
ElastiCache for Valkey provides data tiering as a price-performance option for Valkey workloads where data is tiered between memory and local SSD. Data tiering is ideal for workloads that access up to 20% of their overall dataset regularly and for applications that can tolerate additional latency when accessing data on SSD.
The BytesReadFromDisk and BytesWrittenToDisk metrics, which indicate how much data is being read from and written to the SSD tier, can be used in conjunction with the SuccessfulWriteRequestLatency and SuccessfulReadRequestLatency metrics to determine the throughput associated with tiered operations. For instance, if the values of the SuccessfulReadRequestLatency and BytesReadFromDisk metrics are high, it might indicate that the SSD tier is being accessed more frequently relative to memory; you can scale up to a larger node type or scale out by adding shards so that more RAM is available to serve your active dataset.
Horizontal scaling for Valkey
By using online resharding and shard rebalancing for ElastiCache, you can scale your clusters dynamically with no downtime. This means that your cluster can continue to serve requests even while scaling or rebalancing is in progress. If your cluster is nearing its capacity, client write requests are throttled to allow scaling operations to proceed, which increases request processing time. This latency is also reflected in the new metrics. We recommend following the best practices for online cluster resizing, such as initiating resharding during off-peak hours and avoiding expensive commands during scaling.
Elevated number of client connections
An ElastiCache for Valkey node can support up to 65,000 client connections. A large number of concurrent connections can significantly increase CPU usage, resulting in high application latency. To reduce this overhead, we recommend following best practices such as using connection pools or reusing existing Valkey connections.
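The following is a minimal sketch of a shared connection pool, assuming the valkey-py client (the redis-py API is identical); the endpoint and pool size are placeholder assumptions sized to your application’s concurrency.

```python
# Minimal sketch: create one connection pool at startup and reuse it for all
# commands instead of opening a new connection per request. Endpoint and
# max_connections are placeholders.
import valkey

pool = valkey.ConnectionPool.from_url(
    "rediss://my-valkey-cluster.xxxxxx.cache.amazonaws.com:6379/0",  # rediss:// enables TLS
    max_connections=50,
)

# Share this client across the application; each command borrows a pooled connection.
client = valkey.Valkey(connection_pool=pool)

client.set("session:123", "payload")
print(client.get("session:123"))
```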
Conclusion
Measuring latency for an ElastiCache for Valkey instance can be approached in various ways, depending on the level of granularity required. Monitoring end-to-end latency from the client side helps identify issues across the data path, while the request latency metrics capture the time taken across the various stages of command processing, including command preprocessing, command execution, and command postprocessing.
The new request latency metrics provide a more precise measure of the time the ElastiCache for Valkey engine takes to respond to a request. In this post, we discussed a few scenarios where these latency metrics could help in troubleshooting latency spikes in your ElastiCache cluster. Using the details in this post, you can detect, diagnose, and maintain healthy ElastiCache for Valkey clusters. Learn more about the metrics discussed in this post in our documentation.
About the author
Yasha Jayaprakash is a Software Development Manager at AWS with a strong background in leading development teams to deliver high-quality, scalable software solutions. Focused on aligning technical strategies with innovation, she is dedicated to delivering impactful and customer-centric solutions.