Amazon Aurora PostgreSQL-Compatible Edition is a managed relational database service that combines the performance and availability of commercial databases with the simplicity and cost-effectiveness of open source databases. Built on a fully distributed, fault-tolerant, and self-healing storage system, Aurora PostgreSQL-Compatible is ideal for high-performance applications.
Monitoring the health of Aurora PostgreSQL instances is crucial for maintaining optimal performance, preserving reliability, and preventing potential issues before they impact your applications. On AWS, Amazon CloudWatch and Amazon RDS Performance Insights are commonly used to monitor and troubleshoot individual RDS instances. Users can create alarms and build CloudWatch dashboards to observe database metrics, and use CloudWatch Logs Insights to filter events and errors.
However, monitoring a large deployment of Aurora PostgreSQL instances poses significant challenges, particularly when resources are limited. The sheer volume of data and the complexity of managing thousands of instances make manual health checks impractical for a small team of database administrators. Administrators need a comprehensive view of the overall health of the fleet, identifying how many instances are healthy and how many are not. They also need to collect performance data over extended periods to identify persistent issues. They need an automated, scalable, and streamlined monitoring solution that helps them focus on specific issues with unhealthy instances.
Recently, AWS released CloudWatch Database Insights, addressing performance monitoring at scale for Amazon Aurora. With Database Insights, you can monitor your database fleet with pre-built, opinionated dashboards. To help you analyze the performance of your fleet, the Database Insights dashboards display curated metrics and visualizations. By presenting metrics in a single dashboard for all databases in your fleet, Database Insights allows you to monitor your databases simultaneously.
For example, you can use Database Insights to find a database that is performing poorly within a fleet of hundreds of database instances. You can then choose that instance and use Database Insights to troubleshoot issues.
While Database Insights provides a real-time view of the health of your database fleet, in this post, we present a solution that enables you to track historical periods when database instances exceeded thresholds and identify database instances that have the most frequent alarms for exceeding thresholds. You can define your own standards for what constitutes an unhealthy instance and configure the observation period to establish historical trends over time. Summary reports and visuals provide an overall view of the general status of the entire fleet. By logging the performance data for the potentially unhealthy instances into system tables, you can focus on a smaller subset of instances, dive into the details of the findings, and use the built-in troubleshooting guide as a reference. All of this makes troubleshooting less complicated, reducing the overall monitoring effort for customers with large fleets of databases.
Solution overview
The following architecture diagram depicts the workflow.
The solution has components to collect and report on Aurora PostgreSQL metrics. For collection, Aurora instances are configured to publish logs to Amazon CloudWatch, enabling real-time collection and tracking of metrics. Amazon EventBridge invokes an AWS Lambda function at a regular interval, and Amazon DynamoDB serves as the data store. The Lambda function calls the GetMetricStatistics CloudWatch API every hour to process Aurora metrics. CloudWatch provides 1 million API requests for free in its basic tier, and additional requests are charged at $0.01 per 1,000 requests. For more information about pricing, see Amazon CloudWatch Pricing.
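For reference, the hourly data point the Lambda function retrieves for a given metric can also be fetched with the AWS CLI. The following is a minimal sketch that queries the maximum CPUUtilization for a single instance over one hour; the instance identifier (database-1) and the time range are placeholders.

```bash
# Fetch the hourly maximum CPUUtilization for one Aurora instance.
# The instance identifier and the time range below are placeholders.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=database-1 \
  --start-time 2025-01-01T00:00:00Z \
  --end-time 2025-01-01T01:00:00Z \
  --period 3600 \
  --statistics Maximum \
  --unit Percent
```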
For the reporting frontend, Lambda functions handle user requests, and Amazon Simple Storage Service (Amazon S3) hosts a static website. Amazon CloudFront provides fast content delivery, and Amazon API Gateway manages API requests.
You can use the AWS Pricing Calculator to estimate the total cost of your architecture solution.
Prerequisites
To implement the solution, you need the following prerequisites:
- An AWS account.
- A fleet of Aurora PostgreSQL clusters with each instance configured with Aurora PostgreSQL logs published to Amazon CloudWatch Logs. If you don’t have existing Aurora instances, refer to Create an Aurora PostgreSQL DB cluster.
- An Amazon Elastic Compute Cloud (Amazon EC2) instance with a PostgreSQL client installed. To build the EC2 instance, refer to Launch an Amazon EC2 instance.
- The AWS Cloud Development Kit (AWS CDK) and AWS Command Line Interface (AWS CLI) installed.
- To configure the connections between your Aurora instance and the EC2 client, refer to Configure automatic network connectivity with an EC2 instance.
If you use an Amazon Linux 2023 server, use the following command to install the PostgreSQL client and the pgbench tool for PostgreSQL version 15:
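The following is a minimal sketch, assuming the Amazon Linux 2023 package names postgresql15 and postgresql15-contrib (the contrib package provides pgbench); verify the package names for your AMI version.

```bash
# Install the PostgreSQL 15 client (psql) and the contrib package that includes pgbench.
sudo dnf install -y postgresql15 postgresql15-contrib
```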
In this post, the application is set up to monitor seven Aurora PostgreSQL clusters.
- You need to define which metrics to monitor. You can monitor cluster-level and instance-level metrics based on your observability requirements. For more details, see Amazon CloudWatch metrics for Amazon Aurora. For this post, we monitor the following instance-level metrics:
- BufferCacheHitRatio (%)
- CPUUtilization (%)
- FreeableMemory (bytes)
- You also need a CloudWatch metric gathering policy. In this solution, you can define the values of the following measures. An instance is reported as unhealthy only when all of the conditions are met.
- Metric name – The instance metric you monitor.
- Threshold – The specific value for the metric.
- Metric interval – The duration of time over which a metric data point is collected and aggregated, for example, 1 hour.
- Statistics – The statistic used to aggregate the metric. Accepted values are Minimum and Maximum.
The following table summarizes the default values used for this post.
| Metric Name | Threshold | Metric Interval | Statistics |
| --- | --- | --- | --- |
| BufferCacheHitRatio | 90% | Hourly | Minimum |
| CPUUtilization | 80% | Hourly | Maximum |
| FreeableMemory | 2048000000 bytes | Hourly | Minimum |
- Lastly, you need to define an observation period. The observation period defines the amount of historical data the solution must analyze to provide the results of the analysis and recommendations. The default value in this solution is the last 90 days.
(Optional) Simulate a workload in Aurora PostgreSQL databases
To demonstrate the solution, we simulate a database workload in one of the monitored Aurora instances. If you want to do the same, complete the following steps:
- Connect to the EC2 instance where psql is installed.
- Connect to one of the Aurora instances as a privileged database user (a sample psql command is included in the sketch after this list).
- Create the example databases for the workload (sample commands are included in the same sketch).
- Run the following pgbench commands to generate CPU and memory load:
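The following is a minimal sketch of the connection, database creation, and load generation steps. The cluster endpoint, user name, and pgbench scale factor (-s 100) are placeholders; the database names example1, example2, and example3 match the ones discussed below.

```bash
# 1. Connect to one of the Aurora instances as a privileged database user.
#    The cluster endpoint and user name are placeholders; you are prompted for the password.
psql -h mycluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com -U postgres -d postgres

# 2. From the psql prompt, create the example databases, then quit:
#      CREATE DATABASE example1;
#      CREATE DATABASE example2;
#      CREATE DATABASE example3;
#      \q

# 3. Back in the shell, initialize the pgbench tables and generate CPU and memory load.
#    -c 100 simulates 100 client connections, -j 8 uses 8 worker threads,
#    and -t 80000 runs 80,000 transactions per client.
export PGHOST=mycluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com
export PGUSER=postgres
for db in example1 example2 example3; do
  pgbench -i -s 100 "$db"                # load pgbench tables (scale factor is a placeholder)
  pgbench -c 100 -j 8 -t 80000 "$db" &   # run the load in the background for each database
done
wait
```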
pgbench is a benchmarking tool specifically designed for PostgreSQL databases. We use the following options:
- -c 100 – Simulates 100 client connections
- -j 8 – Uses 8 worker threads
- -t 80000 – Each client runs 80,000 transactions
These commands generate a load on the databases example1, example2, and example3 by running a series of SELECT, INSERT, and UPDATE operations (pgbench's default TPC-B-like transaction), effectively testing the databases' performance under stress.
Deploy the solution
To deploy the solution, complete the following steps:
- Clone the GitHub repo (a sample command is included in the combined sketch after these steps).
- Connect to your AWS account using the AWS CLI. For instructions, see Sign in through the AWS Command Line Interface.
- Navigate to the repository folder and install the project dependencies (see the sketch after these steps).
- If you want to change the metrics or thresholds, edit the context.json file in the server directory to specify the metric gathering criteria:
- name – The Aurora PostgreSQL metric names you want to report on.
- scheduleDurationInHours – The metric interval. The recommended value is 1 (hour).
- threshold – The metric threshold value above or below which an alert is raised by the monitoring solution.
- statistics – Accepts two values: Minimum or Maximum. This defines whether the alert is raised for exceeding the maximum or falling below the minimum threshold value you defined.
- unit – The unit of the Aurora PostgreSQL metric specified in name, for example, Percent or Bytes.
An example using the default values from the prerequisites section is included in the sketch after these steps.
- Deploy the solution:
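The following is a minimal sketch of the clone, dependency installation, configuration, and deployment steps. The repository URL is a placeholder, the sketch assumes a Node.js-based AWS CDK project (npm install), and the context.json structure shown in the comments is illustrative only, built from the field names and default values described above; follow the repository's README for the authoritative commands and file layout.

```bash
# 1. Clone the repository (substitute the URL from this post).
REPO_URL="https://github.com/<org>/<repo>.git"   # placeholder
git clone "$REPO_URL" aurora-fleet-monitoring
cd aurora-fleet-monitoring

# 2. Install project dependencies (assumes a Node.js-based CDK project).
npm install

# 3. Review or edit the metric gathering criteria in server/context.json.
#    Illustrative structure only; match it to the file that ships with the repository:
#    {
#      "metrics": [
#        { "name": "BufferCacheHitRatio", "threshold": 90, "statistics": "Minimum",
#          "unit": "Percent", "scheduleDurationInHours": 1 },
#        { "name": "CPUUtilization", "threshold": 80, "statistics": "Maximum",
#          "unit": "Percent", "scheduleDurationInHours": 1 },
#        { "name": "FreeableMemory", "threshold": 2048000000, "statistics": "Minimum",
#          "unit": "Bytes", "scheduleDurationInHours": 1 }
#      ]
#    }

# 4. Bootstrap the target account and Region (first deployment only), then deploy all stacks.
cdk bootstrap
cdk deploy --all
```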
The deployment should complete in about 10 minutes. After the application deploys successfully, the monitoring dashboard access link is provided as an output, as shown in the following screenshot.
This is the URL for the solution's dashboard page. Open the URL in your browser.
In the next section, we review the report findings.
Review report findings and recommendations
The following screenshot shows an example of the main page of the dashboard. The top section features a pie chart for each configured metric, representing healthy and unhealthy instances in your AWS account. The lower section includes instructions for guidance.
Report on unhealthy instances for a specific metric
On the main page, choose a metric of interest, for example, CPUUtilization. Choose the red area (unhealthy), then choose the number displayed next to Resource count. In this example, two of the seven instances are unhealthy.
A new webpage opens that shows all the instances that breached the CPUUtilization threshold you defined during the monitoring window.
The metric logs include the following details:
- ID – The Aurora PostgreSQL instance ID
- Metric value – The recorded metric value that triggered a breach during the specified time range
- Date/time range – The period during which the metric value was captured
Based on the preceding report, it’s evident that the two instances are experiencing CPU resource issues. A deeper investigation is recommended to determine the root cause of the problem.
To select a specific time range for your report, instead of the default 90 days, choose the Date Range field. Select from the preset options on the list or use the Custom range option.
Root cause analysis and tuning recommendations
The application provides tuning guidance specific to Aurora PostgreSQL-Compatible for each monitored metric, helping you optimize performance efficiently. On the main dashboard, choose Info next to the metric of interest, such as CPUUtilization. The right pane displays detailed information, including the symptom, possible causes, and recommendations for resolving issues. We have created a recommendation page for each metric we monitor. Refer to the README file in the GitHub repo you cloned for detailed information.
This feature empowers you to take proactive steps toward performance optimization. By offering targeted recommendations, the solution reduces the time spent on diagnosing problems and helps make sure resources are utilized effectively, leading to improved system stability and overall database performance.
Clean up
To avoid incurring future charges, delete the resources using the following commands:
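The following is a minimal sketch, assuming the solution was deployed with the AWS CDK from the repository folder as described earlier:

```bash
# From the repository folder, remove all stacks created by the deployment.
cdk destroy --all
```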
Summary
In this post, we showed you how to achieve better visibility into the health of your Aurora PostgreSQL instances, proactively address potential issues, and maintain the smooth operation of your database infrastructure. The solution is designed to scale with your deployment, providing robust and reliable monitoring for even the largest fleets of instances. The solution can be enhanced by adding additional metrics from the available Aurora PostgreSQL metrics. Additionally, you can publish custom metrics to CloudWatch and monitor them using this solution.
Give this solution a try for your specific use case, and feel free to share your feedback or ask any questions in the comments.
About the Authors
Anand Komandooru is a Principal Cloud Architect at AWS. He joined AWS Professional Services in 2021 and helps customers build cloud-centered applications on the AWS Cloud. He has over 20 years of experience building software, and his favorite Amazon leadership principle is Leaders are right a lot.
Anubhava Srivastava is a Frontend Developer with AWS Professional Services. Anubhava is passionate about building full stack applications, with a specialization in frontend web and mobile technologies.
Noorul Hasan is a DB Migrations Consultant with AWS Professional Services. His team helps AWS customers migrate and modernize their workloads to the AWS Cloud.
Li Liu is a Principal Architect at AWS. Specializing in database migration and modernization, Li has been helping customers innovate and transform complex legacy systems into modern, scalable database solutions since joining AWS. She is passionate about leveraging cutting-edge database technologies to solve real-world business challenges.
Subhash Dike is a Sr Cloud Application Architect at AWS Professional Services. Subhash focuses on architecting, designing, and implementing sophisticated applications within AWS Cloud ecosystems. He partners with AWS customers to address complex business challenges, proposing innovative AWS solutions for migration and modernization. His passion lies in leveraging technical expertise to help AWS customers succeed through innovative cloud solutions.