As businesses continue to generate and store vast amounts of data, the need for efficient and reliable database management systems becomes increasingly important. PostgreSQL, an open source relational database management system (RDBMS), has established itself as a powerful solution for handling complex data requirements. One of the key strengths of PostgreSQL lies in its extensibility. Through a rich ecosystem of extensions and plugins, developers can enhance a database's functionality to meet specific requirements. These extensions range from spatial data support and full-text search capabilities to advanced data types and performance optimization tools. Although PostgreSQL offers a wide range of features and capabilities, one extension that often goes overlooked is pgstattuple, a tool that can provide significant value for gaining insights into the internal workings of PostgreSQL databases.
In this post, we explore pgstattuple in depth: what insights it offers, how to use it to diagnose issues in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL, and best practices for harnessing its capabilities.
Overview of pgstattuple
The pgstattuple extension provides a set of functions to query detailed statistics at the tuple level within PostgreSQL tables and indexes. This offers visibility into the physical storage layer that standard PostgreSQL statistics views don't provide.
Some of the table- and index-level metrics exposed by pgstattuple include:
- tuple_count – Number of live tuples
- dead_tuple_count – Number of dead tuples not yet cleaned up
- tuple_len – Total length of live tuples in bytes
- free_space – Total free space available in bytes
- free_percent – Percentage of free space; higher values indicate more bloat
- dead_tuple_len – Total length of dead tuples in bytes
- dead_tuple_percent – Percentage of space occupied by dead tuples
These metrics are more than just numbers; they're an early warning system for database health and performance issues. By monitoring these statistics, you can proactively identify storage issues that might be silently degrading your database performance. Whether it's excessive table bloat consuming disk space or index fragmentation slowing down queries, pgstattuple helps you spot these issues before they become critical problems.
Using pgstattuple in Aurora and Amazon RDS
Both Aurora and Amazon RDS support the pgstattuple extension. To enable it, run the CREATE EXTENSION pgstattuple; command in your database. After it's enabled, you can use functions like pgstattuple(relation) to get details on the physical storage used by a table, including the number of pages, live tuples, dead tuples, and more. The pgstattuple_approx(relation) function provides a faster estimate of these metrics. You can also get index statistics using pgstatindex(index). Analyzing this low-level data can help you identify bloated tables that need vacuuming, find tables with high dead tuple ratios that could benefit from being rewritten, and optimize your database's physical storage utilization.
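As a quick orientation, here is a minimal sketch of these calls; the orders table and orders_pkey index are illustrative names, not objects from this post:

```sql
-- Enable the extension (requires sufficient privileges)
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Exact tuple-level statistics (performs a full table scan)
SELECT * FROM pgstattuple('orders');

-- Faster, approximate statistics for large tables
SELECT * FROM pgstattuple_approx('orders');

-- B-tree index statistics
SELECT * FROM pgstatindex('orders_pkey');
```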
The output of pgstattuple provides actionable insights for monitoring, maintenance, and performance tuning, as discussed in the following sections.
Detecting and managing table bloat
Identifying and managing bloat is one of the most useful applications of pgstattuple for PostgreSQL tables. Bloat arises when UPDATE and DELETE operations leave behind unused space that isn't automatically reclaimed. PostgreSQL maintains data consistency through a Multiversion Concurrency Control (MVCC) model, where each SQL statement sees a snapshot of data from some prior point in time, regardless of the underlying data's current state. This prevents statements from viewing inconsistent data produced by concurrent transactions updating the same rows, providing transaction isolation per database session. Unlike traditional locking methods, MVCC minimizes lock contention, allowing reasonable multiuser performance.
When a row is deleted in an MVCC system like PostgreSQL, it isn't immediately removed from the data pages. Instead, it's marked as deleted or expired for the current transaction but remains visible to transactions viewing an older snapshot, avoiding conflicts. As transactions complete, these dead or expired tuples are expected to eventually be vacuumed so the space can be reclaimed. In PostgreSQL, an UPDATE operation is equivalent to a combination of DELETE and INSERT: when a row is updated, PostgreSQL marks the old version as expired (like a DELETE) but keeps it visible to older transaction snapshots, then inserts a new version of the row with the updated values (like an INSERT). Over time, the expired row versions accumulate until the VACUUM process removes them, reclaiming space. This approach underpins PostgreSQL's MVCC model, providing snapshot isolation without explicit locking during updates.
PostgreSQL's autovacuum is an automated maintenance process that reclaims storage occupied by dead tuples and updates the statistics used by the query planner. The autovacuum process triggers when a table's maximum age (in transactions) crosses autovacuum_freeze_max_age, or when the number of dead tuples exceeds the threshold autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples. In this formula, autovacuum_vacuum_threshold is the minimum number of updated or deleted tuples needed to initiate cleanup, and autovacuum_vacuum_scale_factor is the fraction of the table size added to the threshold to determine when maintenance should occur. If autovacuum fails to clean up dead tuples for some reason, you may need to handle highly bloated tables manually.
Dead tuples are stored alongside live tuples in data pages. Bloat can also result from free space within pages, for example after autovacuum has cleaned up dead tuples. During query execution, PostgreSQL scans more pages than necessary when they are filled with dead tuples, causing increased I/O and slower queries. Highly bloated tables cause database workloads to consume unnecessary read I/O, impacting application performance. If autovacuum can't keep up, you may need to clean up the bloat yourself.
Before we dive into analyzing table bloat with pgstattuple, let's make sure you have everything set up to follow along. You need access to an Amazon RDS for PostgreSQL or Aurora PostgreSQL instance, as well as a client machine with psql installed and configured to connect to your database. Make sure you have the necessary permissions to create tables and install extensions in your PostgreSQL environment. For this demonstration, we use the pgbench_accounts table. If you don't have this table already, you can create it with the pgbench utility: run pgbench -i -s 10 to initialize a pgbench schema with a scale factor of 10, which creates the pgbench_accounts table along with the other required tables. This provides sample data to work with in our analysis. Additionally, you need the pgstattuple extension installed on your database instance. If you haven't installed it yet, run CREATE EXTENSION pgstattuple; as a user with sufficient privileges. With these prerequisites in place, you're ready to explore table bloat analysis using real data in a controlled environment.
While pgstattuple provides comprehensive table bloat analysis, it can be resource-intensive. We recommend first using the lightweight bloat estimation queries documented on the PostgreSQL wiki. If more detailed analysis is required, the following example demonstrates how to use pgstattuple to analyze bloat in a table.
Create the table pgbench_accounts_test with 10,000 records.
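The post's original statements aren't reproduced here; one simple way to build such a table from the pgbench sample data is:

```sql
-- Copy 10,000 rows from the pgbench-generated table into a test table
CREATE TABLE pgbench_accounts_test AS
SELECT * FROM pgbench_accounts
WHERE aid <= 10000;
```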
In this example, querying pgstattuple returns a dead tuple count of zero and a table size of 1672 kB.
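A query along these lines shows it (the column selection is a choice):

```sql
SELECT table_len, tuple_count, dead_tuple_count, dead_tuple_percent, free_space
FROM pgstattuple('pgbench_accounts_test');
-- On the freshly created table, dead_tuple_count is 0 and
-- table_len corresponds to roughly 1672 kB
```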
To demonstrate the usage of pgstattuple, we turn off autovacuum for the table (not recommended in a production environment) and update 2,500 records.
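A sketch of those two steps, using the table created above:

```sql
-- Disable autovacuum for this table only (demonstration only;
-- don't do this in production)
ALTER TABLE pgbench_accounts_test SET (autovacuum_enabled = false);

-- Update 2,500 rows, leaving 2,500 expired row versions behind
UPDATE pgbench_accounts_test
SET abalance = abalance + 1
WHERE aid <= 2500;
```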
Now the pgstattuple data for this table shows that the 2,500 old row versions are counted as dead tuples.
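Re-running the same pgstattuple query illustrates the change:

```sql
SELECT tuple_count, dead_tuple_count, dead_tuple_percent
FROM pgstattuple('pgbench_accounts_test');
-- dead_tuple_count should now report 2500
```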
bloat_percentage in PostgreSQL refers to the ratio of reclaimable space in a table or index to its total size. It can be calculated from pgstattuple data.
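One common formulation counts dead tuple space plus free space against the total table length; this is an interpretation, not necessarily the post's exact SQL:

```sql
SELECT round(
         (dead_tuple_len + free_space) * 100.0 / table_len,
         2
       ) AS bloat_percentage
FROM pgstattuple('pgbench_accounts_test');
```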
A bloat_percentage value exceeding 30–40% often indicates problematic bloat requiring attention. To clean up bloat, use the VACUUM command.
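Applied to the demonstration table from earlier:

```sql
VACUUM pgbench_accounts_test;
```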
Let's inspect the pgstattuple data after the VACUUM operation.
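Re-running the query shows the effect:

```sql
SELECT table_len, dead_tuple_count, free_space, free_percent
FROM pgstattuple('pgbench_accounts_test');
-- dead_tuple_count drops to 0, while table_len is unchanged;
-- the reclaimed space remains in the table as free space
```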
The VACUUM operation resets dead_tuple_count to 0. The freed space remains allocated to the table and is available for subsequent insert and update operations on the same table, which is why table_len stays the same even after VACUUM. To return the disk storage occupied by bloat to the operating system, there are two options:
- VACUUM FULL – VACUUM FULL can reclaim more disk space but runs much more slowly. It requires an ACCESS EXCLUSIVE lock on the table it’s working on, and therefore can’t be done in parallel with other uses of the table. While VACUUM FULL operations are generally discouraged in production environments, they might be acceptable during scheduled maintenance windows where downtime is planned and approved.
- pg_repack – pg_repack is a PostgreSQL extension that efficiently removes table and index bloat while maintaining online availability. Unlike CLUSTER and VACUUM FULL, it minimizes the duration of exclusive locks during processing, offering performance comparable to CLUSTER. While pg_repack allows online table and index reorganization with minimal application downtime, it's important to consider its limitations. The extension still requires brief exclusive locks during operation and may struggle to complete on tables with high-velocity transactions, potentially impacting database performance. For heavily used tables where full repacking is challenging, consider index-only repacking as an alternative. Best practices include thorough testing in a non-production environment, scheduling during low-traffic periods, and having a monitoring and rollback plan in place. Despite its benefits, be aware of the potential risks and plan accordingly when implementing pg_repack in your PostgreSQL environments.
The VACUUM FULL operation reduces table_len.
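For example, rewriting the demonstration table and re-checking:

```sql
VACUUM FULL pgbench_accounts_test;

SELECT table_len, free_space, free_percent
FROM pgstattuple('pgbench_accounts_test');
-- table_len shrinks because VACUUM FULL rewrites the table compactly
```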
The VACUUM FULL operation returns the wasted space to disk storage, reducing table_len. The following query identifies bloat for the 10 largest tables in your database using pgstattuple.
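One way to write such a query (the schema filter and output columns are choices, not necessarily the post's exact SQL); note that it runs pgstattuple against each of those tables, with the scan cost that implies:

```sql
SELECT c.oid::regclass                         AS table_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS table_size,
       s.dead_tuple_percent,
       s.free_percent
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
CROSS JOIN LATERAL pgstattuple(c.oid) s
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_relation_size(c.oid) DESC
LIMIT 10;
```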
pgstattuple performs a full table scan and may consume significant instance resources such as CPU and I/O, which makes it slow on large tables. Alternatively, the pgstattuple_approx(relation) function provides a faster estimate of these metrics. While it's less resource-intensive than pgstattuple, it can still be heavy on very large tables or busy systems, so consider running it during off-peak hours or on a replica if one is available.
Automating manual vacuum
Regularly monitoring for bloat enables you to identify maintenance needs proactively, before performance suffers. The bloat metrics can also help you fine-tune autovacuum settings to clean up space more aggressively if needed. After you identify the top 10 bloated tables, you can automate the VACUUM operation using the pg_cron extension. pg_cron is a cron-based job scheduler for PostgreSQL that runs inside the database as an extension. It uses the same syntax as regular cron, but lets you schedule PostgreSQL commands directly from the database. The following code is an example of using the pg_cron function cron.schedule to set up a job that runs VACUUM on a specific table every day at 23:00 (GMT).
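A sketch, assuming pg_cron is already enabled on the instance and using the demonstration table from earlier:

```sql
-- pg_cron must be listed in shared_preload_libraries; on Amazon RDS
-- and Aurora this is configured through the DB parameter group
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Schedule a daily VACUUM of one table at 23:00 GMT
SELECT cron.schedule('0 23 * * *', 'VACUUM pgbench_accounts_test');
```

cron.schedule returns a job ID, which you can later pass to cron.unschedule to remove the job.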
After the preceding job runs, you can check its history in the cron.job_run_details table.
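For example (the column selection is a choice):

```sql
SELECT jobid, status, return_message, start_time, end_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 10;
```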
Diagnosing and resolving index bloat
Like tables, indexes in PostgreSQL can accumulate bloat that wastes space and impacts performance. pgstattuple lets you detect index bloat with the pgstatindex function.
The following query shows the index identifier, total index size in bytes, and average leaf density.
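One formulation (pgstatindex supports only B-tree indexes, so this filters on the access method):

```sql
SELECT i.indexrelid::regclass AS index_name,
       s.index_size,
       s.avg_leaf_density
FROM pg_stat_user_indexes i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_am a   ON a.oid = c.relam
CROSS JOIN LATERAL pgstatindex(i.indexrelid) s
WHERE a.amname = 'btree'
ORDER BY s.index_size DESC;
```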
Average leaf density is the percentage of useful data in the leaf pages of the index. Significantly bloated indexes can be rebuilt with REINDEX or pg_repack to eliminate dead space and restore optimal performance. We recommend periodically checking busy, high-churn indexes for bloat.
Assessing index fragmentation
Another valuable use of pgstattuple is identifying index fragmentation. Fragmentation occurs when index pages become scattered due to deletions, updates, and page splits. Heavily fragmented indexes have more dead tuples occupying space inefficiently.
We can check the degree of fragmentation using leaf_fragmentation.
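For example, against the primary key index that pgbench creates:

```sql
SELECT leaf_fragmentation
FROM pgstatindex('pgbench_accounts_pkey');
```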
If leaf_fragmentation is high, the index is likely fragmented and a REINDEX should be considered. Rebuilding eliminates the fragmentation and its associated performance overhead.
Best practices for using pgstattuple
Consider the following best practices when using pgstattuple for PostgreSQL monitoring and maintenance:
- To estimate bloat in PostgreSQL tables, use the check_postgres bloat estimation query on the PostgreSQL wiki.
- Use the pgstattuple extension to analyze the physical storage of database tables, providing detailed statistics on space usage within the database, including how much space is wasted due to bloat.
- Rebuild significantly bloated tables and indexes to reclaim dead space
- Watch for high dead_tuple_percent values to identify bloat and fragmentation issues
- Focus maintenance on tables and indexes critical for workload performance
- Avoid running pgstattuple on highly active tables to prevent interference
- Use pgstattuple metrics to fine-tune autovacuum settings
- Combine pgstattuple with query analysis and logs for holistic database insights
Conclusion
The pgstattuple extension serves as a powerful tool for uncovering crucial diagnostic metrics in PostgreSQL databases, exposing detailed storage statistics that help teams identify and address performance-impacting issues like bloat and index fragmentation. Working seamlessly with Aurora and Amazon RDS for PostgreSQL, the extension provides essential visibility into storage patterns and maintenance requirements. Following pgstattuple best practices is key to maintaining efficient, high-performing PostgreSQL databases. Organizations can further enhance their database management through AWS support options: AWS Enterprise Support, Enterprise On-Ramp, and Business Support customers can use AWS Countdown Premium engagements for optimization guidance, enabling teams to confidently implement best practices and maintain optimal performance while focusing on their core business objectives.
We welcome your questions and feedback in the comments section.
About the Author
Vivek Singh is a Principal Database Specialist Technical Account Manager with AWS focusing on Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL engines. He works with enterprise customers providing technical assistance on PostgreSQL operational performance and sharing database best practices. He has over 17 years of experience in open-source database solutions, and enjoys working with customers to help design, deploy, and optimize relational database workloads on AWS.
Kiran Singh is a Senior Partner Solutions Architect and an Amazon RDS and Amazon Aurora specialist at AWS focusing on relational databases. She helps customers and partners build highly optimized, scalable, and secure solutions; modernize their architectures; and migrate their database workloads to AWS.
Sagar Patel is a Principal Database Specialty Architect with the Professional Services team at Amazon Web Services. As a database migration specialist, he provides technical guidance and helps Amazon customers migrate their on-premises databases to AWS.