In PostgreSQL, indexes are vital because they allow efficient access to specific data records based on the values in specified columns. PostgreSQL offers a variety of index types to address different use cases and optimize query performance. Some of the most commonly used index types include: B-tree, GiST (Generalized Search Tree), GIN (Generalized Inverted Index), HASH, BRIN (Block Range Index) indexes. Additionally, PostgreSQL’s index infrastructure is extensible, allowing developers to create custom index types tailored to their specific requirements. This extensibility is achieved through the use of operator classes, which define the behavior of indexes for specific data types and operations.
In Part 1 of this series, we discussed the B-tree index and its various use cases in detail. In this post, we discuss other native indexes supported in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL, including GIN, GiST, HASH, and BRIN, and their use cases.
In the following sections, we discuss the strategies to apply when creating an index depending on your data retrieval needs and use case.
Prerequisites
To follow along, you should have an RDS for PostgreSQL or Aurora PostgreSQL database and a client machine with psql installed and a connection to the database. For more information, refer to Connecting to a DB instance running the PostgreSQL database engine.
Refer to the following script to generate the sample dataset. This script is generating a sample dataset for an employee table with various columns containing different data types like character strings, integers, dates, arrays, JSONB, etc. It’s using PostgreSQL functions like RANDOM(), REPEAT(), GENERATE_SERIES() to generate random sample data which will used throughout this post.
GIN index
A Generalized Inverted (GIN) index is commonly used when the items to be indexed are of composite values and searching is required for an element that appears within the composite items. It’s best suited for applications using JSONB, Arrays and full text search data types and features.
A GIN index also uses various operator classes based on content and which data type and operator class you are using—for example, jsonb_ops, gin_trgm_ops, and more. The operator class identifies the operators to be used by the index for that column, The main reason for having operator classes is that for some data types, there could be more than one meaningful index behavior.
Additionally, GIN indexes only support bitmap index scans due to the fact that they only store parts of the row values in each index page. Bitmap Index Scan means the database is performing the tree traversal and following the leaf node chain to find all matching entries. Once the database finds the matching entries, it will then sort the rows according to the physical storage location of the rows in the table and fetch the rows from the table. That last part is the Bitmap Heap Scan.
Let’s examine some use cases of when a GIN index should be used.
Full text search
A full text search is used to perform natural language searches on documents such as articles, websites, or other written formats. PostgreSQL has specific data types and built-in functions to work with full text search. You can use a GIN index to speed up your full text search. Full text search uses the built-in to_tsvector and to_tsquery PostgreSQL functions. The tsvector type is used to represent a text-searchable document, whereas the tsquery is used to represent text queries in the text query format.
The following query searches for emp_cities with “city4†and “city0â€
Now we create a GIN index on the emp_cities column and rerun the same query:
JSONB
A GIN index is also useful to speed up querying JSON data. There are two operator classes available in PostgreSQL: jsonb_ops and jsonb_path_ops. jsonb_ops is used by default when no operator class is specified while creating a GIN index.
Here, we use the previous example of an employee table, which has a city list stored in JSONB format:
Now we create a GIN index on the emp_address_hist column and rerun the same query:
Arrays
PostgreSQL allows columns of a table to be defined as variable-length, multidimensional arrays. You can create arrays of any built-in or user-defined base type, enum type, composite type, range type, or domain. You can use a GIN index to speed up searching elements within the arrays.
Let’s consider the same employee table, where we have data in the form of arrays in emp_city_list. Let’s query this array to get employee details for an employee in “city5â€:
Now we create a GIN index on the emp_city_list column and rerun the same query:
In all these cases, we saw an improvement in query execution time with a GIN index, Although GIN indexes can speed up various data select operations, data updates in a GIN index are rather slow. It’s important to understand your update frequency to determine if using a GIN index is the best approach for your workload.
GIN indexes provide a few parameters like fastupdate, which we can specify during index creation or update later. This parameter asks the PostgreSQL engine to accumulate updates in a separate unordered list and update the index only when the number of accumulated updates is large enough, as defined in the gin_pending_list_limit configuration parameter.
Here are some scenarios, where using fastupdate for a GIN index can be advantageous:
Frequent Updates: If your application frequently updates the indexed data, enabling fastupdate can significantly reduce the overhead of index maintenance.
Large Datasets: For large datasets where rebuilding the entire index on every update operation is impractical or too slow, fastupdate can provide performance benefits.
High Write Throughput: In scenarios with high write throughput, such as real-time data ingestion or logging systems, fastupdate can help maintain acceptable performance levels.
Mixed Workloads: If your application has a mix of read and write operations, and write operations significantly outnumber read operations, fastupdate can help ensure that write performance is not disproportionately affected by index maintenance.
GiST index
You can use Generalized Search Tree (GiST) indexes in PostgreSQL for various types of queries and operators, such as equality operators, range queries, and partial matches. A GiST index’s mainly used with the range queries where some kind of bounds involve along with full text searches and geospatial.
Like the GIN index, GiST also supports various operator classes to extend its uses, like gist_trgm_ops, gist__int_ops, and more.
Another important feature that GiST supports is the exclusion constraint. The exclusion constraint ensures that given fields of any two table rows don’t correspond to each other in terms of some operators. The exclusion constraint compares two rows in the table, where the user can give conditions such as “no two rows overlap†or “no two rows can be different.â€
Full text search
As with GIN indexes, we can use a GiST index for full text searches. For this post, we use the same example as in the previous section.
In the following code, we use the employee table and create a GiST index instead of a GIN index and rerun the same query:
As we can see, the performance with the GiST index is comparable to the GIN index in the case of a full text search, but a GIN index is preferred for text search because it provides more faster and GiST index on the other hand can be lossy, but you can re-rank the results, to do this you can use the `@@` operator along with a ranking function As we discussed previously, a GIN index is quite expensive when your table has frequent updates. Therefore, when choosing an index for a full text search, you should consider your use case requirements.
Range types
A GiST index is also useful for queries that deal with interval data types like tsrange. To understand this better, let’s consider the employee table and search for details for employee tenure:
HASH index
A HASH index stores a 32-bit hash code derived from the value of the indexed column; therefore, it can have 2³²-1 unique hash codes. You should consider using a HASH index only if the index column is used with the equality (=) operator. This is a common alternative to using a B-tree index for search conditions where only the = operator is used.
A HASH index stores only 4-byte hash values instead of actual column values, so it’s much smaller than a B-tree index.
There are many use cases for HASH index such as Equality Lookups (where you need to find rows with a specific value in an indexed column), Unique Constraint Enforcement, Join Operations (can be useful in optimizing certain types of join operations, particularly hash joins) and many more.
To understand HASH index using example here post, we search an element from an array. In PostgreSQL, you can directly create an index on an array, and if the major application operator involves only the = operator, the HASH index performs better than GIN or B-tree indexes.
For demonstration purposes, we create a composite type array and create B-tree, GIN, and HASH indexes. We examine the cost for each index type for an = operator on the composite type:
As we can observe in this example, the planner cost for a HASH index is 8.02, which is better when compared to the B-tree (8.57) and GIN (24.02) indexes.
HASH indexes may have better performance in some use cases where the array is being compared as a record not for a range of records, but it might not be suitable for tables with a rapidly increasing number of rows. Additionally, duplicate data in a table can reduce HASH index performance.
BRIN (min max) index
A Block Range Index (BRIN) works on a block (page range) of data. A block is group of pages lying physically adjacent on the disk for a table. For each block range, summary information is stored by the index. This information stores the minimum and maximum value for a range of adjacent table pages.
BRIN is typically used for timeseries data, ordered datasets or when data types like integers—dates where the sort order is linear—can be stored as min and max values in the range. BRIN is incredibly helpful in efficiently searching over large time series data and has the benefit of taking up significantly less space on disk than a standard B-tree index. However, scanning an index will return more rows than you asked for, so you must filter out these additional rows as an additional step in planning your query.
For example, if we consider emp_id, which is created as the sequential number using the generate_series() function, and build a BRIN index, we can use a BRIN index scan. However, the runtime (0.512 milliseconds, as seen in the previous example) of the B-tree index will still be less compared to the BRIN index (3.852 milliseconds), but the index size of the BRIN index (384 KB) will be smaller compared to the B-tree index (928 MB).
Clean up
Complete the following steps to clean up the resources and data you created during this post:
Delete your DB cluster or DB instance.
Delete the sample data with the following code:
Conclusion
In this post, we discussed various use cases for GIN, GiST, HASH, and BRIN indexes supported in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL.
If you have similar use cases in your application or workload, you can refer to this post and implement similar solutions.
In upcoming Part 3 of this series, we discuss indexes supported through database extensions like btree-gin, btree-gist, bloom, and SP-GiST.
Leave any thoughts or questions in the comments section.
About the Authors
Sachin Khanna is a Lead Database Consultant on the AWS Professional Services team at AWS. He has extensive experience in relational databases with a passion for Amazon Aurora PostgreSQL. He is a consultant with expertise on performance and optimization. He carries rich experience in database migrations and cost optimization, and has helped customers in their cloud migration journey.
Rajkumar Raghuwanshi is a Database Consultant with AWS Professional Services based out of Pune, India. With a good knowledge of relational databases and hands-on experience in homogenous and heterogenous database migrations, he helps customers migrate to the AWS Cloud and optimize performance.
Sachin Kotwal is a Lead Database Consultant with AWS Professional Services based out of Pune, India. He has more than a decade of experience in relational database administration and architecture. He has worked on many homogenous database migrations and performance optimizations.
Source: Read More