An index is a data structure used by database systems to speed up queries. In most traditional relational database systems, indexes play a vital role in data retrieval. There are different index types, and each type uses a different algorithm that is best suited to different types of queries. A suboptimal index strategy (for example, an incorrect index type, too many indexes, or no indexing at all) can lead to performance degradation in your database.
In this series of posts, we discuss the index types supported in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service (Amazon RDS) for PostgreSQL and their use cases. In this post, we discuss the native B-tree index and its variations. In an upcoming Part 2, we discuss other native indexes, including GIN, GiST, HASH, and BRIN.
Before deciding which index is needed in our database, we need to understand the access patterns and the data that needs to be retrieved using an index. Indexes come with maintenance and space overhead, so we should apply them judiciously. In the following sections, we discuss the strategies to apply when creating an index, depending on your data retrieval needs and use case.
Prerequisites
To follow along, you should have an RDS for PostgreSQL or Amazon Aurora PostgreSQL-compatible database and a client machine with psql installed and a connection to the database. For more information, refer to Connecting to a DB instance running the PostgreSQL database engine.
Refer to the following script to generate the sample dataset. This script generates a sample dataset for an employee table with various columns containing different data types such as character strings, integers, dates, arrays, and JSONB. It uses PostgreSQL functions like RANDOM(), REPEAT(), and GENERATE_SERIES() to generate random sample data, which is used throughout this post.
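The original script is not reproduced here; the following is a minimal sketch of the kind of table and data it generates. The table name, column definitions, and value ranges are assumptions based on the columns referenced later in this post (emp_id, emp_name, emp_dep, emp_age, emp_salary).

-- Minimal sketch of a sample employee table; definitions are assumed, not the original script.
CREATE TABLE public.employee (
    emp_id      INTEGER,
    emp_name    CHARACTER VARYING COLLATE "C",
    emp_dep     CHARACTER VARYING,
    emp_age     INTEGER,
    emp_salary  INTEGER,
    emp_doj     DATE,
    emp_skills  TEXT[],
    emp_details JSONB
);

INSERT INTO public.employee
SELECT n,                                                   -- emp_id
       'emp_' || REPEAT(MD5(RANDOM()::text), 2),            -- emp_name
       'dep_' || (RANDOM() * 10)::int,                      -- emp_dep
       20 + (RANDOM() * 40)::int,                           -- emp_age
       1000 + (RANDOM() * 9000)::int,                       -- emp_salary
       CURRENT_DATE - (RANDOM() * 3650)::int,               -- emp_doj
       ARRAY['skill_' || (RANDOM() * 5)::int],              -- emp_skills
       JSONB_BUILD_OBJECT('grade', (RANDOM() * 5)::int)     -- emp_details
FROM GENERATE_SERIES(1, 1000000) AS n;

ANALYZE public.employee;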
B-tree indexes
A balanced tree (B-tree) is a self-balancing tree data structure that maintains sorted data and allows searches. B-tree indexes can handle equality and range queries on data that can be sorted. You can use a B-tree index whenever an indexed column is involved in a filter condition with one of these operators: <, <=, =, >=, and >. The B-tree indexing method in PostgreSQL is the most commonly used index type, and is the default for many data types.
The following code shows a query explain plan without an index on the emp_id column with the equality operator:
Note: parallel query execution is disabled for this example (SET max_parallel_workers_per_gather = 0).
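The following is a sketch of the statements involved; exact plan output and timings will vary. With no index on emp_id, the plan typically shows a sequential scan.

-- Disable parallel query for this session so plans are easier to compare.
SET max_parallel_workers_per_gather = 0;

-- Without an index on emp_id, the planner reads the whole table (Seq Scan).
EXPLAIN ANALYZE
SELECT * FROM employee WHERE emp_id = 444;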
Now we run the same explain plan after creating an index on emp_id.
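A sketch of the statements, assuming the employee table from the sample script; the index name idx_employee_emp_id is arbitrary.

CREATE INDEX idx_employee_emp_id ON employee (emp_id);

-- The equality filter on emp_id can now be satisfied with an Index Scan.
EXPLAIN ANALYZE
SELECT * FROM employee WHERE emp_id = 444;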
We can observe from the explain plan that the optimizer chooses an index scan over a sequential scan after the index is created on the filter column (WHERE emp_id = 444), and we now get a faster execution time.
In the remainder of this post, we review other specialized B-tree index variations we can implement to get faster query execution times. We also look at scenarios where we can take advantage of a B-tree index to get a faster query response.
Index-only scans
Indexes are pointers to the actual data present on disk and are stored separately from the table’s main data area (the table heap). This means each row retrieval requires fetching data from the index as well as from the table heap. While the index entries for the fetched record (emp_id = 444) are placed close together on disk, the table row could be anywhere in the table heap. The heap-access portion of an index scan therefore involves a lot of random access into the heap, which can be slow. To make an index scan even faster, if all of the columns your query references are present in the index, the rows can be fetched from the index alone, without visiting the table heap. This is referred to as an index-only scan.
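The following sketch reuses the hypothetical idx_employee_emp_id index. Because the query selects only emp_id, which is stored in the index, the planner can use an index-only scan, provided the visibility map is reasonably up to date (VACUUM maintains it).

-- Keep the visibility map current so heap visits can be skipped.
VACUUM ANALYZE employee;

-- Only emp_id is referenced, so the index alone can answer the query (Index Only Scan).
EXPLAIN ANALYZE
SELECT emp_id FROM employee WHERE emp_id = 444;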
Impact of indexing on range operations
A B-tree also works for queries whose WHERE clause uses the BETWEEN ... AND operator, which is the same as combining >= and <=. For example, values greater than or equal to 111 and less than or equal to 222 can also be expressed as values BETWEEN 111 AND 222.
The following code uses a B-tree index for the BETWEEN ... AND operator.
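A sketch of a range query that the same emp_id index can serve; the bounds are placeholders.

-- BETWEEN 111 AND 222 is equivalent to emp_id >= 111 AND emp_id <= 222.
EXPLAIN ANALYZE
SELECT * FROM employee WHERE emp_id BETWEEN 111 AND 222;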
Using a B-tree to speed up ordering
A PostgreSQL B-tree index can produce sorted output, so you can use it to return the "top N" rows of an ordered selection (ORDER BY ... LIMIT N). Because the B-tree returns the data in sorted order, there is no need to sort the entire dataset. This also benefits queries that use aggregates such as MIN and MAX. See the following code:
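A sketch of top-N and aggregate queries that can read directly from the emp_id index instead of sorting the whole table; the limit and column choices are illustrative.

-- Top-N: the index already returns rows in emp_id order, so no separate sort is needed.
EXPLAIN ANALYZE
SELECT * FROM employee ORDER BY emp_id LIMIT 10;

-- MIN/MAX can be answered by reading one end of the B-tree.
EXPLAIN ANALYZE
SELECT MAX(emp_id) FROM employee;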
Using B-tree indexes with similarity queries (LIKE)
You can use a B-tree index for queries that use a pattern-matching operator (LIKE, ~, or !~). This applies only when the database is created using the C collation, or when the column to be indexed is declared with the C collation in the CREATE TABLE statement (emp_name CHARACTER VARYING COLLATE "C"). You can still use a B-tree index for LIKE pattern matching with a non-C locale if you explicitly specify a suitable operator class when creating the index. Refer to Operator Classes and Operator Families for details.
The pattern must be a constant and must be anchored at the beginning of the string, for example, queries with emp_name LIKE 'dub%'. The index will not be used for SQL with emp_name LIKE '%dub'.
Let’s first see an example of a LIKE query without an index.
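A sketch, assuming no index exists yet on emp_name; the pattern value is a placeholder.

-- With no usable index on emp_name, the LIKE query falls back to a Seq Scan.
EXPLAIN SELECT * FROM employee WHERE emp_name LIKE 'dub%';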
Now let’s see an example where we have two employee tables in different schemas, one with emp_name declared with the C collation and one without it.
A LIKE query on public.employee, whose column is declared as emp_name CHARACTER VARYING COLLATE "C", goes through an index scan. A LIKE query on test.employee, whose column is declared as emp_name CHARACTER VARYING without the COLLATE "C" clause, goes through a sequential scan, as shown in the following example.
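The following sketch sets up the two tables and compares the plans. The test schema, index names, and column subset are assumptions for illustration.

-- public.employee: emp_name is declared COLLATE "C", so a plain B-tree index
-- supports the left-anchored pattern and an Index Scan can be chosen.
CREATE INDEX idx_pub_emp_name ON public.employee (emp_name);
EXPLAIN SELECT * FROM public.employee WHERE emp_name LIKE 'dub%';

-- test.employee: emp_name uses the default (non-C) collation, so the equivalent
-- index does not help LIKE and the planner uses a Seq Scan.
CREATE SCHEMA IF NOT EXISTS test;
CREATE TABLE test.employee (
    emp_id   INTEGER,
    emp_name CHARACTER VARYING   -- no COLLATE "C"
);
INSERT INTO test.employee SELECT emp_id, emp_name FROM public.employee;
CREATE INDEX idx_test_emp_name ON test.employee (emp_name);
ANALYZE test.employee;
EXPLAIN SELECT * FROM test.employee WHERE emp_name LIKE 'dub%';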
Multi-column indexes
An index created on multiple columns is called a multi-column index. If some queries in your use case filter on emp_name or on emp_dep, and a few queries involve both emp_name and emp_dep, you can create a multi-column index. A single multi-column index on both columns will generally be used via an index scan for queries involving combinations of these columns with the equality operator.
We can see in the following example that after we create a multi-column index on (emp_name, emp_dep), an index scan is used whether emp_name, emp_dep, or both columns appear in the WHERE clause.
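A sketch of the multi-column index and the three query shapes; the index name and literal values are placeholders.

CREATE INDEX idx_emp_name_dep ON employee (emp_name, emp_dep);

EXPLAIN SELECT * FROM employee WHERE emp_name = 'dube';                        -- leading column only
EXPLAIN SELECT * FROM employee WHERE emp_dep = 'sales';                        -- trailing column only
EXPLAIN SELECT * FROM employee WHERE emp_name = 'dube' AND emp_dep = 'sales';  -- both columns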
However, the order of the columns in the index matters. A multi-column B-tree index is most efficient when there are constraints on the leading (leftmost) columns. For example, given an index on (emp_age, emp_salary, emp_id) and a query condition WHERE emp_age = 45 AND emp_salary >= 1000 AND emp_id < 200, the index would have to be scanned from the first entry with emp_age = 45 and emp_salary = 1000 up through the last entry with emp_age = 45. Index entries with emp_id >= 200 would be skipped, but they’d still have to be scanned through.
Partial indexes
A partial index is an index created on a subset of table data, where the subset is defined by a conditional predicate. One major use case of this index type is to avoid indexing common values. For example, suppose we have the emp_age field, where only a very small percentage of employees are not 35 years old, and we’re only interested in finding these employees and don’t care about the common value (employees who are 35). We can create a partial index on this subset of table data (emp_age <> 35) and use it in queries that filter on the uncommon values (all ages other than 35). Another advantage of this index type is its size, which is smaller than that of an index on the same column without a condition.
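A sketch of such a partial index; the index name and age values are placeholders, and whether the planner uses the index also depends on row estimates.

-- Index only the uncommon values (ages other than 35).
CREATE INDEX idx_emp_age_not_35 ON employee (emp_age) WHERE emp_age <> 35;

-- A query whose predicate implies emp_age <> 35 can use the partial index.
EXPLAIN SELECT * FROM employee WHERE emp_age = 60;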
Another use case is when you’re only fetching specific data of interest, eliminating rows from the index that you don’t want to retrieve (uninteresting values). This is the more common use case of the two. You can also create unique partial indexes for given conditions.
There is an overhead with partial indexes because the common values must be determined in advance. Therefore, partial indexes are best used for data distributions that are well known and don’t change over time. Also, avoid creating a large number of partial indexes, one per condition or range, as a substitute for partitioning.
Functional indexes
A functional index is defined on the result of an immutable function applied to one or more columns of a single table. You can use a functional index to improve the performance of queries that filter on an expression or function result. In the following example, we create a functional index on the column emp_name with the LOWER function:
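A sketch of the expression index; the index name is arbitrary and the literal is a placeholder. The query must filter on the same expression for the index to be used.

CREATE INDEX idx_emp_name_lower ON employee (LOWER(emp_name));

-- The WHERE clause uses the same LOWER(emp_name) expression as the index.
EXPLAIN SELECT * FROM employee WHERE LOWER(emp_name) = 'dube';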
You can also use a functional index to index a column as a data type different from the column’s declared data type.
For example, if the application queries using the BPCHAR type in a WHERE clause but the column (emp_name) is of type VARCHAR or TEXT, then creating the index on the column cast to BPCHAR gives better performance. See the following code:
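A sketch of an expression index on the column cast to BPCHAR; the index name and literal are placeholders. The double parentheses are required because the index key is an expression.

CREATE INDEX idx_emp_name_bpchar ON employee ((emp_name::bpchar));

-- A comparison performed in the BPCHAR type can use this index.
EXPLAIN SELECT * FROM employee WHERE emp_name::bpchar = 'dube';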
Cluster the table on the index
When you need to retrieve data within a range, for example (emp_id > 100 AND emp_id < 200), the response time of this SQL will differ between systems depending on the physical placement of the data on disk. In our example, because we generated the emp_id data using generate_series(), the data is organized by emp_id and lies sequentially on disk. This is called clustering the table data. You can cluster a table on an index of your choice, although a table can be physically ordered according to only one index at a time. Clustering is useful when you want related data to be kept close together so that the cost of retrieving it is lower. If you have an index on the column in the WHERE condition and the cost of execution is still high because the planner chooses a bitmap heap scan, clustering the table on that index can give a performance benefit, because the planner will then choose an index scan.
Let’s look at an example. If we load data into the employee table using the preceding script with an ORDER BY random() clause, we can simulate a real-life non-clustered table (correlation <> 1). We see the query plan change to a bitmap index scan, and it takes longer to fetch this data because it’s not stored sequentially on disk; instead, it’s scattered across the heap.
Note that correlation is a statistical correlation between physical row ordering and logical ordering of the column values. This ranges from -1 to +1. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near zero, due to the reduction of random access to the disk.
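You can check the correlation the planner sees with a query such as the following (the pg_stats view is populated by ANALYZE).

ANALYZE employee;

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'employee' AND attname = 'emp_id';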
If we cluster the table using the index on emp_id (correlation = 1), the planner chooses an index scan over a bitmap index scan because it is aware, through updated statistics, that the data is now stored in correlated order on disk and is faster to fetch.
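A sketch of the clustering step, reusing the hypothetical idx_employee_emp_id index from earlier.

-- Physically reorder the table rows to match the index order,
-- then refresh statistics so the planner sees the new correlation.
CLUSTER employee USING idx_employee_emp_id;
ANALYZE employee;

-- The range query can now be satisfied with a plain Index Scan.
EXPLAIN ANALYZE
SELECT * FROM employee WHERE emp_id > 100 AND emp_id < 200;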
NOTE: When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it. This prevents any other database operations (both reads and writes) from operating on the table until the CLUSTER is finished. Hence, perform this operation during a maintenance window.
When to choose clustering
If you’re requesting a range of indexed values from a table (emp_id > 444 AND emp_id < 555), or a single indexed value that has multiple matching rows (emp_id = 444), clustering the table helps: after the index identifies the table page for the first matching row, all other matching rows are probably already on the same table page, so you save disk reads and speed up the query. The runtime is reduced because the query takes advantage of the clustered data.
Clean up
Complete the following steps to clean up the resources and data you created during this post:
Delete your DB cluster or DB instance.
Delete the sample data with the following query:
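The original cleanup query is not reproduced here; the following sketch drops the sample objects created in the examples above (the names match the hypothetical ones used in this post).

DROP TABLE IF EXISTS test.employee;
DROP SCHEMA IF EXISTS test;
DROP TABLE IF EXISTS public.employee;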
About the Authors
Rajkumar Raghuwanshi is a Database Consultant with AWS Professional Services based out of Pune, India. With a good knowledge of relational databases and hands-on experience in homogenous and heterogenous database migrations, he helps customers migrate to the AWS Cloud and optimize performance.
Sachin Khanna is a Lead Database Consultant on the AWS Professional Services team at AWS. He has extensive experience in relational databases with a passion for Amazon Aurora PostgreSQL. He is a consultant with expertise on performance and optimization. He carries rich experience in database migrations and cost optimization, and has helped customers in their cloud migration journey.
Sachin Kotwal is a Lead Database Consultant with AWS Professional Services based out of Pune, India. He has more than a decade of experience in relational database administration and architecture. He has worked on many homogenous database migrations and performance optimizations.