Parallel indexing in Amazon DocumentDB (with MongoDB compatibility) significantly reduces the time to create indexes. In this post, we show you how parallel indexing works, its benefits, and best practices for implementation.
Amazon DocumentDB is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. You can use the same MongoDB API 3.6, 4.0, and 5.0 application code, drivers, and tools to run, manage, and scale workloads on Amazon DocumentDB without worrying about managing the underlying infrastructure. As a document database, Amazon DocumentDB makes it simple to store, query, and index JSON data.
Although indexes improve query performance, creating indexes can be time-consuming, especially for large collections. Amazon DocumentDB now supports parallel index creation to decrease the time to create indexes. Parallel indexing uses multiple concurrent workers to scan the collection, which is the longest stage of the index creation process. In this post, we show you how parallel indexing can reduce the time to create new indexes by up to 14X.
Parallel index creation reduces the time needed to create indexes by using multiple CPU cores. While the build runs, it temporarily strains CPU and I/O resources, which can impact existing operations. Review your cluster’s existing resource consumption when deciding the degree of parallelism for this feature, and scale up the writer node if needed for the operation.
How to use parallel indexing
Parallel indexing is currently supported on Amazon DocumentDB version 4.0 and higher instance-based clusters, with instance types of 2xlarge and above.
To create an index in parallel, specify the workers option in the createIndexes command. The workers option sets the number of workers used to build the index. The default value is 2, but you can specify a higher value to speed up the build, up to 50% of the vCPU count of the primary instance. For example, to build an index in parallel using four workers, set workers to 4 in the createIndexes command.
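As a minimal sketch of such a request, the following JavaScript builds a createIndexes command document with a workers value and enforces the 50% vCPU cap described above. The helper names (maxWorkers, buildCreateIndexesCommand), the collection name orders, and the field customerId are hypothetical. Placing workers inside the index specification is an assumption that you should confirm against the Amazon DocumentDB documentation.

```javascript
// Upper bound on workers: 50% of the primary instance's vCPU count.
function maxWorkers(vcpuCount) {
  return Math.floor(vcpuCount / 2);
}

// Build a createIndexes command document for a single-field ascending index.
// Assumption: the workers option sits inside the index specification.
function buildCreateIndexesCommand(collection, field, workers, vcpuCount) {
  const limit = maxWorkers(vcpuCount);
  if (!Number.isInteger(workers) || workers < 1 || workers > limit) {
    throw new RangeError(`workers must be an integer from 1 to ${limit}`);
  }
  return {
    createIndexes: collection,
    indexes: [{ key: { [field]: 1 }, name: `${field}_1`, workers: workers }],
  };
}

// Hypothetical example: four workers on a 16-vCPU primary instance.
const command = buildCreateIndexesCommand("orders", "customerId", 4, 16);
// In the mongo shell, this document could be passed to db.runCommand(command).
```

Validating the worker count before sending the command keeps a request from exceeding the documented limit on smaller instances, such as an 8-vCPU 2xlarge, where the cap would be four workers.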
Results
We conducted tests on Amazon DocumentDB 5.0 with different numbers of workers to assess the effectiveness of parallel indexing on a db.r6g.4xlarge instance. The following graph shows the index creation performance for a 256 GB dataset of 1 KB documents, with an index created on a single field.
These results demonstrate the performance improvement achieved with the new parallel indexing feature of Amazon DocumentDB. Across the tested worker counts, index creation was 1.46 to 7.42 times faster.
Best practices
Using multiple workers can significantly reduce the time to create new indexes. However, it’s important to choose a number of workers that is appropriate for your workload and infrastructure.
Additionally, you can monitor the progress of the indexing process by running the db.currentOp() command in the mongo shell. Index creation details are available in Amazon DocumentDB 5.0 and higher. The command returns output like the following screenshot.
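One way to narrow db.currentOp() output to index builds is to match operations whose command document contains createIndexes. The sketch below assumes that in-progress index builds are listed in the inprog array with a command.createIndexes field. The helper name findIndexBuilds and the sample documents are hypothetical and are not actual Amazon DocumentDB output.

```javascript
// Filter document that matches operations running a createIndexes command.
// In the mongo shell, this could be passed as db.currentOp(indexBuildFilter).
const indexBuildFilter = { "command.createIndexes": { $exists: true } };

// Client-side equivalent: pick index builds out of a currentOp() result.
// Assumption: each operation is listed under inprog with its command document.
function findIndexBuilds(currentOpResult) {
  const ops = currentOpResult.inprog || [];
  return ops.filter(
    (op) =>
      op.command !== undefined &&
      op.command !== null &&
      op.command.createIndexes !== undefined
  );
}

// Hypothetical sample shaped like a currentOp() result, for illustration only.
const sample = {
  inprog: [
    { opid: 1, command: { find: "orders" } },
    { opid: 2, command: { createIndexes: "orders" } },
  ],
};
const builds = findIndexBuilds(sample);
```

Filtering on the server with the filter document avoids transferring unrelated operations, while the client-side helper is useful when you already hold a full currentOp() result.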
Finally, if possible, try to build indexes during off-peak hours to minimize the impact on your application.
Conclusion
In this post, we showed you how parallel indexing can reduce the time to create new indexes.
Parallel indexing is a new feature in Amazon DocumentDB that offers a straightforward way to speed up index creation on large collections.
Parallel indexing is available at no additional cost in all AWS Regions where Amazon DocumentDB is available. To learn more, refer to Managing Amazon DocumentDB indexes.
About the Authors
Srikar Kasireddy is a Database Specialist Solutions Architect at Amazon Web Services. He works with our customers to provide architecture guidance and database solutions, helping them innovate using AWS services to improve business value.
Tim Callaghan is a Principal DocumentDB Specialist Solutions Architect at AWS. He enjoys working with customers looking to modernize existing data-driven applications and build new ones. Prior to joining AWS, he was both a producer and consumer of relational and NoSQL databases for over 30 years.