Amazon DynamoDB supports incremental exports to Amazon Simple Storage Service (Amazon S3), which enables a variety of use cases for downstream data retention and consumption.
In this post, we show you how to maintain a continuously updating export of your table data by performing a bootstrap full export followed by an ongoing series of incremental exports. DynamoDB Continuous Incremental Exports (DCIE), the name of this solution, is available as open source on GitHub and enables use cases that require an export of DynamoDB data that is kept fresh.
One use case for DCIE is offline analytics against a DynamoDB table. As the table size grows, doing repeated full exports can be more expensive than doing a single full export followed by a series of incremental exports. With this solution, you can keep an Apache Iceberg table fresh.
Another DCIE use case is to replay changes from one DynamoDB table to another DynamoDB table. You first bootstrap the new table from the full export and then play forward the series of incremental exports until the new table is up to date. The new table can be in the same AWS account, in a different AWS account, or even in another AWS Region.
Note that DCIE only creates the series of exports. Handling those exports by feeding them to an Iceberg table or loading them into another DynamoDB table is out of scope for this post.
Continuous exports
This section explains how the workflow operates. It first makes a full export to Amazon S3. This captures the table contents as of a specific point in time. The following diagram shows a full export made at a recent time point (called t=0).
It then performs periodic incremental exports, running at 15-minute intervals by default. Each incremental export captures the changes that happened between two points in time (here, from the end of the full export to 15 minutes later). An incremental export is somewhat like a diff you see when coding. New exports then run for each subsequent time period.
The time periods need to match up exactly to ensure no gaps in coverage. The end time of the full export must be the start time of the first incremental. The end time of the first incremental export must be the start time of the second incremental export, and so on. This makes sure the full table contents are represented in the series of exports.
As shown in the following figure, an incremental export started at t=1 might take a while to complete and finish after t=2, which is OK. Two exports can run at the same time. DynamoDB exports include metadata so you can determine when an export has completed successfully. The export metadata also lets downstream consumers process only successful exports, in the correct chronological order.
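To make the window alignment concrete, the following minimal sketch uses the DynamoDB export APIs directly with boto3, outside of DCIE itself. It starts a full export and then an incremental export whose window begins exactly where the full export ended. The table ARN, bucket name, prefixes, and timestamps are placeholder assumptions.

```python
from datetime import datetime, timedelta, timezone

import boto3

dynamodb = boto3.client("dynamodb")

# Placeholder assumptions -- substitute your own table ARN and bucket.
TABLE_ARN = "arn:aws:dynamodb:us-east-1:123456789012:table/MyTable"
BUCKET = "my-export-bucket"

# Pick a point in time in the recent past (within the PITR window) as t=0.
t0 = datetime.now(timezone.utc) - timedelta(minutes=30)
t1 = t0 + timedelta(minutes=15)  # end of the first incremental window

# Bootstrap full export capturing the table contents as of t=0.
full = dynamodb.export_table_to_point_in_time(
    TableArn=TABLE_ARN,
    S3Bucket=BUCKET,
    S3Prefix="full/",
    ExportFormat="DYNAMODB_JSON",
    ExportType="FULL_EXPORT",
    ExportTime=t0,
)

# First incremental export: its window starts exactly at t=0 so there is
# no gap between the full export and the series of incrementals.
incremental = dynamodb.export_table_to_point_in_time(
    TableArn=TABLE_ARN,
    S3Bucket=BUCKET,
    S3Prefix="incremental/",
    ExportFormat="DYNAMODB_JSON",
    ExportType="INCREMENTAL_EXPORT",
    IncrementalExportSpecification={
        "ExportFromTime": t0,  # start = end of the previous export
        "ExportToTime": t1,    # end = start of the next incremental window
        "ExportViewType": "NEW_AND_OLD_IMAGES",
    },
)

# Exports run asynchronously; their status (IN_PROGRESS, COMPLETED, FAILED)
# is available from the export metadata.
for export in (full, incremental):
    arn = export["ExportDescription"]["ExportArn"]
    status = dynamodb.describe_export(ExportArn=arn)["ExportDescription"]["ExportStatus"]
    print(arn, status)
```

DCIE issues the equivalent calls from its Step Functions workflow and records each window's end time so the next run can start from it.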
Solution overview
DCIE uses Amazon EventBridge Scheduler to schedule the repeating workflow and AWS Step Functions to orchestrate the work within each run. Step Functions makes it straightforward to design, customize, run, visualize, manage, retry, and debug a distributed application like this.
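As a rough sketch of that wiring (the actual solution provisions it through the AWS CDK), the following boto3 call creates an EventBridge Scheduler schedule that starts a Step Functions state machine every 15 minutes. The schedule name, state machine ARN, and IAM role ARN are placeholder assumptions.

```python
import boto3

scheduler = boto3.client("scheduler")

# Placeholder ARNs -- the real solution creates these resources during deployment.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:dcie-export"
SCHEDULER_ROLE_ARN = "arn:aws:iam::123456789012:role/dcie-scheduler-role"

scheduler.create_schedule(
    Name="dcie-incremental-export",
    ScheduleExpression="rate(15 minutes)",  # matches the default export window
    FlexibleTimeWindow={"Mode": "OFF"},     # fire exactly on schedule
    Target={
        "Arn": STATE_MACHINE_ARN,           # templated Step Functions StartExecution target
        "RoleArn": SCHEDULER_ROLE_ARN,      # role that allows states:StartExecution
    },
)
```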
The deployment of the workflow needs five key input values: the stack name (allowing you to deploy the solution multiple times for multiple DynamoDB tables), the name of the source DynamoDB table, a deployment alias (allowing for easy table-to-infrastructure mapping), an email address for receiving export success notifications, and an email address for receiving export failure notifications. The Step Functions state machine maintains its state using Parameter Store, a capability of AWS Systems Manager.
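The following minimal sketch shows the general idea of chaining export windows through Parameter Store so each run picks up exactly where the previous one ended; the parameter name and value format here are illustrative assumptions, not the names DCIE actually uses.

```python
from datetime import datetime, timedelta

import boto3

ssm = boto3.client("ssm")

# Hypothetical parameter name -- DCIE defines its own naming scheme.
PARAM_NAME = "/dcie/my-table/last-export-end-time"

# Read the end time of the previous export; it becomes the start of this window.
previous_end = datetime.fromisoformat(
    ssm.get_parameter(Name=PARAM_NAME)["Parameter"]["Value"]
)
window_end = previous_end + timedelta(minutes=15)

# ... start the incremental export for the window [previous_end, window_end) ...

# Persist the new end time so the next scheduled run continues from it.
ssm.put_parameter(
    Name=PARAM_NAME,
    Value=window_end.isoformat(),
    Type="String",
    Overwrite=True,
)
```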
You can deploy DCIE using the AWS Cloud Development Kit (AWS CDK), providing just the preceding input values for customization. You can also provide optional configurations, such as a custom S3 bucket, an S3 prefix, a longer time window than the 15-minute default, or a completion-check interval that polls more or less frequently than the 10-second default.
The Step Functions workflow starts with some sanity checking to make sure the named table exists and has point-in-time recovery (PITR) enabled. PITR is a prerequisite for the export-to-S3 functionality.
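A minimal sketch of that kind of sanity check, using boto3 directly rather than the solution's actual Step Functions tasks, and assuming a hypothetical table name:

```python
import boto3

dynamodb = boto3.client("dynamodb")
TABLE_NAME = "MyTable"  # placeholder table name

# Confirm the table exists (raises ResourceNotFoundException otherwise).
dynamodb.describe_table(TableName=TABLE_NAME)

# Confirm point-in-time recovery is enabled; exports to S3 require PITR.
pitr_status = dynamodb.describe_continuous_backups(TableName=TABLE_NAME)[
    "ContinuousBackupsDescription"
]["PointInTimeRecoveryDescription"]["PointInTimeRecoveryStatus"]

if pitr_status != "ENABLED":
    raise RuntimeError(f"PITR must be enabled on {TABLE_NAME} before exporting")
```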
The workflow has two main trees: one to perform the initial full export and another to perform all the subsequent incremental exports. Each tree initiates the export, checks regularly for completion or errors, and handles those outcomes by updating the state and sending emails as configured. You can modify the workflows yourself if you have specific needs. The following figure is a simplified logical representation of the workflow.
Please see the README in the GitHub repository for more details about deploying DCIE.
Costs
The cost of deploying DCIE includes:
Ongoing charges for enabling PITR, if it’s not already enabled (based on the size of the table)
The initial full export (based on the size of the table)
Each repeated incremental export (each one based on the size of the data processed, which is proportional to the number of changes during the time window)
Amazon S3 costs to write the objects and store the data (the storage naturally grows over time)
Costs for using AWS Lambda, Amazon Simple Notification Service (Amazon SNS), and Amazon CloudWatch Logs
Conclusion
The new incremental export to Amazon S3 feature in DynamoDB enables you to easily export data in your DynamoDB tables to downstream data consumers. In this post, we presented an open source solution that keeps an S3 bucket continuously updated, first with a full export and then with an ongoing series of incremental exports. You can use it to feed a downstream Iceberg table, feed a second DynamoDB table, or even copy the data to a remote S3 bucket as part of a disaster recovery plan to recreate a table in a remote Region.
To learn more about DynamoDB export to S3, please see our Documentation.
If you have any feedback or questions, leave them in the comments.
About the authors
Ruskin Dantra is a Solutions Architect based out of California. He is originally from the Land of the Long White Cloud, New Zealand and is an 18-year veteran in application development with a love for networking. His passion in life is to make complex things simple using AWS.
Jason Hunter is a California-based Principal Solutions Architect specializing in Amazon DynamoDB. He’s been working with NoSQL databases since 2003. He’s known for his contributions to Java, open source, and XML. You can find more DynamoDB posts and other posts written by Jason Hunter in the AWS Database Blog.
Shahzeb Farrukh is a Seattle-based Senior Product Manager at AWS DynamoDB. He works on DynamoDB’s data protection features like backups and restores, and data movement capabilities that help customers integrate their data with other services. He has been working with databases and analytics since 2010.