Amazon Keyspaces (for Apache Cassandra), launched in 2020, has been a popular choice for customers looking to move from self-managed and open-source Cassandra to a fully managed and highly available database service. Many of these customers have expressed a need for tooling that can migrate their existing Cassandra data to Amazon Keyspaces, simplifying online migrations. To address this demand, AWS Solutions Architects created and open-sourced a utility called CQLReplicator. This utility streamlines the process of migrating data by taking a snapshot of your table and also continuously replicating recent changes from the source Cassandra cluster to Amazon Keyspaces.
With Amazon Keyspaces, you can increase focus on innovation for your Cassandra-backed applications. By migrating to Keyspaces, customers like Intuit reduced p99 latency and production incidents and improved backup resiliency. Customers like Monzo Bank have also been able to scale capacity while maintaining predictable performance and availability with no interruption to services. Moving to Keyspaces is not just rehosting your Cassandra workload; it modernizes the workload onto a serverless, fully managed, highly available, and scalable Cassandra-compatible database service.
However, some customers need assistance migrating data between self-managed Cassandra clusters and Amazon Keyspaces. Importing large datasets efficiently at scale is time-consuming, and traditional migration approaches that rely on change data capture (CDC) have challenges. These challenges include long replication delays, deduplication, and difficulty parsing commit logs. Enabling CDC also requires changes to the source cluster, new sidecar processes, and administrative work to clean up commit log files. All of these challenges take time to plan and operationalize.
In this post, we walk through the steps to set up CQLReplicator and migrate a table from a self-managed Cassandra cluster to Amazon Keyspaces. We demonstrate how to set up, run, and shut down the CQLReplicator job using command line tooling and observe changes flowing through the pipeline in Amazon CloudWatch.
Solution overview
CQLReplicator is an open-source utility built by AWS Solutions Architects that helps you migrate data from self-managed Cassandra to Amazon Keyspaces in near real time with minimal downtime. With CQLReplicator, you can read data in near real time through intelligently scanning the Cassandra token ring. It implements a caching strategy to reduce performance penalties of full scans and removes duplicate replication events to reduce the number of writes to the destination. With CQLReplicator, changes can be replicated in minutes, allowing you to migrate to Keyspaces with minimal downtime.
The following diagram depicts a typical architecture of a CQLReplicator job. CQLReplicator uses AWS Glue to continuously copy data from your self-managed Cassandra cluster. An AWS Glue connection is configured to allow access to Cassandra running in a private VPC. CQLReplicator Glue jobs stream changes from Cassandra to Amazon Keyspaces, using Amazon Simple Storage Service (Amazon S3) for deduplication and key caching.
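To make the token ring scan more concrete, here is a minimal Python sketch of how a ring could be divided into equal, contiguous ranges so that each parallel job scans only its own slice. It assumes the source cluster uses Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1. The names split_token_ring and tile_query are hypothetical and are not taken from the CQLReplicator code base, which may partition the ring differently.

```python
# Hypothetical sketch: split the Murmur3 token ring into contiguous ranges.
# Assumes Murmur3Partitioner; not taken from the CQLReplicator source.
from typing import List, Tuple

MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_token_ring(tiles: int) -> List[Tuple[int, int]]:
    """Return `tiles` inclusive (start, end) token ranges covering the ring."""
    if tiles < 1:
        raise ValueError("tiles must be >= 1")
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 tokens in the ring
    ranges = []
    start = MIN_TOKEN
    for i in range(tiles):
        # Spread the remainder over the first ranges so sizes differ by at most one.
        size = total // tiles + (1 if i < total % tiles else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def tile_query(keyspace: str, table: str, partition_key: List[str],
               token_range: Tuple[int, int]) -> str:
    """Build a CQL statement that reads partition keys within one token range."""
    start, end = token_range
    cols = ", ".join(partition_key)
    return (f"SELECT {cols} FROM {keyspace}.{table} "
            f"WHERE token({cols}) >= {start} AND token({cols}) <= {end}")


if __name__ == "__main__":
    for token_range in split_token_ring(8):
        print(tile_query("my_keyspace", "my_table", ["id"], token_range))
```

Because the ranges are disjoint and together cover the whole ring, each job can scan its slice independently without reading the same partition twice.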
Prerequisites
The CQLReplicator project is open source and available for download on GitHub. You can get started by cloning the repository to your local machine or AWS CloudShell:
To run the examples in this post, you also need the AWS Command Line Interface (AWS CLI) installed and configured to interact with AWS and create resources.
Initialize AWS resources
CQLReplicator uses AWS serverless services to perform the extract, transform, and load (ETL) process along with data deduplication. Amazon S3 serves as the staging area for data during the CDC process. AWS Glue handles the data pipelines. AWS Identity and Access Management (IAM) provides an AWS Glue service role with permissions to read and write data to both Amazon S3 and Amazon Keyspaces. Amazon CloudWatch captures read and write metrics for tables in Amazon Keyspaces. For a quick setup, you can use an AWS CloudFormation template to create an S3 bucket and AWS Glue service role with the necessary permissions to run CQLReplicator.
Use the following commands to run a CloudFormation template to initialize AWS Resources.
Update the connection to the Cassandra cluster
CQLReplicator uses the Cassandra Java driver’s external configuration for connecting to the source Cassandra cluster and the destination Amazon Keyspaces endpoint. The conf directory of the project contains two files: CassandraConnector.conf and KeyspacesConnector.conf. The CassandraConnector.conf file should be used to configure the connection to your self-managed Cassandra cluster, with the same configuration you use for your applications today. The Amazon Keyspaces configuration is preconfigured to work with tables in the AWS Region where you deploy CQLReplicator. Use the following code to update the connection to the cluster:
Initialize CQLReplicator
Now that you have cloned the repository, you can run the init state to set up an AWS Glue connection and deploy the required artifacts to Amazon S3. AWS Glue accesses these artifacts when running the CQLReplicator jobs. The init state takes a few parameters to set up replication from Apache Cassandra to Amazon Keyspaces:
state – CQLReplicator has two states, init and run. The init state initializes the AWS Glue jobs, and the run state runs them.
security-groups – Security groups are required for communication if your self-managed Cassandra cluster is running in your VPC. CQLReplicator creates an AWS Glue connection for you if you provide security groups and a subnet.
skip-glue-connector – Alternatively, you can skip AWS Glue connection creation, which supports architectures where the Cassandra cluster endpoints are reachable from outside a VPC.
region – The Region where the Cassandra cluster and VPC are located.
subnet – A subnet in your VPC hosting Cassandra that will be used to set up an AWS Glue connection.
availability-zone – The Availability Zone of the subnet used for the AWS Glue connection.
glue-iam-role – The IAM service role used by AWS Glue to read and write from Amazon S3 and Amazon Keyspaces.
landing-zone – The S3 bucket used to store AWS Glue artifacts and data processed.
The following command initializes CQLReplicator:
Overview of the discovery and replication process
Running CQLReplicator initializes two AWS Glue jobs: discovery and replication. The discovery job is responsible for collecting all the primary keys with the latest timestamp and persisting them in Amazon S3. If the discovery job has run previously, it compares the old and new primary keys to determine the latest changes. It stores the set of changes in Amazon S3 and records the location of the bucket key in a ledger table stored in Amazon Keyspaces. The replication job scans the ledger table in Amazon Keyspaces for new change sets. When a change set is found, it reads the keys and queries the Cassandra cluster for the latest row data. With the complete primary key and latest row values, the data is inserted into the destination Amazon Keyspaces table. Both the discovery job and the replication job run continuously until you pause or stop the migration.
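The comparison step of the discovery job can be illustrated with a small Python sketch. It assumes each snapshot is a mapping from primary key to the latest write timestamp, and it classifies keys as inserts, updates, or deletes by comparing two snapshots. The function name diff_snapshots and the in-memory dictionaries are hypothetical simplifications; the actual job persists snapshots and change sets in Amazon S3.

```python
# Hypothetical sketch of the discovery comparison; not the CQLReplicator implementation.
from typing import Dict, List, Tuple

Key = Tuple                # primary key values, for example (partition_key, clustering_key)
Snapshot = Dict[Key, int]  # primary key -> latest write timestamp (microseconds)


def diff_snapshots(old: Snapshot, new: Snapshot) -> Dict[str, List[Key]]:
    """Classify primary keys by comparing the previous and current snapshots."""
    inserts = sorted(k for k in new if k not in old)
    # A key counts as updated when its write timestamp moved forward.
    updates = sorted(k for k in new if k in old and new[k] > old[k])
    deletes = sorted(k for k in old if k not in new)
    return {"inserts": inserts, "updates": updates, "deletes": deletes}


if __name__ == "__main__":
    previous = {("a", 1): 100, ("b", 1): 100, ("c", 1): 100}
    current = {("a", 1): 100, ("b", 1): 250, ("d", 1): 300}
    print(diff_snapshots(previous, current))
```

In this example, ("d", 1) is new, ("b", 1) has a newer timestamp, ("c", 1) has disappeared, and ("a", 1) is unchanged, so it produces no replication event. Skipping unchanged keys is what keeps the number of writes to the destination low.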
The following diagram highlights the core architecture of the individual processes used in replication and the flow of data.
Running CQLReplicator discovery and replication
Running CQLReplicator requires additional parameters for the source table, destination table, and a column to capture the time of a write. By default, we recommend eight tiles to spawn one discovery job and eight replication jobs. A tile is synonymous with an AWS Glue job. Each replication job has four G.2X 2 DPU workers by default. The default of eight is acceptable for tables up to 1 TB of data. For larger tables, see the Sizing section later in this post to understand the number of jobs to run. The following parameters are used to launch the CQLReplicator process:
state – CQLReplicator has two states: init and run. The init state initializes the AWS Glue jobs, and the run state runs them.
landing-zone – The S3 bucket used to store AWS Glue artifacts and data processed.
region – The Region where the Cassandra cluster and VPC are located.
src-keyspace – The source keyspace you want to migrate.
src-table – The source table you want to migrate.
trg-keyspace – The target keyspace you want to migrate to.
trg-table – The target table you want to migrate to.
writetime-column – The column in the source table that is used for the write time of the insert into the destination table. Choose a field that is updated for each mutation. You can leave this field out to always overwrite the destination.
inc-traffic – Omitting this flag results in only a single copy without continuous changes.
--override-rows-per-worker – The default is 1 million rows per worker.
--worker-type – You can use this to modify the default AWS Glue worker type. The default is G.025X.
The following command runs the CQLReplicator process:
Results
To demonstrate CQLReplicator migrating data from Cassandra to Keyspaces, we ran the following test. The source table was continuously populated with new rows using the easy-cass-stress test. A target table was created in Amazon Keyspaces with the same model as the source table. CQLReplicator was then configured to copy data from the source table to the target table using the steps mentioned earlier.
The following screenshot shows metrics for CQLReplicator reads from the source table and writes to the target table. The discovery job makes an initial copy of all existing records from the source table, resulting in large reads. The primary keys are stored in multiple Parquet files on Amazon S3, and the migration ledger in Amazon Keyspaces is updated with the S3 object locations. The replication job picks up the changes from the migration ledger, reads the S3 object data, retrieves the latest rows from the source table, and writes those rows into the target table. CQLReplicator continues to scan for new changes to the source table and repeats the replication process.
You can also gather change statistics from CQLReplicator by running the stats mode. The stats mode obtains the number of replicated rows after the initial data loading phase. You can use the following command to run the stats mode. Provide details from the previous command, such as the number of tiles, source and target tables, Region, and landing zone. Additionally, the state should be set to stats, and the flag --replication-stats-enabled should be included.
The following screenshot shows the output of the stats command, providing the number of inserts, updates, and deletes from each job.
Sizing
Size your CQLReplicator based on the estimated number of rows and primary keys in your table. The following example estimates can help you determine the number of tiles to configure based on the table size you want to migrate:
For 1 TB or 1 billion rows, you could deploy 8 tiles with 32 million rows per worker. Set --override-rows-per-worker to 32,000,000.
For 10 TB or 10 billion rows, you could deploy 80 tiles with 32 million rows per worker. Set --override-rows-per-worker to 32,000,000.
This is a starting point; you can adjust the worker size with the --worker-type flag or increase the number of rows per worker using the --override-rows-per-worker flag. Each worker consumes an IP address in your VPC. Make sure you have enough available IP addresses for the number of workers deployed if you’re using an AWS Glue connection. When deploying CQLReplicator, monitor the load placed on the Cassandra cluster before and after deployment. Start with a conservative number of workers and slowly increase as necessary.
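The arithmetic behind these estimates can be checked with a short Python sketch. It assumes rows are spread evenly across tiles and that each replication job runs four workers, as described earlier in this post. The helper names and the even-distribution assumption are illustrative only and are not part of CQLReplicator.

```python
# Hypothetical sizing helpers; assumes rows are spread evenly across tiles and
# four workers per replication job, as stated earlier in this post.

def rows_per_worker(total_rows: int, tiles: int, workers_per_tile: int = 4) -> int:
    """Rows each worker must handle, rounded up."""
    workers = tiles * workers_per_tile
    return -(-total_rows // workers)  # integer ceiling division


def replication_workers(tiles: int, workers_per_tile: int = 4) -> int:
    """Replication workers only; a lower bound on the VPC IP addresses consumed."""
    return tiles * workers_per_tile


if __name__ == "__main__":
    for total_rows, tiles in [(1_000_000_000, 8), (10_000_000_000, 80)]:
        print(f"{total_rows:,} rows, {tiles} tiles: "
              f"{rows_per_worker(total_rows, tiles):,} rows per worker, "
              f"{replication_workers(tiles)} replication workers")
```

Under these assumptions, both examples work out to 31,250,000 rows per worker, which fits within the 32,000,000 override, and they need at least 32 and 320 IP addresses respectively for the replication workers alone.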
Shut down CQLReplicator
To shut down the CQLReplicator program and its associated AWS Glue jobs, run the request-stop state:
Clean up
Complete the following steps to clean up your resources:
Delete the AWS Glue jobs created by CQLReplicator.
Delete the stack using the AWS CloudFormation console.
Delete the migration keyspace using the Amazon Keyspaces console.
Delete any tables created in Amazon Keyspaces used in the migration process.
Conclusion
Amazon Keyspaces is a scalable, fully managed, serverless, Apache Cassandra-compatible database service that provides 99.999% availability. With Amazon Keyspaces, you can modernize your Cassandra workloads and focus on innovating your applications. With CQLReplicator, you can migrate Apache Cassandra workloads to Amazon Keyspaces more easily with minimal downtime.
By following the steps outlined in this post, you can use AWS serverless services to set up and run CQLReplicator jobs, configure the necessary resources, adjust replication settings based on dataset size, and monitor the migration process using CloudWatch. Using CQLReplicator to migrate your data to Amazon Keyspaces will allow you to focus on innovation, accelerate time-to-market for new features, and transform your business operations.
Download the CQLReplicator project today from the GitHub repo to begin your migration.
About the authors
Michael Raney is a Principal Specialist Solutions Architect based in New York. He works with customers to modernize their legacy database workloads to a serverless architecture. Michael has spent over a decade building distributed systems for high-scale and low-latency stateful applications.
Nikolai Kolesnikov is a Principal Data Architect and has over 20 years of experience in data engineering, architecture, and building distributed applications. He helps AWS customers build highly scalable applications using Amazon Keyspaces and Amazon DynamoDB. He also leads Amazon Keyspaces ProServe migration engagements.
Sourav Biswas is a Senior DocumentDB Specialist Solutions Architect at AWS. He has been helping AWS customers successfully adopt Amazon DocumentDB and implement best practices around it. Before joining AWS, he worked extensively as an application developer and solutions architect for various NoSQL vendors.