Data lakes are a common pattern for building a modern data platform. They offer organizations a centralized way to store, process, and secure data. Data teams need to make sure the data lake provides high-quality data to serve a range of use cases, including data visualization, real-time analytics, and machine learning (ML), so the organization can generate business value. At the same time, data teams need to make sure the data lake reflects changes to data and schemas in the underlying source systems.
In this post, we show you how to build data pipelines to replicate data from Microsoft SQL Server to a data lake in Amazon Simple Storage Service (Amazon S3) using AWS Database Migration Service (AWS DMS). You can extend the solution presented in this post to other database engines like PostgreSQL, MySQL, and Oracle. These data pipelines are optimized for both cost and performance and implement features like schema evolution to automate manual tasks. Schema evolution is the process of modifying the structure of a database schema over time to accommodate evolving business requirements. After the data lands in the data lake, we build extract, transform, and load (ETL) pipelines that transform the data into data marts to serve different use cases.
This is a three-part series in which we discuss the end-to-end process of building a data lake from a legacy SQL Server database. You can check out the next posts in the series:
- Part 2: Build a data lake using AWS DMS data on Apache Iceberg
- Part 3: Lake house data processing layer
Solution overview
The following diagram shows the architecture of this solution.
Choosing the right tools and tech stack for a data lake is critical to building a scalable solution and delivering data with a short time to market. In this post, we go over the process of building a data lake using AWS Glue ETL jobs and Amazon Redshift Spectrum, provide the rationale behind the different decisions, and share best practices for building such a platform.
AWS DMS is a managed migration and replication service that helps you move database and analytics workloads to AWS quickly, securely, and with minimal downtime and zero data loss. AWS DMS can migrate existing data (full load) and replicate ongoing changes from SQL Server databases. Refer to Sources for AWS DMS for the supported versions and editions. AWS DMS works by setting up a source and a destination for the replication and running a replication task on an AWS DMS replication instance. The replication instance can be either provisioned or serverless. AWS DMS Serverless provides automatic provisioning, scaling, and built-in high availability. In this example, we use AWS DMS Serverless to replicate the data. Refer to Supported Engine Versions to see the supported database types, and review the limitations of using SQL Server as a source and Amazon S3 as a target.
In our setup, we have configured the following components:
- Source endpoint – This is the database connection information pointing to the source Microsoft SQL Server database. The AWS DMS replication instance uses this information to connect to the source database.
- Target endpoint – This is the S3 bucket where the data will arrive. The AWS DMS replication instance uses this information to connect to the target S3 bucket. Refer to Endpoint settings when using Amazon S3 as a target for AWS DMS for the different options you can use to configure your S3 target.
- AWS DMS replication task – The replication task runs on the replication instance and migrates data from the source endpoint to the target endpoint. The task type is full load and change data capture (CDC). AWS DMS Serverless automatically scales the capacity units based on the replication load.
AWS DMS can start a full data load while capturing changes on the source. The full load migration phase populates the target with a copy of the source data. After the full load is complete, captured changes are applied to the target. When the changes have been applied, the task goes into ongoing CDC mode, which scans the source for new data or schema changes. AWS DMS is able to capture schema changes on the source, such as the addition of new columns or tables, and apply these to the destination without the need for any additional code or manual steps. For more details, refer to DDL statements supported by AWS DMS.
When the target of AWS DMS is set to Amazon S3, incoming schema changes aren’t written to the destination files. Instead, AWS DMS writes the new schema definition to the CDC files. AWS DMS supports DDL commands for CDC; refer to Limitations to using Amazon S3 as a target for the complete list of supported DDL commands. When a column is added, dropped, or renamed on the source database, no ALTER statement is recorded in the target S3 bucket, and AWS DMS doesn’t alter previously created records to match the new structure. After the change, AWS DMS creates any new records using the new table structure.
AWS DMS stores the data in Amazon S3 as objects in CSV or Parquet format. AWS DMS creates two types of files:
- Full load files – These contain a dump of the entire table data, written as one or more files per table in the database. You can control the maximum file size by using the MaxFileSize endpoint setting.
- Transaction files – These files contain the new data that was loaded or changed in the database either during or after the full load. Each transaction file includes the table’s column headers as attributes. By default, AWS DMS stores the changes to each database table without a transaction order. To capture the changes in transaction order, refer to Capturing data changes (CDC) including transaction order on the S3 target.
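To make the file layout concrete, the following is a minimal Python sketch of how a downstream job might read one of these transaction files and split the records by operation. It assumes the endpoint settings used later in this post (a header row added via AddColumnName, an Op column, and a timestamp column added via TimestampColumnName); the S3 path, the dms_timestamp column name, and the BusinessEntityID key are placeholders for illustration.

```python
import pandas as pd  # reading s3:// paths also requires the s3fs package

# Hypothetical path to one AWS DMS transaction (CDC) file; adjust to your bucket and folder layout.
cdc_file = "s3://my-dms-target-bucket/raw/Person/Person/20240101-120000000.csv"

# Assumes AddColumnName is enabled on the S3 target endpoint, so the file has a header row.
# "Op" holds the operation (I = insert, U = update, D = delete) and "dms_timestamp" is the
# column added through the TimestampColumnName endpoint setting.
df = pd.read_csv(cdc_file)

inserts = df[df["Op"] == "I"]
updates = df[df["Op"] == "U"]
deletes = df[df["Op"] == "D"]

# Keep only the most recent change per key (BusinessEntityID is the AdventureWorks Person key).
latest = df.sort_values("dms_timestamp").groupby("BusinessEntityID").tail(1)

print(len(inserts), len(updates), len(deletes), len(latest))
```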
After the files are loaded into Amazon S3, you can load this data into the data lake, which we discuss in later posts in this series.
In this solution, we use a SQL Server database on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Relational Database Service (Amazon RDS). The same architecture is applicable to databases hosted on an on-premises data center with connectivity to the AWS Cloud.
The following diagram illustrates the solution architecture.
Prerequisites
To follow the solution, you need the following prerequisites:
- An AWS Identity and Access Management (IAM) user with AWS Command Line Interface (AWS CLI) access to an AWS account.
- IAM privileges to do the following:
- Create an AWS DMS replication instance, endpoints, and replication tasks.
- Create the S3 bucket.
- Microsoft SQL Server Developer, Standard, or Enterprise Edition, either on premises, on an EC2 instance, or in Amazon RDS, with the AdventureWorks sample database. Refer to Sources for AWS DMS for supported versions and editions of different source databases. If you have an unsupported edition of SQL Server on premises, you can upgrade to a supported edition.
- SQL Server Management Studio (SSMS) installed on your local machine or an EC2 instance, with connectivity to the source database. This is required to run T-SQL queries on the database.
- Connectivity validated between the source SQL Server and AWS DMS replication instance. If you’re using AWS Secrets Manager to connect to the source database, you need a Secrets Manager VPC endpoint.
- With Amazon S3 as the target, Full LOB mode is not supported as of this writing. Find the maximum LOB size in the source database; this value is used when creating the replication task in Limited LOB mode.
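You can estimate the maximum LOB size directly from the source database. The following is a minimal sketch, using Python and pyodbc (the same T-SQL can be run from SSMS), that finds LOB-capable columns and measures the largest value in each; the connection string is a placeholder, and you should validate the result against your own schema before using it for the task setting.

```python
import pyodbc

# Placeholder connection details; replace with your own server, credentials, and database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-sqlserver-host,1433;DATABASE=AdventureWorks;UID=dms_user;PWD=example"
)
cursor = conn.cursor()

# Find columns whose data types can hold LOBs: varchar(max)/nvarchar(max)/varbinary(max)
# (max_length = -1) plus the legacy text, ntext, image, and xml types.
cursor.execute("""
    SELECT s.name, t.name, c.name
    FROM sys.columns c
    JOIN sys.tables t  ON c.object_id = t.object_id
    JOIN sys.schemas s ON t.schema_id = s.schema_id
    JOIN sys.types ty  ON c.user_type_id = ty.user_type_id
    WHERE (ty.name IN ('varchar', 'nvarchar', 'varbinary') AND c.max_length = -1)
       OR ty.name IN ('text', 'ntext', 'image', 'xml')
""")

max_lob_kb = 0
for schema_name, table_name, column_name in cursor.fetchall():
    # DATALENGTH returns the size in bytes of the largest value stored in the column.
    cursor.execute(
        f"SELECT MAX(DATALENGTH([{column_name}])) FROM [{schema_name}].[{table_name}]"
    )
    size_bytes = cursor.fetchone()[0] or 0
    max_lob_kb = max(max_lob_kb, -(-size_bytes // 1024))  # round up to whole KB

print(f"Largest LOB found: about {max_lob_kb} KB; use this for Limited LOB mode.")
```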
Create the AWS DMS source endpoint
The first step to replicate the data is to create the endpoints pointing to your source and target database or data store. In this scenario, we use a SQL Server Standard Edition database as the source. The following steps demonstrate how to create a source endpoint:
- On the AWS DMS console, choose Endpoints in the navigation pane.
- Choose Create endpoint.
- Select Source endpoint for Endpoint type.
- Because we’re using an RDS DB instance as our source, select Select RDS DB instance.
- For RDS instance, choose your source RDS instance for replicating data.
- For Endpoint identifier, enter a name for your endpoint.
- For Source engine, choose Microsoft SQL Server.
- For Access to endpoint database, select AWS Secrets Manager.
You can also provide user credentials with a user name and password manually. For this post, we use Secrets Manager, and the connection information is already stored in Secrets Manager. For more information about creating a secret, refer to Create an AWS Secrets Manager database secret.
- For IAM role, enter the ARN of the IAM role that has access to Secrets Manager.
For more information about the appropriate permissions to replicate data using AWS DMS, refer to Using secrets to access AWS Database Migration Service endpoints.
- For Secure Socket Layer (SSL) mode, choose None.
Alternatively, you can encrypt connections for the source and target endpoints by using SSL. Refer to Using SSL with AWS Database Migration Service for details.
- For Database name, enter the name of your source database (for this post, AdventureWorks).
For more information about creating your source endpoint, refer to Creating source and target endpoints and Using SQL Server as source.
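If you prefer to script the source endpoint instead of using the console, the following is a minimal boto3 sketch of the equivalent create_endpoint call. The Secrets Manager secret and IAM role ARNs are placeholders, and it assumes the secret and access role already exist.

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Placeholder ARNs; replace with your Secrets Manager secret and the IAM role
# that allows AWS DMS to read that secret.
response = dms.create_endpoint(
    EndpointIdentifier="sqlserver-source",
    EndpointType="source",
    EngineName="sqlserver",
    DatabaseName="AdventureWorks",
    SslMode="none",
    MicrosoftSQLServerSettings={
        "SecretsManagerSecretId": "arn:aws:secretsmanager:us-east-1:111122223333:secret:dms-source",
        "SecretsManagerAccessRoleArn": "arn:aws:iam::111122223333:role/dms-secrets-access",
    },
)
print(response["Endpoint"]["EndpointArn"])
```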
Create the AWS DMS target endpoint
Complete the following steps to create the target endpoint:
- On the AWS DMS console, choose Endpoints in the navigation pane.
- Choose Create endpoint.
- Select Target endpoint for Endpoint type.
- For Endpoint identifier, enter a name for your endpoint.
- For Target engine, choose Amazon S3, because we’re moving our data to Amazon S3 in this use case.
- For Amazon Resource Name (ARN) for service access role, enter the ARN for the IAM role that has access to the target bucket.
For more information about creating this role, refer to Prerequisites for using Amazon S3 as a target.
- For Bucket name, enter the name of the S3 bucket.
If the bucket doesn’t exist already, create a new bucket.
- For Bucket folder, enter the name of the folder.
AWS DMS creates a folder for each source table containing data under the specified bucket folder and writes the full load and CDC CSV files to this path.
- For Endpoint settings, choose the settings as shown in the following screenshot.
These settings determine the format of the CSV files that AWS DMS writes to Amazon S3. The TimestampColumnName setting is used to filter recent transactions. For information about the endpoint settings, refer to Endpoint settings when using Amazon S3 as a target for AWS DMS. In our scenario, we create the files in CSV format with operations (I for insert, U for update, D for delete) included for CDC files. For more details, refer to Indicating source DB operations in migrated S3 data. We capture the source database changes and preserve the transaction order. For more information, refer to Capturing data changes (CDC) including transaction order on the S3 target.
For more information about creating your target endpoint, refer to Using Amazon S3 as a target for AWS Database Migration Service.
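The S3 target endpoint can be scripted the same way. The following boto3 sketch mirrors the settings discussed above (CSV output, a header row, an operation column, a timestamp column, and transaction order preserved); the bucket, folder, role ARN, and timestamp column name are placeholders, so check each key against the S3 endpoint settings documentation before relying on it.

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Placeholder bucket, folder, and service access role ARN.
response = dms.create_endpoint(
    EndpointIdentifier="s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::111122223333:role/dms-s3-access",
        "BucketName": "my-dms-target-bucket",
        "BucketFolder": "raw",
        "DataFormat": "csv",
        "AddColumnName": True,           # write a header row in each file
        "IncludeOpForFullLoad": True,    # include the Op column in full load files as well
        "TimestampColumnName": "dms_timestamp",
        "PreserveTransactions": True,    # keep CDC changes in transaction order
        "CdcPath": "cdc",                # folder for transaction-ordered CDC files
    },
)
print(response["Endpoint"]["EndpointArn"])
```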
Enable CDC on the source database
To enable CDC on your source SQL Server database, the database needs to use the FULL or BULK_LOGGED recovery model. For more information, refer to Prerequisites for using ongoing replication (CDC) from a SQL Server source.
Depending on the SQL Server edition, there are different ways to enable CDC for SQL Server on Amazon EC2 or on premises. AWS DMS can automatically enable CDC on the source database; for this, add the extra connection attribute SetUpMsCdcForTables=true to the source endpoint.
When using SQL Server on Amazon RDS as your source, refer to Setting up ongoing replication on a cloud SQL Server DB instance to enable CDC.
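As a reference, the following sketch shows the underlying T-SQL, run here through Python and pyodbc, although you can run the same statements from SSMS. The database-level and table-level procedures are the standard SQL Server CDC procedures; the RDS-specific call is commented out because it applies only to Amazon RDS for SQL Server. The connection string, schema, and table names are placeholders.

```python
import pyodbc

# Placeholder connection details; the login needs the privileges required to enable CDC.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-sqlserver-host,1433;DATABASE=AdventureWorks;UID=admin;PWD=example",
    autocommit=True,
)
cursor = conn.cursor()

# Self-managed SQL Server (Amazon EC2 or on premises): enable CDC at the database level ...
cursor.execute("EXEC sys.sp_cdc_enable_db")

# ... and then for each table you want to replicate.
cursor.execute("""
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'Person',
         @source_name   = N'Person',
         @role_name     = NULL
""")

# Amazon RDS for SQL Server uses an RDS-specific procedure at the database level instead:
# cursor.execute("EXEC msdb.dbo.rds_cdc_enable_db 'AdventureWorks'")
```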
Configure AWS DMS Serverless replication
For data migration, we can use either a provisioned AWS DMS replication instance or AWS DMS Serverless. In our scenario, we use AWS DMS Serverless. The advantage of AWS DMS Serverless is that it automatically provisions and scales the capacity for data migration and ongoing replication. For more information, refer to Supported Engine Versions. For more details on how AWS DMS Serverless works, refer to AWS DMS Serverless components.
The following steps walk through how to set up AWS DMS Serverless replication. For complete details on the steps, refer to Creating a serverless replication.
- On the AWS DMS console, under Migrate Data in the navigation pane, choose Serverless replications.
- Choose Create replication.
- For Name, enter a name for your replication.
- For Source database endpoint, choose the source endpoint you created.
- For Target database endpoint, choose the target endpoint you created.
- For Replication type, select Full load and change data capture (CDC).
- For Target table preparation mode, select Drop tables on target.
- For Stop task after full load completes, select Don’t stop.
- Because AWS DMS doesn’t support full LOB mode for Amazon S3 as target, for Include LOB columns in replication, select Limited LOB mode and enter the maximum LOB size of your database.
- Select Turn on CloudWatch logs.
- In the Table mappings section, provide the schema name and table name you want to replicate. In our case, we are replicating all schemas and tables (% is a wildcard to replicate all tables and schemas).
- In the Compute settings section, choose the VPC and subnet group that the replication instance can use.
- Choose the VPC security group that has access to the source database and target S3 bucket. This is required for connectivity for the replication instance to source and targets.
- Under Availability, choose Single-AZ for the replication instance. You can use Multi-AZ for high availability when replicating production databases.
- Under Capacity, choose the minimum and maximum AWS DMS capacity units.
AWS DMS automatically scales up to the maximum capacity units based on the load on the replication and scales back down to the minimum capacity units when the load decreases.
- Choose Create replication.
- After the replication is created, on the Actions menu, choose Start to start the replication.
It can take several minutes to provision the required capacity for replication. You can monitor the capacity utilization on the Monitoring tab.
On the Table statistics tab, you can monitor the tables being replicated.
On the Amazon S3 console, you can navigate to the S3 target bucket and verify the full load data and CDC data were written to the bucket path.
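For reference, the console steps above roughly map to a single create_replication_config call followed by start_replication. The following is a minimal boto3 sketch under the same choices (full load and CDC, Limited LOB mode, Single-AZ, and a wildcard table mapping); the endpoint ARNs, subnet group, security group, capacity values, and LOB size are placeholders, so verify the parameters against the AWS DMS Serverless documentation for your environment.

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Replicate every schema and table (% is a wildcard), matching the console example.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

config = dms.create_replication_config(
    ReplicationConfigIdentifier="sqlserver-to-s3",
    SourceEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
    ReplicationType="full-load-and-cdc",
    TableMappings=json.dumps(table_mappings),
    # Limited LOB mode with the maximum LOB size (in KB) found earlier.
    ReplicationSettings=json.dumps(
        {"TargetMetadata": {"SupportLobs": True, "FullLobMode": False,
                            "LimitedSizeLobMode": True, "LobMaxSize": 64}}
    ),
    ComputeConfig={
        "MinCapacityUnits": 1,
        "MaxCapacityUnits": 16,
        "MultiAZ": False,
        "ReplicationSubnetGroupId": "my-dms-subnet-group",
        "VpcSecurityGroupIds": ["sg-0123456789abcdef0"],
    },
)

# Start the full load and ongoing CDC replication.
dms.start_replication(
    ReplicationConfigArn=config["ReplicationConfig"]["ReplicationConfigArn"],
    StartReplicationType="start-replication",
)
```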
Data is now migrated from the SQL Server database to Amazon S3. The next step is to load this data into a transactional data lake and process raw data to create data marts.
Clean up
Complete the following steps to clean up the resources created in this post:
- Delete the AWS DMS replication:
- On the AWS DMS console, choose Serverless replications in the navigation pane.
- Select the replication you created and on the Actions menu, choose Delete.
- Delete the AWS DMS endpoint:
- On the AWS DMS console, choose Endpoints in the navigation pane.
- Select the endpoint you created and on the Actions menu, choose Delete.
- Empty and delete the S3 bucket:
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Select the bucket you created and on the Actions menu, choose Empty.
- Enter permanently delete to confirm deletion, then choose Empty.
- Select the bucket and on the Actions menu, choose Delete.
- Enter the name of the bucket to confirm deletion, then choose Delete bucket.
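If you created the resources with the boto3 sketches earlier in this post, a matching cleanup sketch could look like the following; the ARNs and bucket name are placeholders, and the replication must be stopped before its configuration can be deleted.

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")
s3 = boto3.resource("s3", region_name="us-east-1")

replication_config_arn = "arn:aws:dms:us-east-1:111122223333:replication-config:EXAMPLE"

# Stop the serverless replication (this call fails if it is already stopped), then delete it.
dms.stop_replication(ReplicationConfigArn=replication_config_arn)
dms.delete_replication_config(ReplicationConfigArn=replication_config_arn)

# Delete the source and target endpoints.
for endpoint_arn in (
    "arn:aws:dms:us-east-1:111122223333:endpoint:SOURCE",
    "arn:aws:dms:us-east-1:111122223333:endpoint:TARGET",
):
    dms.delete_endpoint(EndpointArn=endpoint_arn)

# Empty and delete the target bucket (use bucket.object_versions if versioning is enabled).
bucket = s3.Bucket("my-dms-target-bucket")
bucket.objects.all().delete()
bucket.delete()
```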
Conclusion
In this post, we discussed how you can migrate data from a SQL Server Standard Edition database to Amazon S3 using AWS DMS Serverless. We also showed how the data in the S3 bucket is written in CSV format for full load and CDC, with the transaction order preserved. AWS DMS Serverless removes the overhead of provisioning and scaling replication capacity for varying loads, and using CSV format with Amazon S3 offers a cost-effective way to store large amounts of raw data coming from source systems.
In Part 2, we discuss how transaction order makes it possible to load this data into a transactional data lake. In Part 3, we show how to process raw data to create data marts.
About the Authors
Shaheer Mansoor is a Senior Machine Learning Engineer at AWS, where he specializes in developing cutting-edge machine learning platforms. His expertise lies in creating scalable infrastructure to support advanced AI solutions. His focus areas are MLOps, feature stores, data lakes, model hosting, and generative AI.
Anoop Kumar K M is a Data Architect at AWS with a focus in the data and analytics area. He helps customers in building scalable data platforms and in their enterprise data strategy. His areas of interest are data platforms, data analytics, security, file systems, and operating systems. Personally, Anoop loves to travel and enjoys reading books in crime fiction and the financial domain.
Sreenivas Nettem is a Lead Database Consultant at AWS Professional Services. He has experience working with Microsoft technologies with a specialization in SQL Server. He works closely with customers to help migrate and modernize their databases to AWS.