This post is co-authored with Eric Boyer and Maria Hristova of Firmex.
Firmex is a leading Virtual Data Room provider with more than 20,000 new rooms opened every year. Customers upload thousands of documents of any size to be shared and downloaded tens of thousands of times while enforcing access control and permissions. Firmex has a long-proven history of security and availability that their customers rely on for their due diligence needs.
When using Firmex’s SaaS service, customers want to make sure their data is isolated and protected from other customers using the service. To meet that requirement, Firmex employs a strict data isolation strategy of containing customer data in a set of individual databases. Each customer has one tenant database and any number of additional databases to host each of their projects. For the application’s behavior, it relies heavily on over a hundred stored procedures for data access and modification.
The data architecture shown above offers numerous benefits in the context of Virtual Data Rooms, including simple data partitioning, security isolation, access control, and rebalancing. Firmex had around 65,000 databases running across four Microsoft SQL Servers in an on-premises data center deployed following this architecture. Many small databases allowed Firmex’s engineers to invest in horizontal scaling instead of needing to invest in ever larger machines. However, there were also drawbacks related to the sheer number of databases. Some of the problems included slowly approaching the maximum limit of the number of databases on each server of 32,000, managing backups for the large set of databases, and managing changes across the databases. Additionally, the licensing terms for this setup weren’t favorable, and upgrading was also more difficult.
In this post, we discuss how and why Firmex migrated 65,000 databases heterogeneously from their on-premises SQL Server to Amazon Aurora PostgreSQL-Compatible Edition.
Challenges with the Current Architecture
The absolute number of databases that Firmex managed was a closely monitored metric due to the hard limit in SQL Server. In 2020, Firmex’s engineers determined that they were fast approaching the database limit. At the current creation levels, they had less than one year until the limit was hit. Cloud providers limit the number of databases that can be created on any instance. This was the major roadblock in adopting the cloud for Firmex’s data storage. Finding a solution that fit their data model in the cloud was the priority.
In their effort to address the growing number of databases, they proceeded with the approach of shifting tenant isolation from a database-based isolation solution to a schema-based isolation solution. Having tenants separated by schemas was to allow the application to operate within database limit restrictions and give the same isolation benefits. A comparison was done with SQL Server schema isolation compared to PostgreSQL schema isolation. The major benefit of PostgreSQL in comparison was being able to use the search_path feature by specifying a customer data schema, a code schema to share functions, and a common schema. SQL Server was unable to provide this level of control on object resolution and would require code to be deployed to each individual schema, taking too long for deployments. There was also a need to evaluate the potential for schema isolation. Firmex’s engineers along with the team of AWS Specialist Solutions Architects explored the other features as well as the cost, reliability, and performance. Aurora PostgreSQL-Compatible had very attractive price-performance ratios, and Firmex would be able to use AWS Graviton processors to get a larger memory allocation to benefit the application. The benefits of having multi-AZ failovers simplified their processes for providing high availability with the service, with Aurora taking over replication management across Availability Zones.
Moving from SQL Server to PostgreSQL was a daunting task. Not only was there a migration of data to contend with, but a complete transformation of stored procedures, DDL, and data isolation. Firmex was already on their journey to migrate services to the cloud and had utilized Aurora PostgreSQL-Compatible for some of their services, including a session handler and an email scheduler. The pricing and performance of Aurora PostgreSQL-Compatible made it straightforward to explore this as an option for storing their customer data as well. For the migration, they decided to use AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT).
The following diagram shows the heterogeneous migration architecture from Microsoft SQL Server to an Aurora PostgreSQL database using AWS Database Migration Service Schema Conversion (DMS SC), a fully managed service. While this diagram demonstrates the DMS SC approach, this blog post focuses on achieving similar results using AWS Schema Conversion Tool (SCT). For more information about the managed service option, see Migrating SQL Server databases to Amazon Aurora PostgreSQL with DMS Schema Conversion.
A Path Forward
Firmex’s team started working on converting their stored procedures before the migration project started to make sure they could prove that the application worked correctly on the PostgreSQL instances. Due to the large amount of logic in the T-SQL stored procedures, an automated migration tool proved difficult to use without manual intervention. This took over 9 months of effort to develop replacement patterns and testing each of the 260 stored procedures to confirm that it worked correctly in the new server environment. There were enough differences between the database engines and the data access libraries being used that a simple drop-in replacement didn’t work for the majority of the database calls.
The following diagram illustrates the simplified architecture of the Firmex migration.
SCT Challenges
The application data definition language for both the tenant-isolation and the project-isolation databases were converted to PostgreSQL schemas using the AWS SCT.
Firmex’s SQL Server databases were using a case-insensitive collation that wasn’t directly mappable to PostgreSQL. This required their engineers to define a case-insensitive collation and attach it to any of the columns in the database that required case insensitivity.
Data Migration Challenges
AWS DMS was used to migrate the data from SQL Server to PostgreSQL. Some of the biggest constraints were the large number of databases and that the application required all the databases to be in the same location as their application to maintain low latency accessibility. Below are the challenges we worked through during the database migration.
Endpoint Limits
Ideally when using AWS DMS, the migration strategy to achieve the best performance is to have one task created per database for continuous data transfer between the two systems over a long period of time. But for Firmex’s application, this would require 65,000 tasks. This exceeded the quota of 600 tasks per account and the endpoint limit of 1,000 per account (Note, if your migration requires a limit increase, please work with AWS Support and your AWS account team – see DMS API limits).
AWS DMS Task limits
Due to the large number of databases and limits on the number of AWS DMS Tasks that are allowed to be created, their engineers were unable to create enough tasks at once to migrate all of the databases in a continuous manner. To work around this limitation, they created one-shot migration tasks on an as-needed basis. Custom tooling was developed using the C# AWS SDK to properly setup the replication endpoints, create AWS DMS Tasks, execute the one-shot migration tasks, and record the final state all while limiting the rate at which these operations could occur to stay below the DMS API limits.
Data Transfer
Firmex’s engineers formulated a process to complete the following: provision a VPN tunnel, set up the AWS DMS tasks and configure the endpoints for the migration, set the customer to maintenance mode, start the AWS DMS migration tasks, monitor the progress, remove the resulting tasks and endpoints after the migration was complete, and move onto the next database that was targeted to be migrated. Many migration tasks occurred in parallel to make sure the migration process completed within the time constraints and effectively used the resources they had dedicated to this process. The progress and state of the migration was recorded in their application database to make sure they knew which customers were migrated and what state they were in.
API Limits
Due to the aggressive parallelization and the size of the data involved, the migration process ran into multiple issues that Firmex needed to work through. The first was a bandwidth constraint due to a limit of their data center. This caused multiple difficult-to-diagnose issues during the migration because there would be dropped packets (SSL errors, connection hangs) when the bandwidth limit was exceeded.
The second limitation was the frequency that the process was creating and removing tasks. Because each task was targeting a different database, a new task was created, running into limits within AWS DMS. Along with this, the team identified that the endpoints were not being deleted quick enough and still counted against the limit. This required Firmex to ask for quota limit increases in some areas and work around API call rates that couldn’t be raised. Working with their AWS account team, they were able to increase the number of endpoints per replication instance from 100 to 200 and increase the account limit from 1,000 to 3,000.
To work around the API limit where it could not be raised, Firmex needed to push as many API calls outside the migration window by setting up replication instances and endpoints so that the calls during the migration window had a better chance at succeeding.
Maintenance Windows
Migration windows were the next problem. Firmex needs to make sure they’re hitting their availability SLA for their customers. The first waves of migrations were targeting tenants that were used for testing and validation. These initial waves gave them confidence in the approach as well as the timings related to the number and sizes of projects.
Migrating an actual customer meant making the individual tenant unavailable for a period of time using a built-in application feature. Because customers had a variable number of projects in their site, this required carefully analyzing the behavior of the migration to match the available migration window. Targeted customer windows made sure that downtime only affected a small set of customers. Firmex’s engineering team worked closely with their customer support team to fit customers within multiple small windows of over 45 migration waves over the course of 3 months. Some migration waves were dedicated to one customer who had over a thousand databases to migrate.
Database Connection Counts
As more tenants were migrated to AWS, a high connection count to the databases was observed in Amazon RDS Performance Insights. To connect to a tenant schema, the search_path would be included in the connection string. Connections were created using the library npgsql. Their application was creating thousands of connections to the database with small per-schema connection pools. Unique connection strings per schema caused the library to not share connections. PostgreSQL has an OS process per connection model, allocating resources like memory for each one. This caused high memory usage and occasionally resulted in I/O pressure on the machines. The application and the migration process competed for connections on the server, which caused high connection usage, running up against the max connection limits. Being on Aurora, they were able to effortlessly scale up the database’s memory to handle the increased load until they addressed how the connections to the databases were created and managed. Their application split out the setting of the search_path from the initial connection string allowing the connections to be pooled and reused. This resulted in a 10x improvement bringing their connection count from 4,000-6,000 down to 200-300.
Performance
Firmex’s T-SQL stored procedures relied heavily on temporary tables to store intermediate results and pass data between stored procedure calls. In PostgreSQL, this requires heavy usage of the system tables for tracking, which develops additional pressure on the auto-vacuum process. They were seeing auto-vacuum processes that would not complete and error messages related to the stats_collector being unresponsive pointing to temporary tables being the source of the problem. One of the most heavily used stored procedures that utilized temporary tables and targeted the writer node was refactored and re-targeted to the reader node causing the load on the writer database to drop from between 2-4 vCPUs down to < 0.5 while improving the average response times by 75% (from 113ms to 27ms).
The built-in Performance Insights monitoring tool in Aurora PostgreSQL-Compatible has proven to be valuable to Firmex in analyzing the database and IO metrics and making sure they’re comfortable to downsize or upsize instance types and prioritize their efforts in terms of query performance tuning and application optimization. They used a combination of vCPU load and memory metrics to determine the appropriate sizing.
Given how straightforward it is to add additional reader nodes to an Aurora cluster, one of the efforts they continue to concentrate on is redirect read-only workloads to the readers, which leads to better resource utilization on the writer node, allowing their application to scale appropriately.
Overall, the process to redevelop the application to operate on Aurora PostgreSQL-Compatible took dedicated effort. Over the course of 18 months, Firmex was able to migrate their T-SQL to PL/pgSQL and work through a simple-to-understand migration plan. Post-migration, they have been able to realize the benefits using Aurora PostgreSQL-Compatible to right-size their database instances, reduce operating costs, and take advantage of automatic and fast failovers, which in turn has helped Firmex meet availability targets.
Forward thinking
Because their application was designed for SQL Server, there are some future improvements that Firmex is considering to further optimize their usage of PostgreSQL. Current areas of high focus include database connection management using Amazon RDS Proxy, reducing temporary table usage, auto-vacuum tuning, using the more expressive type system, using the reader nodes more effectively, and being able to take advantage of blue/green deployments to reduce engine upgrade times to an absolute minimum. In the future, they plan to use the connection management capabilities of RDS Proxy to drive value in their environment.
Conclusion
Over the course of 18 months, Firmex was able to migrate their Microsoft SQL Server Database from on-premise data centers to Amazon Aurora for PostgreSQL in AWS while meeting their customer SLA. Switching to Amazon Aurora PostgreSQL saved them over $125k CAD in yearly licensing costs and freed them from having to purchase new hardware. Post migration, after some additional performance improvements, Firmex has been able to take advantage of the flexibility of Amazon Aurora to reduce the sizes of their 8 production instances from r6g.8xlarge, during the migration, down to r6g.2xlarge. This has saved 75% of the RDS operating costs, and they were able to realise these savings immediately. Relying on a managed service like Amazon Aurora PostgreSQL frees up developers’ time to focus on delivering application features. Aurora’s automatic and fast failovers have helped them meet their availability targets and simplified the process of disaster recovery.
The services and tooling that are now available will help them continue to grow without worrying about limits and give them the flexibility to tackle any challenges they may have in the future.
For further reading about Firmex, check out Firmex’s blog.
About the authors
Eric Boyer is the Senior Lead Architect at Firmex Inc. His expertise lies in designing secure, scalable, and responsive solutions that satisfy customer needs and technical requirements. He has been responsible for setting the direction on Firmex’s transition to the cloud and working to break down their monolithic application into manageable AWS cloud native components. Outside of work he enjoys a good science fiction novel, going on a bike ride and playing board games with his family.
Maria Hristova is a technology executive who specializes in strategic technical transformations and building highly capable diverse teams. She oversaw Firmex’s transition into a cloud first development organization that is better suited to tackle the company’s growth goals. Maria’s strategic vision and leadership were crucial in guiding numerous technical projects to completion. Maria enjoys discussing the finer points of whisky distilling and Taylor Swift over a glass of wine while in the company of her cats.
Kiran Singh is a Senior Database Specialist Solutions Architect at AWS. Kiran specializes in Amazon RDS and Amazon Aurora, and is a subject matter expert for PostgreSQL. Kiran works with Partners to design and implement database migration and modernization strategies and provides assistance to customers for building scalable, secure, performant, and robust database architectures in the AWS Cloud.
Radhika Chakravarty is a Database Solutions Architect with Amazon Web Services. She works with customers and partners by providing technical assistance to design and implement cloud migration projects, helping them to migrate and modernize their existing databases to AWS Cloud.
Dylan Souvage is a Partner Solutions Architect based in Austin, Texas. Dylan loves working with customers to understand their business needs and enable them in their cloud journey. In his spare time, he enjoys going out in nature and going on long road trips.
Daniel Rios is a WWSO Data & AIML Sales Specialist. He works as a trusted advisor to customers by providing guidance on how they can unlock new value streams, solve key business problems, and deliver results for their customers using AWS Data & AIML Services.
Source: Read More