Heroku is a fully managed platform as a service (PaaS) solution that makes it straightforward for developers to deploy, operate, and scale applications on AWS. Founded in 2007 and part of Salesforce since 2010, Heroku is the chosen platform for millions of applications, from development teams at small startups to enterprises with large-scale deployments. Keeping all of those applications running smoothly requires significant, ongoing investment in the infrastructure that underpins the platform, using the building blocks that are available from AWS.
In this post, we discuss an infrastructure upgrade that Heroku completed in 2023: migrating the storage backend for their application metrics and alerting features from self-managed Apache Cassandra clusters to Amazon DynamoDB. Heroku completed the migration with zero customer impact, and moving to DynamoDB increased the reliability of their platform while simultaneously reducing cost to serve. We cover the overall architecture of the system, why Heroku moved to DynamoDB, how they accomplished the migration, and the results they have seen since completing it.
Metrics as a service
Heroku’s application metrics are powered by an internal metrics as a service (MetaaS). MetaaS collects different observations from the applications running on Heroku, like the amount of time it takes to serve a particular HTTP request. These raw observations are aggregated to calculate per-application and per-minute statistics like the median, max, and 99th-percentile response time. The resulting time series metrics are graphed for customers in the Heroku dashboard, as well as used to drive alerting and auto scaling functionality.
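To make the aggregation step concrete, here is a minimal sketch in Python of the kind of per-application, per-minute rollup described above. It is purely illustrative and not Heroku's actual code; the tuple layout of a raw observation is an assumption.

```python
from collections import defaultdict
from statistics import median, quantiles

def aggregate_minute(observations):
    """Aggregate raw (app_id, epoch_seconds, response_ms) observations into
    per-application, per-minute summary statistics.

    Illustrative sketch only; MetaaS runs this kind of logic inside
    stream-processing jobs rather than over an in-memory list.
    """
    buckets = defaultdict(list)
    for app_id, ts, response_ms in observations:
        minute = ts - (ts % 60)  # truncate to the start of the minute
        buckets[(app_id, minute)].append(response_ms)

    summaries = {}
    for (app_id, minute), values in buckets.items():
        summaries[(app_id, minute)] = {
            "median": median(values),
            "max": max(values),
            # quantiles(n=100) returns the 1st..99th percentile cut points
            "p99": quantiles(values, n=100)[98],
        }
    return summaries
```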
At its heart, MetaaS is a high-scale multi-tenant time series data processing platform. Hundreds of thousands of observations per second are initially ingested into Apache Kafka on Heroku. A collection of stream-processing jobs consumes these observations from Kafka, calculates the different metrics Heroku tracks for each customer application, publishes the resulting time series data back to Kafka, and ultimately writes it to a database—originally Apache Cassandra on Amazon Elastic Compute Cloud (Amazon EC2)—for longer-term retention and query. MetaaS’s time series database stores many terabytes of data, with tens of thousands of new data points written every second and several thousand read queries per second at peak.
The following diagram illustrates this original architecture.
The case for DynamoDB
MetaaS’s setup with Cassandra had served Heroku well for many years. Towards the end of 2022, however, the Heroku engineering team began to explore other options for the backend storage. The Kafka clusters that MetaaS uses are provided by Apache Kafka on Heroku, backed by a team with lots of experience maintaining and tuning Kafka clusters. The Cassandra clusters, on the other hand, were specific to MetaaS.
Heroku believed it was important to move to a managed service, operated and maintained by a team of experts in the same way as their Kafka clusters, because only a very small team of engineers was overseeing their database infrastructure. They were looking for a managed database on AWS that could store a very large amount of data in a scalable, distributed system while maintaining fast write performance at any scale. Other teams at Heroku were already using DynamoDB for data storage and processing workloads with similar consistency and performance requirements, which made DynamoDB a natural choice for the MetaaS workload.
A careful migration
With the plan made and the code written, all that remained was the task of swapping out the backend storage of a high-scale, high-throughput distributed system without impacting existing customers. The architecture of MetaaS gave Heroku a significant advantage—they already had a set of stream-processing jobs for writing time series data from Kafka to Cassandra. This made it easy for them to stand up a parallel set of stream-processing jobs to write that same data to DynamoDB with no observable impact on the rest of the system as the first step of the migration.
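As a rough illustration of what such a parallel write path looks like, the sketch below writes one aggregated time series point to DynamoDB. The table name and key layout (`app_metric` partition key, `minute_ts` sort key) are assumptions for the example; the post does not describe MetaaS's actual schema or job framework.

```python
from decimal import Decimal

import boto3

# Hypothetical table and key names, for illustration only.
table = boto3.resource("dynamodb").Table("metaas-timeseries")

def write_point(app_id, metric, minute_ts, value):
    """Write one aggregated time series point to DynamoDB.

    In MetaaS this would run inside a stream-processing job that consumes
    aggregated metrics from Kafka; here it is reduced to a single put.
    """
    table.put_item(
        Item={
            "app_metric": f"{app_id}#{metric}",  # assumed partition key
            "minute_ts": minute_ts,              # assumed sort key (epoch seconds)
            "value": Decimal(str(value)),        # DynamoDB numbers must be Decimal
        }
    )
```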
Operational metrics are critical to Heroku customers. To ensure that data was being successfully written to DynamoDB, the team implemented a series of tests that built on their traditional unit and integration testing. They began running a small percentage of read queries to MetaaS against both Cassandra and DynamoDB to confirm that the data was consistent in both databases, logging any queries that produced different results between the two code paths. After testing in pre-production environments, they incrementally dialed the experiment up until 100% of queries were being run through both code paths. This change also had no observable impact, because customers still received the results from the Cassandra code path; however, it allowed them to find and fix a couple of tricky edge cases that had slipped through their more traditional unit and integration testing.
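The comparison logic can be sketched as a small wrapper that serves the query from the primary store, shadow-reads DynamoDB for a sampled fraction of requests, and logs any mismatch. The sampling flag and the injected read functions below are placeholders, not Heroku's production code.

```python
import logging
import random

logger = logging.getLogger("metaas.shadow_read")

SHADOW_READ_PERCENT = 5  # dialed up gradually toward 100

def query_metrics(query, read_cassandra, read_dynamodb):
    """Serve the query from Cassandra; for a sampled fraction of requests,
    also run it against DynamoDB and log any differences."""
    primary = read_cassandra(query)

    if random.uniform(0, 100) < SHADOW_READ_PERCENT:
        shadow = read_dynamodb(query)
        if shadow != primary:
            logger.warning("shadow read mismatch for %r", query)

    # Customers always receive the Cassandra result during this phase.
    return primary
```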
A smooth landing
With the data validation experiment showing that DynamoDB was consistently returning the same results as Cassandra, it was just a matter of time until they were ready to cut over. MetaaS stores data for no more than 30 days, after which it ages out and is deleted (for DynamoDB, using the convenient TTL feature). This meant that there was no need to orchestrate a lift-and-shift of historical data from Cassandra to DynamoDB. After validating that the same data was being written to both places, they could simply wait until the two databases were in sync.
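With DynamoDB's TTL feature, expiry amounts to an epoch-seconds attribute on each item plus a one-time table setting. The sketch below shows the idea; the table and attribute names are assumptions for the example.

```python
import time

import boto3

TABLE_NAME = "metaas-timeseries"        # hypothetical table name
RETENTION_SECONDS = 30 * 24 * 60 * 60   # 30 days

# One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
boto3.client("dynamodb").update_time_to_live(
    TableName=TABLE_NAME,
    TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
)

def with_ttl(item):
    """Stamp an item with an expiry time 30 days out; DynamoDB deletes
    expired items in the background at no additional cost."""
    item["expires_at"] = int(time.time()) + RETENTION_SECONDS
    return item
```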
Starting again with their test environments, they began to incrementally cut queries over to only read from DynamoDB, moving carefully in case there were any customer reports of weird behavior that had somehow been missed earlier. There were none, and 100% of queries to MetaaS have been served from DynamoDB since May of 2023. Heroku waited a few weeks just to be sure that they wouldn’t need to roll the change back.
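One common way to implement such an incremental cutover, shown here as a hedged sketch rather than Heroku's actual mechanism, is to hash a stable key such as the application ID so that each app is consistently served from the same backend as the rollout percentage increases.

```python
import zlib

DYNAMODB_READ_PERCENT = 25  # raised in steps from 0 to 100

def reads_from_dynamodb(app_id: str) -> bool:
    """Deterministically bucket each application so the same app always
    hits the same backend for a given rollout percentage."""
    bucket = zlib.crc32(app_id.encode("utf-8")) % 100
    return bucket < DYNAMODB_READ_PERCENT
```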
The following diagram illustrates the updated architecture.
Results
With more than a year of experience under their belt, Heroku is pleased with their choice of a managed database service. DynamoDB has been reliable and performant at scale. Heroku has reduced the operational overhead of patching and tuning database servers by as much as 90%, accomplishing their original goal. DynamoDB has also turned out to be both faster and cheaper than their previous self-hosted Cassandra clusters on Amazon EC2: it has reduced their maximum query latency by 75% and reduced their cost by approximately 50%.
For example, as shown in the following graph, before 18:00, they were querying Cassandra and seeing pretty regular spikes in p99 latency, whereas after 18:00, they were querying DynamoDB and not seeing those spikes any more.
Lessons learned
Heroku services access DynamoDB through VPC endpoints. AWS Identity and Access Management (IAM) policies require that traffic to DynamoDB goes through those VPC endpoints, which allows applications in the VPC to reach DynamoDB without any exposure to the public internet.
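As an illustration of that kind of guardrail (the endpoint ID, role name, and policy name below are placeholders), an IAM policy can deny DynamoDB access unless a request arrives through the expected VPC endpoint by using the `aws:sourceVpce` condition key.

```python
import json

import boto3

# Placeholder identifiers for illustration only.
VPC_ENDPOINT_ID = "vpce-0123456789abcdef0"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDynamoDBOutsideVpcEndpoint",
            "Effect": "Deny",
            "Action": "dynamodb:*",
            "Resource": "*",
            "Condition": {"StringNotEquals": {"aws:sourceVpce": VPC_ENDPOINT_ID}},
        }
    ],
}

boto3.client("iam").put_role_policy(
    RoleName="metaas-service-role",             # placeholder role name
    PolicyName="require-dynamodb-vpc-endpoint",
    PolicyDocument=json.dumps(policy),
)
```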
Heroku uses DynamoDB auto scaling to manage the provisioned read capacity for their MetaaS tables, because query traffic naturally varies throughout the week depending on how many customers are looking at metrics in the Heroku Dashboard. MetaaS’s write pattern is much more predictable, so they were able to manually dial in the write capacity for their tables at a utilization higher than auto scaling’s maximum target of 90%. For workloads with spiky consumption around a steady average, auto scaling can be a great way to save money compared to provisioning for peak capacity; for highly predictable workloads, manually provisioning exactly the capacity you need can reduce costs even further.
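DynamoDB auto scaling is driven by Application Auto Scaling. A minimal sketch of registering a target-tracking policy for read capacity looks like the following; the table name, capacity bounds, and target value are illustrative, not Heroku's settings.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "table/metaas-timeseries"  # hypothetical table name

# Let read capacity float between a floor and a ceiling...
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId=resource_id,
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=100,
    MaxCapacity=4000,
)

# ...and track a target utilization so capacity follows the weekly query cycle.
autoscaling.put_scaling_policy(
    PolicyName="metaas-read-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId=resource_id,
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)
```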
Writes to DynamoDB always consume at least one write capacity unit (WCU), but may consume more based on the size of the record being written—1 WCU per KB of data, up to a record size limit of 400 KB in DynamoDB. Although most of the records MetaaS stores are very small, they discovered some outliers that were slightly over the 400 KB limit. Heroku was able to solve that problem (and also save some money) by gzip-compressing these larger records before writing them to DynamoDB. MetaaS data is queried relatively infrequently, so the CPU-time cost of compressing and decompressing the data is negligible compared to the reduction in WCUs and the convenience of not having to change their schema.
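A hedged sketch of that idea in Python: serialize the record, gzip it, and store the compressed bytes as a binary attribute, reversing the steps on read. The table and attribute names are illustrative assumptions.

```python
import gzip
import json

import boto3

table = boto3.resource("dynamodb").Table("metaas-timeseries")  # hypothetical table

def put_compressed(app_metric, minute_ts, payload):
    """Store a large record as gzip-compressed JSON in a binary attribute,
    keeping the item well under the 400 KB item size limit."""
    table.put_item(
        Item={
            "app_metric": app_metric,  # assumed partition key
            "minute_ts": minute_ts,    # assumed sort key
            "body_gz": gzip.compress(json.dumps(payload).encode("utf-8")),
        }
    )

def get_decompressed(app_metric, minute_ts):
    """Fetch and decompress a record written by put_compressed."""
    item = table.get_item(
        Key={"app_metric": app_metric, "minute_ts": minute_ts}
    )["Item"]
    return json.loads(gzip.decompress(item["body_gz"].value))
```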
Lastly, early on in their testing, they were hitting some surprising performance bottlenecks writing data into DynamoDB. The limiting factor turned out to be a lack of configuration of the HTTP client used by the DynamoDB SDK—for high-scale workloads, it’s worth doing some tuning and not just going with the defaults. They saw significant improvements when they increased the number of concurrent connections allowed to DynamoDB, adjusted the number of idle connections to keep open for future reuse, and configured timeouts and retries that were more aggressive than the defaults.
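The post does not say which SDK Heroku uses, but with the Python SDK (boto3), for example, the pool size, timeout, and retry knobs live on `botocore.config.Config`; the values below are illustrative and should be tuned against your own workload.

```python
import boto3
from botocore.config import Config

# Illustrative settings; tune against your own workload and latency goals.
tuned = Config(
    max_pool_connections=200,  # allow more concurrent HTTP connections
    connect_timeout=1,         # fail fast instead of waiting on the default
    read_timeout=2,
    retries={"max_attempts": 5, "mode": "adaptive"},  # more aggressive retries
)

dynamodb = boto3.client("dynamodb", config=tuned)
```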
Conclusion
In this post, we discussed how Heroku upgraded their infrastructure by migrating from self-managed Cassandra clusters to DynamoDB. Heroku completed the migration with zero customer impact, increasing the reliability of their platform while reducing costs: their cloud infrastructure costs dropped by up to 50% after moving from self-managed Cassandra on EC2 to DynamoDB. The performance improvement comes from DynamoDB’s much more predictable query latency. With Cassandra they saw occasional latency spikes of up to ~80 ms at peak traffic, whereas with DynamoDB queries consistently complete in under 20 ms, which has significantly reduced the latency of delivering metrics results to their customers.
Do you have any other tips or questions about running high-scale workloads on DynamoDB? Let us know in the comments! To get started with DynamoDB, see the Amazon DynamoDB Developer Guide and the DynamoDB console.
This post is a joint collaboration between Heroku and AWS and is being cross-published on both the Heroku Blog and the AWS Database Blog.
About the Authors
Rielah De Jesus is a Principal Solutions Architect at AWS who has successfully helped various enterprise customers in the DC, Maryland, and Virginia area move to the cloud. In her current role, she acts as a customer advocate and technical advisor focused on helping organizations like Salesforce achieve success on the AWS platform. She is also a staunch supporter of Women in IT and is very passionate about finding ways to creatively use technology and data to solve everyday challenges.
David Murray is a Software Architect at Salesforce, currently working on all things backend at Heroku. He has previous work experience in the areas of networking, databases, security, technical writing, and delivering AWS re:Invent talks about his cat Sammy. When he’s not thinking about computers, he likes to row boats, ride bikes, and play board games.