Resiliency in the context of database workloads refers to the ability of a database system to withstand and recover from various types of failures or disruptions, such as hardware failures, software bugs, network outages, or even natural disasters. Resilient database systems are designed to maintain data integrity, availability, and performance even in the face of these challenges. Resiliency is crucial for database workloads because databases are often the backbone of critical business applications and systems. Downtime or data loss can have severe consequences, including lost revenue, customer dissatisfaction, regulatory compliance issues, and reputational damage. By implementing resilient database architectures and strategies, organizations can make sure that their data and applications remain accessible and reliable, even during unexpected events or failures. This, in turn, helps maintain business continuity, reduce the risk of data loss, and provide a better overall user experience for customers and stakeholders.
Availability is a key metric for quantifying resiliency. High availability (HA) focuses on preventive measures that keep the database running through failures and disruptions, whereas disaster recovery (DR) covers how to recover after a database failure. HA and DR are separate, yet equally important, facets of a highly resilient database architecture. Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are two key metrics that are closely related to resiliency in database workloads. RPO is the maximum amount of data loss, expressed as a span of time, that an organization can tolerate in the event of a disruption or failure. It represents the point in time to which data must be recoverable, so that critical information is not lost beyond that point. A lower RPO means that the organization can tolerate less data loss, which is typical for mission-critical applications. RTO is the maximum acceptable time to restore normal operations and access to data after a disruption or failure. It represents the longest downtime that an organization can tolerate before the impact on the business becomes unacceptable. A lower RTO means that operations must be restored more quickly, which is also typical for critical applications. By defining RPO and RTO for each workload, organizations can design and implement resilient database architectures and DR strategies that provide the required data protection and availability, even in the face of unexpected events or failures. This, in turn, helps maintain business continuity and minimize the impact of disruptions on the organization’s operations and customers.
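As a rough illustration of how these metrics translate into numbers, the following Python sketch converts an availability target into a yearly downtime budget and checks an incident against RPO and RTO targets. The function and class names are hypothetical, and the 365-day year is a simplifying assumption.

```python
from dataclasses import dataclass

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


@dataclass
class Objectives:
    rpo_seconds: float  # maximum tolerable data loss, expressed as time
    rto_seconds: float  # maximum tolerable outage duration


def meets_objectives(obj: Objectives, data_loss_seconds: float,
                     outage_seconds: float) -> bool:
    """True if an incident stayed within both the RPO and the RTO."""
    return (data_loss_seconds <= obj.rpo_seconds
            and outage_seconds <= obj.rto_seconds)


if __name__ == "__main__":
    print(round(downtime_budget_minutes(99.9), 1))
    print(meets_objectives(Objectives(rpo_seconds=60, rto_seconds=900), 30, 600))
```

Under these assumptions, a 99.9% availability target leaves about 525.6 minutes of downtime per year, and 99.99% leaves about 52.6 minutes.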
AWS provides multiple options to run your Oracle Database workloads, including fully managed database service options like Amazon Relational Database Service (Amazon RDS) for Oracle and Amazon RDS Custom for Oracle. Databases that don’t qualify for Amazon RDS for Oracle due to known limitations can be deployed on Amazon Elastic Compute Cloud (Amazon EC2) as self-managed databases. In that case, the customer is responsible for database installation and all aspects of management, including OS and database upgrades, backups, and maintenance. You can use the inherent qualities of the AWS Cloud, such as Amazon EC2 and Amazon Elastic Block Store (Amazon EBS) provisioning, scalability, elasticity, and geographic footprint, to architect a highly resilient and performant database environment for demanding business applications.
In this post, we dive into the various architecture patterns and options available for both compute and storage layers while configuring your self-managed Oracle databases on Amazon EC2 to comply with your HA and DR requirements.
Architecture patterns
Oracle databases on Amazon EC2 support Oracle Data Guard, Oracle Active Data Guard, Oracle GoldenGate, AWS Database Migration Service (AWS DMS), and Amazon FSx options for replication to provide high availability and disaster recovery. Third-party solutions available in AWS Marketplace also support replication for Oracle databases. Both Oracle and third-party solutions can be used to replicate databases across Availability Zones within an AWS Region and across Regions. Oracle databases can be replicated from and to your on-premises data centers as well. AWS DMS can be used to replicate all or a subset of tables. As with any other database, compute and storage are the building blocks of a database system. In the next two sections, we cover key considerations for choosing Oracle Database AMIs and EBS storage before diving into the architecture patterns.
Oracle Database AMIs
AWS Marketplace offers many Oracle Database AMIs provided through various partners and vendors. You can also build your own AMI: start an EC2 instance with an operating system AMI, and then download and install Oracle Database software from the Oracle website just as you would on premises. Oracle certifies a number of operating systems such as Red Hat Enterprise Linux, Oracle Linux, Microsoft Windows Server, and others. Refer to the Oracle documentation or the following AWS whitepaper to choose the right operating system for your Oracle database on Amazon EC2. In addition, consider the Oracle Database lifecycle policy and the operating system lifecycle policy when deploying Oracle databases on Amazon EC2.
Common Amazon EBS storage considerations
When using Amazon EBS storage, consider the following:
- Before designing the storage layout of Oracle Database on Amazon EC2, we recommend familiarizing yourself with IOPS and throughput offered by Amazon EBS volume types and learning to calculate the baseline and burstable IOPS and throughput for these volumes. EC2 instances also have IOPS and throughput limits. For more details, see Amazon EBS-optimized instance types. When running an Oracle database on Amazon EC2, you use EBS volumes for database storage.
- If your workload requires low latency and high IOPS, consider EBS Provisioned IOPS SSD volumes (io1 or io2 Block Express) as the preferred option, if they are available in your Region. Throughput depends on the average I/O size, the IOPS provisioned at the volume level, and the EC2 instance limits. For example, a 12,000 IOPS io2 volume delivers a maximum of 3,000 MB/s throughput, whereas a 12,000 IOPS io1 volume delivers a maximum of 500 MB/s. Both io1 and io2 Block Express volumes allow you to modify the volume size and provisioned IOPS.
- For production and general-purpose workloads that require consistent and predictable performance, choose gp3. gp3 volumes allow you to modify IOPS and offer a good balance between price and performance.
- With io2 Block Express volumes, your database workloads benefit from consistent sub-millisecond latency, enhanced durability by 100 times over io1 volumes (99.999% for io2 compared to 99.9% for io1), and 20 times more IOPS/GiB from provisioned storage (up to 1,000 IOPS per GiB) at the same price as io1. Refer to this blog post for details on benefits of migrating to io2 Block Express, and benchmark latency results.
- With Amazon EBS Elastic Volumes, you can increase your EBS volume size and change IOPS or volume type, but you are still limited by the maximum size and IOPS supported by individual EBS volumes. To achieve higher IOPS and throughput, you can use Logical Volume Manager (LVM) to create Linux file systems striped across multiple EBS volumes, or use Oracle Automatic Storage Management (Oracle ASM). With Oracle ASM, you can use multiple EBS volumes to create Oracle ASM disk groups.
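The relationship between IOPS, I/O size, and the volume and instance limits described in this list can be sketched in a few lines of Python. This is a simplified model, not an AWS sizing tool: the function names are hypothetical, the 256 KiB I/O size and the 5,000 MB/s and 40,000 IOPS instance limits are assumptions chosen for illustration, and MB/s and MiB/s are treated as interchangeable. The volume caps of 3,000 and 500 come from the io2 and io1 example above.

```python
def effective_throughput_mib_s(provisioned_iops: int, io_size_kib: float,
                               volume_cap_mib_s: float,
                               instance_cap_mib_s: float) -> float:
    """IOPS x I/O size, capped by the volume limit and the instance limit."""
    demand_mib_s = provisioned_iops * io_size_kib / 1024
    return min(demand_mib_s, volume_cap_mib_s, instance_cap_mib_s)


def striped_iops(volume_iops: list[int], instance_iops_cap: int) -> int:
    """Aggregate IOPS of a stripe set (LVM or Oracle ASM), capped by the instance."""
    return min(sum(volume_iops), instance_iops_cap)


if __name__ == "__main__":
    # 12,000 IOPS at an assumed 256 KiB per I/O, hypothetical instance cap of 5,000
    print(effective_throughput_mib_s(12000, 256, 3000, 5000))  # io2 example cap
    print(effective_throughput_mib_s(12000, 256, 500, 5000))   # io1 example cap
    # Four striped volumes of 16,000 IOPS behind a hypothetical 40,000 IOPS instance
    print(striped_iops([16000] * 4, 40000))
```

The second function shows why striping only helps up to a point: once the sum of the volume limits exceeds the instance limit, the instance becomes the bottleneck.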
In the next sections, we walk through different architecture patterns covering both compute and storage configurations to meet your RPO and RTO requirements. You can mix and match these patterns to design resiliency that fits your workload and SLA requirements.
Pattern 1: Oracle Database on Amazon EC2 in a single Availability Zone, file system-based storage on Amazon EBS, and backups to Amazon S3
To begin, let’s consider a less critical workload or dev/test instance that doesn’t fall under strict HA and DR SLAs and needs to be more cost-efficient. In the following configuration, we have a bastion host or a web application configured in a public subnet and Oracle Database in a private subnet. Oracle Database is installed on a single EC2 instance. Amazon EC2 offers the broadest and deepest compute choice from general purpose, compute intensive, memory intensive, and more instance types. In general, Oracle Database workloads are memory intensive, so a wide choice of memory optimized instance types along with general purpose are suitable for Oracle databases. Amazon EC2 also offers EC2 dedicated hosts and bare metal instances for certain use cases.
The physical storage for an Oracle database consists of a set of files (data, temp, redo, control files, and so on) that are stored on disk. You can either use an operating system file system or Logical Volume Manager (LVM) for creating and managing these files. For simple databases, you might just use a single EBS volume for database storage. To store the database files, you partition it and create file systems. When you create an EBS volume, it is automatically replicated within its Availability Zone to prevent data loss due to failure of any single hardware component. In this example pattern, we create volume groups and add the EBS volumes to the volume groups. Then we create logical volumes from the volume groups and create file systems on top of the logical volumes.
With Amazon EBS Elastic Volumes, you can increase your EBS volume size and change IOPS or volume type while the volume is in use. After every modification, you will have to wait at least 6 hours before you can modify the same volume again.
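When you script volume changes, the waiting period is easy to overlook. The following sketch shows one way to check it before attempting another modification. The helper names are hypothetical, and the six-hour value is taken from the paragraph above rather than queried from AWS.

```python
from datetime import datetime, timedelta

# Waiting period between modifications of the same volume, per the text above
MODIFICATION_COOLDOWN = timedelta(hours=6)


def next_modification_time(last_modified: datetime) -> datetime:
    """Earliest time at which the same volume can be modified again."""
    return last_modified + MODIFICATION_COOLDOWN


def can_modify(last_modified: datetime, now: datetime) -> bool:
    """True if the waiting period since the last modification has elapsed."""
    return now >= next_modification_time(last_modified)


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 8, 0)
    print(next_modification_time(last))
    print(can_modify(last, datetime(2024, 1, 1, 12, 0)))
```

Because of this waiting period, it can be worth sizing a change for the expected growth rather than planning several small increments in one day.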
This is a standard pattern suitable for dev/test databases and production database workloads that don’t require strict RPO or RTO. You can use Oracle native tools, like Oracle Data Pump and Oracle Recovery Manager (RMAN), to satisfy data protection, disaster recovery, and compliance requirements. In this simple pattern, we use AWS Backup to back up the file system and use RMAN to back up the archive logs to Amazon Simple Storage Service (Amazon S3), which is an efficient method to perform Oracle database point-in-time recovery in the event of a disaster. The same backup from in-Region Amazon S3 can be copied to an S3 bucket in a secondary Region for cross-Region disaster recovery as well. Refer to Using AWS Backup and Oracle RMAN for backup/restore of Oracle databases on Amazon EC2: Part 1 for more details on configuring backups. Because there is no standby Oracle database instance in the DR Region, this option is also cost-efficient.
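To reason about what RPO and RTO a backup-only pattern like this can deliver, a back-of-the-envelope model helps. The sketch below is an assumption-laden estimate, not a measurement: it assumes data loss is bounded by the archive log backup interval plus the longest gap between log switches, and that recovery time is the sum of provisioning, restore, and log apply time. All names and input values are hypothetical.

```python
def worst_case_rpo_minutes(archive_backup_interval_min: float,
                           max_log_switch_interval_min: float) -> float:
    """Upper bound on data loss if only the backups in Amazon S3 survive.

    Redo written since the last log switch has not been archived, and the
    most recent archive log backup may be up to one interval old.
    """
    return archive_backup_interval_min + max_log_switch_interval_min


def estimated_rto_minutes(db_size_gib: float, restore_rate_gib_per_min: float,
                          log_apply_min: float, provisioning_min: float) -> float:
    """Rough recovery time: provision an instance, restore, then apply logs."""
    return provisioning_min + db_size_gib / restore_rate_gib_per_min + log_apply_min


if __name__ == "__main__":
    print(worst_case_rpo_minutes(15, 5))
    print(estimated_rto_minutes(500, 5, 10, 15))
```

If the resulting numbers exceed your targets, that is a signal to shorten the archive log backup interval or to consider one of the standby-based patterns that follow.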
Pattern 2: Oracle Database on Amazon EC2 in one Availability Zone, Data Guard standby in a second Availability Zone with synchronous or asynchronous replication, and multiple EBS volumes
Oracle Data Guard and Active Data Guard provide high availability and disaster recovery during maintenance operations and in case of outages. Data Guard is a feature of Oracle Database Enterprise Edition itself and doesn’t require separate licensing. Active Data Guard, on the other hand, is an Oracle Database Enterprise Edition option and requires separate licensing. Refer to the Oracle documentation for more details on Active Data Guard, which includes a number of additional features. Data Guard maintains one or more standby databases as transactionally consistent copies of the production database. If the production database becomes unavailable because of a planned or an unplanned outage, Data Guard can switch a standby database to the primary role, minimizing the downtime associated with the outage.
Depending on the distance between the primary and standby databases and the application’s tolerance for latency, you can configure synchronous or asynchronous replication. Synchronous replication can achieve zero RPO at the cost of added commit latency on the primary, whereas asynchronous replication avoids that performance impact but results in a non-zero RPO.
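The trade-off between the two transport types can be illustrated with a deliberately simplified model. The sketch below assumes that, with synchronous transport, the local redo write and the shipment to the standby overlap, so a commit waits for the slower of the two, and that with asynchronous transport the potential data loss equals the current transport lag. These are modeling assumptions for illustration, not a description of Oracle internals, and the function names are hypothetical.

```python
def commit_latency_ms(local_redo_write_ms: float, sync: bool,
                      network_rtt_ms: float = 0.0,
                      standby_write_ms: float = 0.0) -> float:
    """Approximate commit latency on the primary under the stated assumptions."""
    if not sync:
        # Asynchronous transport: commits do not wait for the standby
        return local_redo_write_ms
    # Synchronous transport: wait for the slower of the local write and
    # the round trip to the standby plus its redo write
    return max(local_redo_write_ms, network_rtt_ms + standby_write_ms)


def expected_rpo_seconds(sync: bool, transport_lag_seconds: float = 0.0) -> float:
    """Zero data loss with synchronous transport, else the transport lag."""
    return 0.0 if sync else transport_lag_seconds


if __name__ == "__main__":
    print(commit_latency_ms(1.0, sync=True, network_rtt_ms=2.0, standby_write_ms=1.0))
    print(commit_latency_ms(1.0, sync=False))
    print(expected_rpo_seconds(sync=False, transport_lag_seconds=5.0))
```

The model makes the dependency on network round-trip time explicit, which is why the distance between the primary and the standby matters when choosing synchronous transport.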
In the following configuration, the primary Oracle database is configured on Amazon EC2 in one Availability Zone and a standby database is in another Availability Zone within the same Region. The standby database is replicated from the primary using Data Guard replication. Based on your RTO and RPO requirements, you can set up Data Guard with synchronous or asynchronous replication.
The directory structure for database installation requires several file systems. The file systems used for storing Oracle software binaries, trace, and log files are critical for operations but don’t have heavy performance requirements as compared to the data and log files. As a best practice, consider the following:
- Data and log files should be on separate volumes.
- Copies of control files should be stored on file systems that are created on separate volumes.
- Using a dedicated EBS volume for a swap file simplifies management and improves system stability.
- Redo log files are written sequentially by the Oracle database instance Log Writer (LGWR) process. Log file systems must be designed to support this I/O pattern. For example, closely monitor EBS volume IOPS and throughput to make sure that the volume isn’t reaching its provisioned limits.
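The monitoring recommendation in the last item can be expressed as a simple utilization check. The following sketch assumes you already collect observed IOPS and throughput for a volume from your monitoring system. The metric names, the sample values, and the 80% alert threshold are hypothetical.

```python
def utilization_alerts(observed: dict[str, float], provisioned: dict[str, float],
                       threshold: float = 0.8) -> dict[str, float]:
    """Return each metric whose observed/provisioned ratio reaches the threshold."""
    alerts = {}
    for name, value in observed.items():
        ratio = value / provisioned[name]
        if ratio >= threshold:
            alerts[name] = ratio
    return alerts


if __name__ == "__main__":
    # Hypothetical redo log volume provisioned with 3,000 IOPS and 125 MiB/s
    observed = {"iops": 2700, "throughput_mib_s": 60}
    provisioned = {"iops": 3000, "throughput_mib_s": 125}
    print(utilization_alerts(observed, provisioned))
```

In this example only the IOPS metric is flagged, which suggests raising provisioned IOPS on the redo volume before LGWR writes start to queue.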
Pattern 3: Primary DB instance on Amazon EC2 in one Availability Zone, active Data Guard standby in a second Availability Zone, multiple EBS volumes with Oracle ASM
The following architecture pattern is an extension of pattern 2, where we enable Oracle Data Guard Fast-Start Failover (FSFO) for managing automatic failovers and configuring Oracle ASM on top of EBS volumes for volume management, striping, predictable performance, and other Oracle ASM capabilities.
FSFO relies on an observer, a Data Guard broker client that runs on a separate machine and monitors the primary and standby databases. FSFO increases the availability of the database by eliminating the need for manual involvement in the failover process. With Oracle Database version 12.2 and above, it’s also possible to configure multiple observers with a single Data Guard broker configuration. In the preceding configuration, you create your primary Oracle database in one Availability Zone, a Data Guard standby instance in a second Availability Zone, and run the FSFO observer on an EC2 instance in a third Availability Zone for maximum availability. Placing them in separate Availability Zones is not a requirement but a best practice for increased resiliency.
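The decision that FSFO automates can be pictured with a toy model. The sketch below is loosely inspired by the Data Guard broker properties FastStartFailoverThreshold and FastStartFailoverLagLimit, but it is not the broker’s actual algorithm: the class, the function, and the 30-second values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ObserverView:
    observer_sees_primary: bool   # can the observer reach the primary?
    standby_sees_primary: bool    # can the standby reach the primary?
    outage_seconds: float         # how long the primary has been unreachable
    standby_lag_seconds: float    # how far the standby is behind the primary


def should_fail_over(view: ObserverView, threshold_seconds: float = 30.0,
                     lag_limit_seconds: float = 30.0) -> bool:
    """Toy rule: fail over only if both parties lost the primary for long
    enough and the standby is acceptably current."""
    primary_lost = not view.observer_sees_primary and not view.standby_sees_primary
    waited_long_enough = view.outage_seconds >= threshold_seconds
    standby_current_enough = view.standby_lag_seconds <= lag_limit_seconds
    return primary_lost and waited_long_enough and standby_current_enough


if __name__ == "__main__":
    print(should_fail_over(ObserverView(False, False, 45.0, 2.0)))
    print(should_fail_over(ObserverView(False, True, 45.0, 2.0)))
```

The second call returns False because the standby can still reach the primary, which illustrates why placing the observer in a third Availability Zone helps distinguish a primary failure from a network partition that affects only the observer.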
In addition to FSFO, this option includes Oracle ASM configuration on top of EBS volumes. Oracle ASM is a high-performance database volume manager that simplifies the storage management of Oracle Database. In addition, it provides predictable performance, availability, and scalability. A disk group in Oracle ASM comprises one or more disks, and Oracle ASM stripes data across those disks to provide uniform performance. You can create separate disk groups for redo logs and data. In the preceding configuration, we have created a redo disk group for redo logs with two disks and a data disk group with three disks. It’s important to make sure that the size, IOPS, and throughput of all EBS volumes in an Oracle ASM disk group are identical. This is because when you create a tablespace on Oracle ASM, Oracle writes the data proportionally on a per-disk basis. If the EBS volumes vary in size, the data is not distributed evenly across the disks, which leads to performance degradation. For more details on implementing and managing Oracle ASM with EBS volumes, refer to Using Amazon EBS elastic volumes with Oracle databases (part 3): databases using Oracle ASM.
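The effect of mixing disk sizes can be made concrete with a small calculation. The sketch below assumes that data is placed in proportion to disk size, as described above, and that I/O follows the data. The function names and the sample sizes and IOPS values are hypothetical.

```python
def data_per_disk_gib(disk_sizes_gib: list[float], data_gib: float) -> list[float]:
    """Data placed on each disk when allocation is proportional to disk size."""
    total = sum(disk_sizes_gib)
    return [data_gib * size / total for size in disk_sizes_gib]


def usable_aggregate_iops(disk_sizes_gib: list[float], per_disk_iops: float) -> float:
    """Aggregate IOPS reachable before the busiest (largest) disk saturates,
    assuming every disk has the same IOPS limit and I/O follows the data."""
    return per_disk_iops * sum(disk_sizes_gib) / max(disk_sizes_gib)


if __name__ == "__main__":
    print(data_per_disk_gib([100, 100, 200], 200))
    print(usable_aggregate_iops([100, 100, 100], 3000))  # identical disks
    print(usable_aggregate_iops([100, 100, 200], 3000))  # one larger disk
```

With three identical disks, the disk group can use all three IOPS allocations. With one disk twice as large, that disk receives half of the I/O and saturates first, so under this model the group reaches only two disks’ worth of IOPS.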
Oracle ASM also supports multiple redundancy types, with common ones being NORMAL, EXTERNAL, and HIGH. If you specify mirroring for a file, Oracle ASM automatically stores redundant copies of the file extents in separate failure groups. Failure groups apply to normal, high, flex, and extended redundancy disk groups. You can define the failure groups for each disk group when you create or alter the disk group. Refer to Oracle ASM Mirroring and Disk Group Redundancy for details on configuring failure groups.
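Redundancy has a direct cost in usable capacity. The following sketch estimates usable space for the three common redundancy types, assuming one copy for EXTERNAL, two-way mirroring for NORMAL, and three-way mirroring for HIGH. It ignores Oracle ASM metadata and any free space reserved for rebalancing, and the function name is hypothetical.

```python
# Number of copies of each extent kept by the common redundancy types
MIRROR_COPIES = {"EXTERNAL": 1, "NORMAL": 2, "HIGH": 3}


def usable_capacity_gib(raw_gib: float, redundancy: str) -> float:
    """Approximate usable capacity of a disk group for a redundancy type."""
    return raw_gib / MIRROR_COPIES[redundancy.upper()]


if __name__ == "__main__":
    for level in ("EXTERNAL", "NORMAL", "HIGH"):
        print(level, usable_capacity_gib(600, level))
```

Because EBS volumes are already replicated within their Availability Zone, as noted earlier in this post, weigh the extra EBS cost of Oracle ASM mirroring against the additional protection it provides.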
Placing Oracle ASM on top of EBS volumes groups the volumes into disk groups, and tablespaces created on an Oracle ASM disk group distribute their data across those EBS volumes. This provides more flexibility for storage management while creating new tablespaces or updating existing ones.
Although Oracle ASM provides features for volume management and performance, it might not be ideal for smaller workloads or for databases with mostly static data. This setup adds the overhead of installing and patching Grid Infrastructure (GI), which increases administrative work for a DBA. With Oracle ASM, direct access to data files is prohibited, which means that backups must be taken with RMAN. Therefore, for small or non-production databases, file systems on standard EBS volumes are often the more cost-effective choice.
Pattern 4: Cross-Region disaster recovery with Data Guard
A critical database with minimal RTO and RPO requirements can adopt a slightly complex but reliable architecture to provide high availability and data protection. Enterprises looking to have additional resiliency can replicate data using Data Guard to a different Region.
However, it’s important to understand the data transfer charges between Availability Zones and Regions while architecting highly available Oracle Database on Amazon EC2. Details regarding data transfer within the same AWS Region are highlighted in this document.
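A quick estimate of replication traffic helps put these charges in context. The sketch below multiplies an average redo generation rate by a per-GiB transfer price. The function name is hypothetical, the 30-day month is a simplification, and the price used in the example is a placeholder, not an AWS rate: look up the current data transfer pricing for your Regions before relying on any figure.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def monthly_redo_transfer_cost(avg_redo_mib_per_s: float,
                               price_per_gib: float) -> float:
    """Estimated monthly cost of shipping redo to one standby database."""
    gib_per_month = avg_redo_mib_per_s * SECONDS_PER_MONTH / 1024
    return gib_per_month * price_per_gib


if __name__ == "__main__":
    # Placeholder price of 0.01 per GiB, for illustration only
    print(round(monthly_redo_transfer_cost(1.0, 0.01), 2))
```

Run the estimate once per standby, because each standby database receives its own copy of the redo stream.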
The design as shown in the following figure involves having two standby databases configured using Data Guard. One has synchronous redo transport (SYNC), which can be configured with Maximum Protection or Maximum Availability mode in a different Availability Zone but within the same Region. The second standby database is configured with asynchronous redo transport (ASYNC) in Maximum Performance data protection mode. Refer to Oracle Data Guard Protection Modes for specifics on each mode.
Even though Availability Zones in AWS are isolated locations, the latency between them is minimal, which makes them well suited to Data Guard Maximum Protection or Maximum Availability modes. However, when replicating data to a different Region, Maximum Performance might be better suited because of network latency. You can choose any protection mode irrespective of the Region or Availability Zone, but the decision should include evaluating the network latency to avoid blocking I/O on the primary database.
This solution involves higher cost because of multiple standby instances across Availability Zones and Regions, and it is ideal for critical database workloads with minimal RTO and RPO requirements.
Backup considerations while running Oracle databases on Amazon EC2
The following are some important considerations for taking database backup and restore when running Oracle databases on Amazon EC2:
- When running Oracle databases on Amazon EC2, you can take database and schema backups using Oracle native tools like Data Pump and RMAN to satisfy data protection, disaster recovery, and compliance requirements.
- For taking backups, there are multiple strategies, the most common ones being:
  - Back up the database using RMAN onto Amazon EBS and then copy it over to Amazon S3.
  - Use AWS Backup to back up an EC2 instance followed by RMAN archive log backup to Amazon S3 using Oracle Secure Backup.
  - Use RMAN for database and archive log backup to Amazon Elastic File System (Amazon EFS).
- For long-term backup requirements for security, compliance, or business continuity needs, you can move the RMAN backups to Amazon S3 either programmatically or schedule them using AWS Backup.
- If you have a Data Guard standby instance, it’s always recommended to schedule backups from the standby database to offload the backup overhead away from the primary database. These backups can be used for disaster recovery or for creating a clone of the primary database for testing or development purposes.
Summary
In this post, we described multiple architecture patterns available for Oracle databases on Amazon EC2 to meet your resiliency requirements. It’s important to start with the RPO and RTO requirements for each database workload and then choose the most viable HA and DR option. We also discussed Amazon EBS and AWS Backup considerations when running these Oracle databases on Amazon EC2.
Share your feedback with us in the comments section.
About the Authors
Manash Kalita is a Senior Database Specialist Solutions Architect with Amazon Web Services. He works with AWS customers designing customer solutions on database projects, helping them migrate and modernize their existing databases to the AWS Cloud as well as orchestrate large-scale migrations in AWS.
Yamuna Palasamudram is a Principal Database Specialist Solutions Architect with Amazon Web Services. She works with AWS RDS team, focusing on commercial database engines like Oracle. She enjoys working with customers to help design, deploy, migrate, and optimize relational database workloads on AWS.
Viqash Adwani is a Senior Cloud Architect with Amazon Web Services. He works with internal and external Amazon customers to build secure, scalable, and resilient architectures in the AWS Cloud and help customers perform migrations from on-premises databases to Amazon RDS and Amazon Aurora databases.