
    Real-time Iceberg ingestion with AWS DMS

    June 4, 2025

    This is a guest post by Caius Brindescu, Principal Engineer at Etleap, in partnership with AWS.

    Timely decision-making depends on low-latency access to the freshest data. But for many teams, reducing latency into a data lake means wrestling with update handling, pipeline complexity, and warehouse integration. Apache Iceberg changes that equation by making real-time ingestion and multi-engine access practical and efficient.

    Etleap is an AWS Advanced Technology Partner with the AWS Data & Analytics Competency and Amazon Redshift Service Ready designation. In this post, we show how Etleap helps you build scalable, near real-time pipelines that stream data from operational SQL databases into Iceberg tables using AWS Database Migration Service (AWS DMS). You can use AWS DMS as a robust and configurable solution for change data capture (CDC) from all major databases into AWS.

    Drawing from real-world use cases, we show how we provide consistency, fault tolerance, and integration with cloud data warehouses, so your data can drive action when it matters most.

    Introduction to AWS DMS

    AWS DMS is a cloud service that simplifies the migration of relational databases, data warehouses, NoSQL databases, and other types of data stores. You can use AWS DMS to migrate your data into the AWS Cloud or between combinations of cloud and on-premises setups.

    With AWS DMS, you can perform one-time migrations, and you can replicate ongoing changes to keep sources and targets in sync in near real time. There might be a small delay between when a change is made in the source and when it’s replicated to the target. This delay, or latency, can be influenced by various factors, including AWS DMS settings, network bandwidth, and the load on both the source and target databases.

    The following diagram illustrates the AWS DMS replication process.

    Introduction to Iceberg

    Iceberg is an open table format designed for large-scale analytics on data lakes. It brings full support for ACID transactions, hidden partitioning, schema evolution, and time travel—without requiring expensive table rewrites or downtime.
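    As a quick illustration of hidden partitioning and schema evolution, the following Spark SQL sketch creates a hypothetical Iceberg table partitioned by a transform of a timestamp column, then adds a column as a metadata-only change. The catalog, table, and column names are illustrative, not taken from Etleap’s pipelines:

    -- Hidden partitioning: queries filter on updated_at directly, not on a separate partition column
    CREATE TABLE glue_catalog.iceberg_lake.product_order (
        order_id    BIGINT,
        customer_id BIGINT,
        status      STRING,
        updated_at  TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(updated_at));

    -- Schema evolution is a metadata operation; no data files are rewritten
    ALTER TABLE glue_catalog.iceberg_lake.product_order ADD COLUMN discount_pct DOUBLE;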

    Unlike older formats like Hive, Iceberg tracks metadata separately from data files, enabling fast snapshot reads and efficient file-level updates. Data files are immutable, and changes are applied by creating new snapshots, making operations like upserts and deletes safe and performant. The following diagram illustrates the file layout for an Iceberg table.

    Iceberg is an engine-agnostic table format that integrates with Spark, Apache Flink, Trino (including services like Amazon Athena), and Hive. It’s a strong foundation for building production-grade, low-latency data lake architectures that can serve multiple downstream systems reliably. Because every engine resolves table state through the catalog, they all see a consistent view of the stored data.
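    For example, once a table is registered in the AWS Glue Data Catalog, an engine such as Athena can query it directly, including reading an earlier snapshot through time travel. A hedged sketch with illustrative names:

    -- Current state of the table, resolved through the Glue catalog
    SELECT order_id, status, updated_at
    FROM iceberg_lake.product_order
    WHERE updated_at > current_timestamp - interval '1' hour;

    -- Time travel: read the table as it existed at an earlier point in time
    SELECT count(*)
    FROM iceberg_lake.product_order
    FOR TIMESTAMP AS OF TIMESTAMP '2025-06-01 00:00:00 UTC';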

    Why Etleap customers need Iceberg

    For Etleap customers, low latency is a top priority. We define latency as the time lag between the source and destination data. For example, if the destination is 5 minutes behind the source, then latency is 5 minutes.

    Traditional data warehouses and data lakes impose limits on how low you can drive latency. In warehouses, each load lands new data in a staging area and then runs a merge that scans much of the target table before rewriting it. Touching the data twice (once to read, once to rewrite) is what adds the bulk of the delay.

    Data lakes, on the other hand, are typically append-only and don’t handle updates efficiently. You can either store all row versions and resolve the latest at query time—sacrificing query performance—or precompute the latest state at load time, which requires expensive table rewrites and adds latency.

    Iceberg solves both of these challenges for Etleap users. In one case, a European bike-sharing company uses an operational dashboard to monitor bike rack availability and comply with regulations that limit how long a rack can remain empty. By reducing pipeline latency from 5–10 minutes to just a few seconds, Iceberg keeps the dashboard close to real time, giving operators critical extra time to rebalance racks and avoid penalties.

    In another case, a team needed to make a single dataset available across multiple data warehouses. Previously, this meant building and maintaining separate pipelines for each destination. With Iceberg, they can load data one time and support querying from multiple engines, reducing complexity and providing consistency across tools.

    Solution overview

    The first step is extracting updates from source databases as quickly as possible. For this, we use AWS DMS, which scales reliably for high-throughput change data capture (CDC). AWS DMS writes changes to Amazon Kinesis Data Streams. Etleap reads the data stream and processes the changes using Flink on Amazon EMR. From there, data is written to Iceberg tables stored in Amazon S3, with AWS Glue as the data catalog. This allows the data to be queried from multiple engines. The following diagram illustrates the pipeline architecture.
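    As a rough sketch of the ingestion side of this architecture (not Etleap’s production implementation), a Flink SQL job could read the change records that AWS DMS writes to the Kinesis data stream and append them to a Glue-cataloged Iceberg table. Stream, bucket, and column names are placeholders, connector options vary by Flink connector version, and the sketch omits the update and delete handling the real pipeline performs:

    -- Register the AWS Glue Data Catalog as an Iceberg catalog
    CREATE CATALOG glue_catalog WITH (
        'type' = 'iceberg',
        'catalog-impl' = 'org.apache.iceberg.aws.glue.GlueCatalog',
        'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
        'warehouse' = 's3://<bucket_name>/iceberg'
    );

    -- Source table over the Kinesis stream that receives AWS DMS change records
    CREATE TABLE dms_changes (
        order_id    BIGINT,
        customer_id BIGINT,
        status      STRING,
        updated_at  TIMESTAMP(3)
    ) WITH (
        'connector' = 'kinesis',
        'stream' = 'etleap-dms-cdc',
        'aws.region' = 'us-east-1',
        'scan.stream.initpos' = 'LATEST',
        'format' = 'json'
    );

    -- Continuously write the change records into the Iceberg table
    INSERT INTO glue_catalog.iceberg_lake.product_order
    SELECT order_id, customer_id, status, updated_at
    FROM dms_changes;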

    Exactly-once processing

    To maintain data integrity in low-latency pipelines, it’s not enough to move data quickly—it must also be processed exactly once. Duplicate or missing records can lead to incorrect results, especially in update-heavy workloads. After changes are captured by AWS DMS and streamed into Kinesis Data Streams, Etleap provides exactly-once processing using Flink’s two-phase commit protocol.

    In phase one, Flink injects a checkpoint barrier into each parallel stream of data (see the following figure). As the barrier flows through the pipeline, each operator pauses, synchronously saves its state, and then forwards the barrier downstream. After that, it asynchronously persists the state snapshot to durable storage—in our case, Amazon S3. This separation makes sure critical processing isn’t blocked while waiting on I/O. After the operators have successfully saved their state, they notify the Flink Job Manager that they’ve completed the checkpoint.

    In phase two, the Job Manager confirms that all parts of the pipeline have completed the checkpoint, then broadcasts a global commit signal (see the following diagram). At this point, operators can safely perform side effects, such as writing data to external systems. In Etleap’s case, this includes committing a new snapshot to the Iceberg table and recording metrics for monitoring.

    To support fault tolerance, commit operations are implemented to be idempotent. This applies both to writing a new snapshot to the Iceberg table and recording pipeline metrics and logs. If the pipeline fails during this stage and needs to restart, Flink will safely reattempt the commit. Because each operation can run multiple times without producing duplicates or inconsistencies, we maintain data correctness even in the face of failure.

    The final piece of the puzzle is checkpoint frequency—how often the pipeline commits data to Iceberg. This setting plays a critical role in determining overall pipeline latency. After evaluating our use cases, we’ve chosen a checkpoint interval of 3 seconds. This strikes an effective balance: it keeps end-to-end latency under 5 seconds while minimizing the performance overhead introduced by frequent commits. This interval also aligns well with the warehouse metadata refresh cycle, as discussed in the following section.
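    In a Flink SQL job, this checkpointing behavior is configuration rather than code. A hedged example of the relevant settings (exact keys can differ slightly across Flink versions):

    -- Exactly-once checkpointing every 3 seconds, with state snapshots persisted to Amazon S3
    SET 'execution.checkpointing.mode' = 'EXACTLY_ONCE';
    SET 'execution.checkpointing.interval' = '3s';
    SET 'state.checkpoints.dir' = 's3://<bucket_name>/flink-checkpoints';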

    Table maintenance and warehouse integration

    Over time, Iceberg tables accumulate metadata, snapshots, and suboptimal file layouts that can degrade performance. To keep query performance high and storage efficient, regular maintenance is essential. This includes tasks like data file compaction, snapshot expiration, and metadata cleanup. Etleap handles these maintenance tasks automatically, without interrupting ingestion. Maintenance jobs run in parallel with data pipelines and are triggered based on heuristics like file size distribution and snapshot count.

    The following screenshot shows an Etleap pipeline running maintenance activities in parallel, without interrupting the flow of data.
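    The same operations are exposed as stored procedures by engines such as Spark. A hedged sketch of what compaction, snapshot expiration, and orphan-file cleanup look like when run by hand (Etleap triggers its own equivalents automatically), using illustrative catalog and table names:

    -- Compact small data files into larger ones for better scan performance
    CALL glue_catalog.system.rewrite_data_files(table => 'iceberg_lake.product_order');

    -- Expire old snapshots to keep table metadata small
    CALL glue_catalog.system.expire_snapshots(
        table => 'iceberg_lake.product_order',
        older_than => TIMESTAMP '2025-05-28 00:00:00'
    );

    -- Remove data files that are no longer referenced by any snapshot
    CALL glue_catalog.system.remove_orphan_files(table => 'iceberg_lake.product_order');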

    The final piece is warehouse integration. One of Iceberg’s key advantages is interoperability: the same table can be queried from engines like Athena, Amazon Redshift, Snowflake, and Databricks.

    Although manual setup is possible, Etleap can configure these integrations automatically when a pipeline is created. For example, to make an Iceberg table queryable in Snowflake, we create a catalog integration with AWS Glue and define an external volume pointing to the table’s Amazon S3 location:

    CREATE CATALOG INTEGRATION ETLEAP_ICEBERG_CATALOG
    CATALOG_SOURCE = GLUE
    GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/etleap-streaming-ingestion-snowflake'
    GLUE_CATALOG_ID = '123456789012'
    CATALOG_NAMESPACE = 'iceberg_lake'
    GLUE_REGION = 'us-east-1'
    TABLE_FORMAT = ICEBERG
    ENABLED = TRUE;
    CREATE EXTERNAL VOLUME ETLEAP_ICEBERG_EXTERNAL_VOLUME
    STORAGE_LOCATIONS = ((
    	NAME = 'etleap_iceberg_bucket'
    	STORAGE_PROVIDER = 'S3'
    	STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/etleap-streaming-ingestion-snowflake'
    	STORAGE_BASE_URL = 's3://<bucket_name>/iceberg'
    ))
    ALLOW_WRITES = FALSE;

    We can then create a new table in Snowflake that points to the Iceberg table:

    CREATE OR REPLACE ICEBERG TABLE "product_order"
    EXTERNAL_VOLUME = ETLEAP_ICEBERG_EXTERNAL_VOLUME
    CATALOG = ETLEAP_ICEBERG_CATALOG
    CATALOG_NAMESPACE = 'iceberg_lake'
    CATALOG_TABLE_NAME = 'product_order';

    To keep Snowflake in sync with the latest Iceberg snapshots, Etleap triggers a REFRESH operation after each commit. This makes sure users see the freshest data and prevents the view in Snowflake from drifting too far behind. The synchronous nature of the refresh also provides a natural rate-limiting mechanism, aligning snapshot visibility with warehouse query performance.
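    The refresh itself is a single statement against the externally managed Iceberg table, along these lines:

    ALTER ICEBERG TABLE "product_order" REFRESH;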

    Conclusion

    Building low-latency, reliable data pipelines into Iceberg is achievable using AWS tools like AWS DMS, Kinesis Data Streams, and Amazon EMR, combined with Iceberg’s support for updates, schema evolution, and multi-engine access. In this post, we showed how to stream changes from operational databases into Iceberg with end-to-end latencies under 5 seconds, while preserving data integrity and supporting downstream analytics in tools like Snowflake. This architecture offers a powerful foundation for teams looking to modernize their data lakes, unify access across engines, and meet real-time operational requirements.

    To see how Etleap makes this possible out of the box, including automatic pipeline creation, fault-tolerant processing, and Iceberg maintenance, sign up for a personalized demo.


    Caius Brindescu

    Caius is a Principal Engineer at Etleap with 20+ years of experience in software development. He specializes in Java backend development and big data technologies, including systems like Hadoop, Spark, and Flink. He holds a PhD from Oregon State University and the AWS Certified Big Data – Specialty certification.

    Mahesh Kansara

    Mahesh is a database engineering manager at Amazon Web Services. He works closely with development and engineering teams to improve migration and replication services. He also works with our customers to provide guidance and technical assistance on various database and analytics projects, helping them improve the value of their solutions when using AWS.
