
    Lakeflow: Revolutionizing SCD2 Pipelines with Change Data Capture (CDC)

    June 21, 2025

    Several breakthrough announcements emerged at DAIS 2025, but the Lakeflow updates around building robust pipelines had the most immediate impact on my current code. Specifically, I can now see a clear path to persisting SCD2 (Slowly Changing Dimension Type 2) tables in the silver layer from mutable data sources. If this sentence resonates with you, we share a common challenge. If not, it soon will.

    Maintaining history through Change Data Capture is critical for both AI and foundational use cases like Single View of the Customer. However, the current state of Delta Live Tables (DLT) pipelines only allows streaming tables to maintain SCD2 logic, while most data sources permit updates. Let’s dive into the technical challenges and how Lakeflow Connect is solving them.

    Slowly Changing Dimensions

    There are two options for managing changes: SCD1 and SCD2.

    1. SCD Type 1 keeps only the latest data. This approach overwrites old data with new data whenever a change occurs. No history of changes is kept, and only the latest version of the data is available. This is useful when the history of changes isn’t important, such as when correcting errors, updating non-critical fields like customer email addresses, or maintaining lookup tables.
    2. SCD Type 2 keeps the historical versions of data. This approach maintains a historical record of data changes by creating additional records to capture different versions of the data over time. Each version of the data is timestamped or tagged with metadata that allows users to trace when a change occurred. This is useful when it’s important to track the evolution of data, such as tracking customer address changes over time for analysis purposes.

    While basic operational reporting can get by with SCD1, almost any analytic approach benefits from history. ML models suffer from a lack of data, and AI models become more likely to hallucinate without it. Let’s look at a simple example.

    Monday Morning Dataset:

    id | name | state
    ---|------|------
    1  | John | NY
    2  | Jane | CA
    3  | Juan | PA

    Tuesday Update: John moves from New York to New Jersey.

    id | name | state
    ---|------|------
    1  | John | NJ
    2  | Jane | CA
    3  | Juan | PA
    • SCD1 Result: Overwrites John’s state, leaving only three records.
    • SCD2 Result: Retains John’s NY record and adds a new NJ record, resulting in four records.
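To make the difference concrete, here is a minimal pure-Python sketch of both strategies applied to Tuesday's change. This is illustrative logic only, not Databricks code; the function and field names are my own.

```python
from datetime import date

# Monday's dataset, keyed by the business id.
monday = [
    {"id": 1, "name": "John", "state": "NY"},
    {"id": 2, "name": "Jane", "state": "CA"},
    {"id": 3, "name": "Juan", "state": "PA"},
]

def scd1_update(records, change):
    """SCD1: overwrite the matching row in place; history is lost."""
    return [change if r["id"] == change["id"] else r for r in records]

def scd2_update(records, change, effective):
    """SCD2: close out the current version and append a new one."""
    out = []
    for r in records:
        if r["id"] == change["id"] and r.get("end_date") is None:
            out.append({**r, "end_date": effective})  # expire the old version
        else:
            out.append(r)
    out.append({**change, "start_date": effective, "end_date": None})
    return out

tuesday_change = {"id": 1, "name": "John", "state": "NJ"}

print(len(scd1_update(monday, tuesday_change)))                    # 3 rows
print(len(scd2_update(monday, tuesday_change, date(2025, 6, 17))))  # 4 rows
```

Under SCD2, John's NY row is retained with an end date, so his full lifecycle is queryable.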

    The important thing to understand here is that having John’s lifecycle is almost certainly valuable from an analytical perspective. The extra storage cost is negligible compared to the opportunity lost by simply overwriting the data. As a general rule in the medallion architecture, I like to keep SCD2 tables in the silver layer. However, DLT pipelines had some issues with this scenario.

    Challenges with the APPLY CHANGES API

    In the current state, SCD updates are managed through the APPLY CHANGES API. This API was more effective than Spark’s MERGE INTO statement. MERGE INTO is relatively straightforward until you start to factor in edge cases. For example, what if there are several updates to the same key in the same microbatch? What if the changes come in out of order? How do you handle DELETEs? Worse, how do you handle out-of-order DELETEs? However, APPLY CHANGES only worked for append-only data.
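All of these edge cases reduce to deterministically picking one winning change per key. Here is a rough pure-Python sketch of that sequencing logic, the kind of work APPLY CHANGES does for you (field names are illustrative):

```python
def latest_per_key(microbatch):
    """Collapse a microbatch to the highest-sequence change per key,
    so duplicate and out-of-order updates resolve deterministically."""
    winners = {}
    for change in microbatch:
        key = change["id"]
        if key not in winners or change["seq"] > winners[key]["seq"]:
            winners[key] = change
    # A surviving DELETE removes the key; everything else is an upsert.
    return {k: v for k, v in winners.items() if v["op"] != "DELETE"}

batch = [
    {"id": 1, "seq": 3, "op": "UPDATE", "state": "NJ"},
    {"id": 1, "seq": 2, "op": "UPDATE", "state": "NY"},  # arrived out of order
    {"id": 2, "seq": 5, "op": "DELETE"},
    {"id": 2, "seq": 4, "op": "UPDATE", "state": "CA"},  # superseded by the delete
]

print(latest_per_key(batch))  # only id 1 survives, with state NJ
```

Once each key has a single winner, the merge itself becomes straightforward; it is this preprocessing that MERGE INTO leaves to you.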

    In its current state, a DLT pipeline builds a Directed Acyclic Graph (DAG) of all the tables and views in the pipeline using only the metadata of those resources. Only the metadata. In many pipelines, the data from the source RDBMS has already been ingested into bronze and is refreshed daily. Let’s look at our sample dataset. On Monday, I run the DLT pipeline. While the pipeline is aware of the table’s metadata, it does not have access to its contents. Imagine a MERGE statement where no current records exist: everything is an insert. Now imagine processing the next day’s data. Again, since only the metadata is loaded into the DAG, APPLY CHANGES has no prior record of John. Effectively, only SCD1 tables can be created from mutable data sources, since the data is not loaded at this point.

    The new Lakeflow process provides a mechanism where CDC can be used with the Lakeflow Connector to drive SCD2 semantics even with mutable data.

    What is Change Data Capture (CDC)?

    Change Data Capture (CDC) is a data integration pattern that captures changes in a source system, such as inserts, updates, and deletes, through a CDC feed. The CDC feed stores a list of changes rather than the whole dataset, which is a performance opportunity. Most transactional databases, like SQL Server, Oracle, and MySQL, can generate CDC feeds automatically. When a row in the source table is updated, new rows are created in the CDC feed that contain only the changes, plus some metadata such as the operation type (UPDATE or DELETE) and a column that can be used to deterministically establish order, like a sequence number. There is also an update to APPLY CHANGES called AUTO CDC INTO.
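As a sketch of what this looks like in practice, here is a small pure-Python replay of a CDC feed into an SCD2-style history. Column names like `op` and `seq` are illustrative; real feed schemas vary by database.

```python
# A CDC feed carries only the changed rows, plus operation metadata
# and a sequencing column.
cdc_feed = [
    {"id": 1, "name": "John", "state": "NY", "op": "INSERT", "seq": 1},
    {"id": 1, "name": "John", "state": "NJ", "op": "UPDATE", "seq": 2},
    {"id": 2, "name": "Jane", "state": "CA", "op": "INSERT", "seq": 1},
    {"id": 2, "op": "DELETE", "seq": 2},
]

def replay_to_scd2(feed):
    """Replay a CDC feed in sequence order into an SCD2-style history:
    each change closes the prior version and, unless it is a DELETE,
    opens a new one."""
    history = []
    current = {}  # key -> index of the open row in history
    for change in sorted(feed, key=lambda c: (c["id"], c["seq"])):
        key = change["id"]
        if key in current:
            history[current.pop(key)]["end_seq"] = change["seq"]  # expire old version
        if change["op"] != "DELETE":
            row = {k: v for k, v in change.items() if k != "op"}
            history.append({**row, "end_seq": None})
            current[key] = len(history) - 1
    return history

hist = replay_to_scd2(cdc_feed)
print(len(hist))  # 3 versions: John NY, John NJ, Jane CA (closed by the delete)
```

Because every change carries its operation and sequence, the history can be rebuilt correctly even though only deltas, never the full table, flow through the feed.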

    AUTO CDC INTO

    There are actually two APIs: AUTO CDC and AUTO CDC FROM SNAPSHOT. They have the same syntax as APPLY CHANGES, but they can now correctly handle more use cases. You may have already guessed that AUTO CDC FROM SNAPSHOT has the same method signature as APPLY CHANGES FROM SNAPSHOT. However, the AUTO CDC API supports periodic ingestion of snapshots with each pipeline update. Since data, and not just metadata, is made available to the API, the call has sufficient information to correctly populate the SCD2 dataset.
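As a hedged sketch of what the SQL side might look like: the clause shape below mirrors the documented APPLY CHANGES INTO syntax with the new AUTO CDC keyword, and the table and column names are hypothetical. Check the current Lakeflow Declarative Pipelines documentation for the exact syntax.

```sql
-- Declare the SCD2 target, then a flow that feeds it from the CDC feed.
-- (Hypothetical names: silver_customers, bronze_customers_cdc, op, seq.)
CREATE OR REFRESH STREAMING TABLE silver_customers;

CREATE FLOW customers_cdc AS AUTO CDC INTO silver_customers
FROM stream(bronze_customers_cdc)
KEYS (id)
APPLY AS DELETE WHEN op = "DELETE"
SEQUENCE BY seq
STORED AS SCD TYPE 2;
```

The `STORED AS SCD TYPE 2` clause is what persists the full version history in the silver table, with `SEQUENCE BY` resolving out-of-order changes.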

    Conclusion

    Lakeflow Connect is a game-changer for data engineers, enabling SCD2 tables in the silver layer even with mutable data sources. By leveraging CDC and the new AUTO CDC INTO API, you can maintain historical data accurately, ensuring your AI and ML models have the context they need to perform optimally.

    The future of data engineering is here, and it’s built on Lakeflow Connect.

    Contact us to learn more about how to empower your teams with the right tools, processes, and training to unlock Databricks’ full potential across your enterprise.
