
    From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix

    August 21, 2025

    By Dao Mi, Pablo Delgado, Ryan Berti, Amanuel Kahsay, Obi-Ike Nwoke, Christopher Thrailkill, and Patricio Garza

    At Netflix, data engineering has always been a critical function, enabling the business to understand content, power recommendations, and drive decisions. Traditionally, the function centered on building robust tables and pipelines to capture facts, derive metrics, and provide well-modeled data products to our partners in analytics and data science. But as Netflix’s studio and content production have scaled, so too have the challenges, and the opportunities, of working with complex media data.

    Today, we’re excited to share how our team is formalizing a new specialization of data engineering at Netflix: Media ML Data Engineering. This evolution is embodied in our latest collaboration with our platform teams, the Media Data Lake, which is designed to harness the full potential of media assets (video, audio, subtitles, scripts, and more) and enable the latest advances in machine learning, including the latest transformer model architectures. As part of this initiative, we’re intentionally applying data engineering best practices, ensuring that our approach is both innovative and grounded in proven methodologies.

    The Evolution: From Traditional Tables to Media Tables

    Traditional data engineering at Netflix focused on building structured tables for metrics, dashboards, and data science models. These tables were primarily structured text or numerical fields, ideal for business intelligence, analytics and statistical modeling.

    However, the nature of media data is fundamentally different:

    • It’s multi-modal (video, audio, text, images).
    • It contains derived fields from media (embeddings, captions, transcriptions, etc.).
    • It’s unstructured and massive in scale when parsed out.
    • It’s deeply intertwined with creative workflows and business asset lineage.

    As our studio operations expanded, we saw the need for a new approach: one that could provide centralized, standardized, and scalable access to all types of media assets and their metadata for both analytical and machine learning workflows.

    The Rise of Media ML Data Engineering

    Enter Media ML Data Engineering — a new specialization at Netflix that bridges the gap between traditional data engineering and the unique demands of media-centric machine learning. This role sits at the intersection of data engineering, ML infrastructure, and media production. Our mission is to provide seamless access to media assets and derived data (including outputs from machine learning models) for researchers, data scientists, and other downstream data consumers.

    Key Responsibilities

    • Centralized Media Data Access: Building, cataloging, and maintaining the data and pipelines that populate the Media Data Lake, a data platform for storing and serving media assets and their metadata.
    • Asset Standardization: Standardizing media assets across modalities (video, images, audio, text) to ensure consistency and quality for ML applications in partnership with domain engineering teams.
    • Metadata Management: Unifying and enriching asset metadata, making it easier to track asset lineage, quality, and coverage.
    • ML-Ready Data: Exposing large corpora of assets for early-stage algorithm exploration, benchmarking, and productionization.
    • Collaboration: Partnering closely with domain experts, algorithm researchers, upstream content engineering teams and (machine learning & data) platform colleagues to ensure our data meets real-world needs.

    This new role is essential for bridging the gap between creative media workflows and the technical demands of cutting-edge ML.
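
As an illustration of the metadata management responsibility above, the following is a minimal sketch of a coverage check: for each modality, what fraction of assets already carries a given derived field. The record layout, the field names (`modality`, `embedding`), and the sample data are assumptions made for this example; they are not taken from the Media Data Lake.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def coverage_by_modality(
    assets: Iterable[Mapping[str, Any]], derived_field: str
) -> Dict[str, float]:
    """Return, per modality, the fraction of assets that carry `derived_field`."""
    totals: Dict[str, int] = defaultdict(int)
    covered: Dict[str, int] = defaultdict(int)
    for asset in assets:
        modality = asset["modality"]
        totals[modality] += 1
        # A field counts as covered only if it is present and not None.
        if asset.get(derived_field) is not None:
            covered[modality] += 1
    return {m: covered[m] / totals[m] for m in totals}


# Hypothetical sample records, invented for illustration only.
sample_assets = [
    {"asset_id": "a1", "modality": "video", "embedding": [0.1, 0.2]},
    {"asset_id": "a2", "modality": "video", "embedding": None},
    {"asset_id": "a3", "modality": "audio", "embedding": [0.3, 0.4]},
]

embedding_coverage = coverage_by_modality(sample_assets, "embedding")
print(embedding_coverage)
```

A report like this makes gaps visible, such as a modality where a model has not yet been run, before a dataset is handed to researchers.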

    Introducing the Media Data Lake

    To enable the next generation of media analytics and machine learning, we are building the Media Data Lake, a data lake designed specifically for Netflix’s media assets and built on LanceDB. We have partnered with our data platform team to integrate LanceDB into our Big Data Platform.

    Architecture and Key Components

    • Media Table: The core of the Media Data Lake, this structured dataset captures essential metadata and references to all media assets. It’s designed to be extensible, supporting both traditional metadata and outputs from ML models (including transformer-based embeddings, media understanding research and more).
    • Data Model: We are developing a robust data model to standardize how media assets and their attributes are represented, making it easier to query and join across schemas.
    • Data API: A Pythonic interface that will provide programmatic access to the Media Table, supporting both interactive exploration and automated workflows.
    • UI Components: Off-the-shelf interfaces enable teams to visually explore assets in the Media Data Lake, accelerating discovery and iteration for ICs.
    • Online and Offline System Architecture: Real-time access for lightweight queries and exploration of raw media assets; scalable large batch processing for ML training, benchmarking, and research.
    • Compute: A distributed layer capable of batch inference on GPUs and large-scale media data processing on CPUs.
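
To make the Media Table, Data Model, and Data API components more concrete, here is a small in-memory sketch. It is not Netflix's schema or API and it does not use LanceDB: the class names (`MediaAssetRow`, `MediaTable`), the methods, the `example_model` name, and the `s3://example-bucket/...` URIs are all hypothetical. It shows one way a row could hold an asset reference, lineage, and model outputs side by side.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MediaAssetRow:
    """One hypothetical Media Table row: a reference to an asset plus its metadata."""
    asset_id: str
    modality: str                           # e.g. "video", "audio", "text", "image"
    uri: str                                # pointer to the stored asset, not the bytes
    parent_asset_id: Optional[str] = None   # lineage: the asset this one was derived from
    metadata: Dict[str, str] = field(default_factory=dict)
    embeddings: Dict[str, List[float]] = field(default_factory=dict)  # keyed by model name


class MediaTable:
    """Toy in-memory stand-in for a Pythonic data API over the table."""

    def __init__(self) -> None:
        self._rows: Dict[str, MediaAssetRow] = {}

    def upsert(self, row: MediaAssetRow) -> None:
        self._rows[row.asset_id] = row

    def attach_embedding(self, asset_id: str, model_name: str, vector: List[float]) -> None:
        # Model outputs extend an existing row rather than living in a separate store.
        self._rows[asset_id].embeddings[model_name] = list(vector)

    def select(
        self, modality: Optional[str] = None, with_embedding: Optional[str] = None
    ) -> List[MediaAssetRow]:
        rows = list(self._rows.values())
        if modality is not None:
            rows = [r for r in rows if r.modality == modality]
        if with_embedding is not None:
            rows = [r for r in rows if with_embedding in r.embeddings]
        return rows

    def lineage(self, asset_id: str) -> List[str]:
        """Walk parent pointers from an asset back to its root, guarding against cycles."""
        chain: List[str] = []
        current: Optional[str] = asset_id
        while current is not None and current in self._rows and current not in chain:
            chain.append(current)
            current = self._rows[current].parent_asset_id
        return chain


table = MediaTable()
table.upsert(MediaAssetRow("title_1_video", "video", "s3://example-bucket/title_1.mp4"))
table.upsert(MediaAssetRow("title_1_shot_7", "video", "s3://example-bucket/title_1/shot_7.mp4",
                           parent_asset_id="title_1_video"))
table.upsert(MediaAssetRow("title_1_audio", "audio", "s3://example-bucket/title_1.wav",
                           parent_asset_id="title_1_video"))
table.attach_embedding("title_1_shot_7", "example_model", [0.12, 0.88, 0.05])

video_rows_with_embeddings = table.select(modality="video", with_embedding="example_model")
shot_lineage = table.lineage("title_1_shot_7")
print([r.asset_id for r in video_rows_with_embeddings], shot_lineage)
```

Keeping embeddings keyed by model name in the same row is one possible reading of "extensible": a new model adds a key rather than a new table. A production design would likely differ.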

    Starting Small with New Technology

    Our initial focus this past year has been on delivering a “data pond”: a mini-version of the Media Data Lake targeted at video/audio datasets for early-stage model training, evaluation, and research. All data for this phase comes from AMP, our internal asset management system and annotation store, and the scope is intentionally small so that we can build a solid, extensible foundation while introducing a new technology into the company. We are able to explore the raw media assets and build an intuitive understanding of the media via lightweight queries to AMP.

    Media Tables: The New Foundation for ML and Innovation

    One of the most exciting developments is the rise of media tables — structured datasets that not only capture traditional metadata, but also include the outputs of advanced ML models.

    These media tables power a range of innovative applications, such as:

    • Translation & Audio Quality Measures: Managing audio clips and features via text-to-speech models for engineering localization quality metrics.
    • Media Fidelity Restoration: Research on restoration of videos to HDR for remastering and other image technology use-cases.
    • Story Understanding and Content Embedding: Structuring narrative elements extracted from textual evidence and video of a title to increase operational efficiency in title launch preparation and ratings, e.g., detection of smoking, gore, and NSFW scenes in our titles.
    • Media Search: Leveraging multi-modal vector search to find similar keyframes, shots, and dialogue to facilitate research and experimentation.

    These tables, built on top of LanceDB, are designed to scale, support complex queries, and serve research as well as other data science and analytical needs.
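
The Media Search use case above relies on vector similarity. As a rough sketch of the idea, the snippet below ranks stored embeddings by cosine similarity to a query vector using a brute-force scan. This is a simplified stand-in written in plain Python, not the LanceDB API and not the indexing approach used in the Media Data Lake; the shot IDs and three-dimensional vectors are invented for the example.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: Sequence[float],
    items: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every item against the query and return the k best (id, score) pairs."""
    scored = [(item_id, cosine_similarity(query, vector)) for item_id, vector in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical keyframe embeddings; real embeddings have hundreds of dimensions.
keyframe_embeddings = [
    ("shot_001", [1.0, 0.0, 0.0]),
    ("shot_002", [0.9, 0.1, 0.0]),
    ("shot_003", [0.0, 1.0, 0.0]),
]

matches = top_k_similar([1.0, 0.0, 0.0], keyframe_embeddings, k=2)
print(matches)
```

A brute-force scan like this is linear in the number of assets, which is why large corpora typically use an approximate nearest-neighbor index instead.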

    The Human Side: New Roles and Collaboration

    Media ML Data Engineering is a team sport. Our data engineers partner with domain experts, data scientists, ML researchers, upstream business ops, and content engineering teams to ensure our data solutions are fit for purpose. We also work closely with our platform teams so that technological breakthroughs that are useful beyond our small corner of the universe can become horizontal abstractions that benefit the rest of Netflix. This collaborative model enables rapid iteration, high data quality, innovative use cases, and technology reuse.

    Looking Ahead

    The evolution from traditional data engineering to Media ML Data Engineering, anchored by our Media Data Lake, is unlocking new frontiers for Netflix:

    • Richer, more accurate ML models trained on high-quality, standardized media data.
    • Supercharged ML model evaluations through quick iteration cycles on the data.
    • Faster experimentation and productization of new AI-powered features.
    • Deeper insights into our content and creative workflows via metrics constructed from features inferred by media ML algorithms.

    As we continue to grow the Media Data Lake, be on the lookout for subsequent blog posts sharing our learnings and tools with the broader media ML and data engineering community.


    From Facts & Metrics to Media Machine Learning: Evolving the Data Engineering Function at Netflix was originally published in Netflix TechBlog on Medium.
