
    How to investigate the online vs offline performance for DNN models

    December 20, 2024

    Predictive model performance gap between offline evaluations and online inference is a common and persistent challenge in the ML industry, often preventing models from achieving their full business potential. At DoorDash, this issue is particularly critical for deep learning models, as it impacts multiple teams across domains. Bridging this gap is essential for maximizing the business value of these models.

The Ads Quality ML team encountered the same challenge over our last few ranking model iterations. In this blog, using our most recently launched model iterations as a case study, we walk through the debugging process and share a scalable methodology for investigating and resolving these discrepancies.

By adopting the solution proposed in this blog, we reduced the online-offline AUC gap from 4.3% to 0.76%.

    Our experience highlights critical areas such as feature serving consistency, feature freshness, and potential concerns when integrating real-time features for model serving. These insights can guide future efforts to improve offline and online performance alignment.

    Model development context

Restaurant Discovery Ads, the primary entry point for ads in the app, contributes the largest share of ad revenue. Key milestones since early 2023 are summarized in Fig-1. After evaluating various model architectures, we adopted the Multi-Task Multi-Label (MTML) architecture.

    Figure 1: Restaurant Discovery Ads Ranking Deep Learning Milestones

The goal for this milestone (M4), MTML V4, is to add more features to further improve model performance. At the time of the online-offline AUC gap investigation in the middle of this milestone, we had added more than 40 dense features on top of the existing ones.

| Feature Category | Feature Data Type | Description |
| Existing Features | Dense Features | Mostly consumer engagement features |
| Existing Features | Sequence Features | Consumer-engaged business/food/cuisine tag sequence & contextual features |
| Newly Added Features | Dense Features | Consumer promotion-related features; additional consumer engagement features |
Table 1: Feature details for the model under investigation

    Identification of the problem

    We developed a model using widely accepted offline training data construction rules, where impression data is joined with feature values from the previous day. This approach simulates the scenario of leveraging ‘yesterday’s’ feature values for ‘today’s’ model inference.
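To make this construction rule concrete, here is a minimal sketch of the -1d offline feature join. It assumes hypothetical pandas DataFrames `impressions` (consumer_id, event_date, label, ...) and `daily_features` (consumer_id, active_date, feature columns); the column names are illustrative, not our production schema.

```python
import pandas as pd

def build_training_set(impressions: pd.DataFrame,
                       daily_features: pd.DataFrame,
                       offset_days: int = 1) -> pd.DataFrame:
    """Join each impression with the feature snapshot from `offset_days` earlier."""
    joined = impressions.copy()
    # "Yesterday's" features for "today's" impressions.
    joined["feature_date"] = joined["event_date"] - pd.Timedelta(days=offset_days)
    return joined.merge(
        daily_features.rename(columns={"active_date": "feature_date"}),
        on=["consumer_id", "feature_date"],
        how="left",  # missing snapshots surface as NaN and are imputed later
    )
```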

    However, we observed a 4% decline in AUC during real-time online inference compared to offline evaluation. Notably, the online AUC was also much lower than the baseline achieved by the current production model. 

Thought process for root causing

    Begin with a hypothesis-driven approach

    We begin with a hypothesis-driven approach, where we design experiments to validate or invalidate each hypothesis. The initial hypotheses include:

    1. Feature Generation Disparity: This often arises when the offline feature pipeline does not mimic the online environment, leading to discrepancies in real-time predictions.
    2. Data Distribution Shift (Concept Drift): Changes in the underlying data distribution over time, such as seasonal changes in consumer behavior, could significantly impact model generalization. 
    3. Model Serving Instability: Issues with the model serving infrastructure, such as latency or incorrect model versions, might be affecting online performance.

    For each hypothesis, we conduct experiments, analyze the data, identify root causes, apply fixes, and redeploy the model. This iterative process continues until all performance gaps are resolved or new issues emerge.

    Design of experiment – offline replay

To test feature disparity, we regenerate the evaluation dataset using the same online impression traffic (from the shadow deployment) combined with the same offline feature join process. We then run model inference and evaluate AUC.

1. If the Replay AUC is similar to that from the previous offline evaluation, it confirms that the offline-to-online performance regression is due to feature disparities.
2. If the Replay AUC aligns with the shadow AUC but remains lower than the previous offline AUC, it indicates concept drift (see the sketch after this list).
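A minimal sketch of this decision logic, assuming hypothetical DataFrames `replay_df` (shadow impressions with offline-joined features) and `shadow_df` (the same impressions with online-logged features), plus a trained `model`; the names and tolerance are illustrative.

```python
from sklearn.metrics import roc_auc_score

def diagnose(model, replay_df, shadow_df, feature_cols, prior_offline_auc, tol=0.005):
    """Attribute the online-offline gap to feature disparity or concept drift."""
    replay_auc = roc_auc_score(replay_df["label"], model.predict(replay_df[feature_cols]))
    shadow_auc = roc_auc_score(shadow_df["label"], model.predict(shadow_df[feature_cols]))

    if abs(replay_auc - prior_offline_auc) < tol and replay_auc > shadow_auc:
        return "feature disparity"  # same impressions, only the features differ
    if abs(replay_auc - shadow_auc) < tol and replay_auc < prior_offline_auc:
        return "concept drift"      # gap persists even with offline-joined features
    return "inconclusive"
```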

    Key results – AUC benchmark

| Eval Date Range | Model | Feature Generation | Online vs Offline | AUC |
| One week's data in September | Baseline (prod model) | Online logging | Online | Baseline value |
| One week's data in September | New model | Online logging | Online shadow | -1.80% |
| One week's data in September | New model | -1d offline join for newly added features | Offline replay | +2.05% |
| One week's data in June | Baseline (prod model) | Online logging | Offline | +0.77% |
| One week's data in June | New model | -1d offline join for newly added features | Offline | +2.105% |
Table 2: AUC benchmarks for offline replay
    • When the model was first trained offline, its AUC on the eval set showed a 2.1% improvement over the baseline value; when shadowed online during the week of 09/09, it showed a 1.8% decrease.
    • During the shadowing, we did not see any obvious outage in the logging service, so we can tentatively rule out hypothesis #3 (serving instability).
    • Replaying the evaluation offline on the shadow impressions, with the same feature generation process as the training data, yields a 2.05% AUC improvement, which is very close to the original 2.1%.
    • The above evidence suggests the main culprit is feature disparity.

    In the following section, we dive deeper to understand the feature disparities.

    Feature disparity investigation

    There are two potential root causes for such online and offline feature disparities: 

    Feature staleness vs cached residuals

Feature Staleness occurs when the most recent feature values are not available during serving. It is primarily due to at least -1d or -2d delays in the feature pipeline, with minor additional delays because feature uploads occur a few hours after the data is ready.

    Cached Residuals occur when feature values are null or no longer available after the most recent pipeline run. Since the Online Feature Store only overrides existing keys or adds new ones without evicting the old entries, outdated data can persist.
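The snippet below is a toy illustration (not the actual Online Feature Store API) of how an upsert-only store produces cached residuals: keys that drop out of the latest pipeline run keep serving their last-written value.

```python
store = {}  # toy online feature store: key -> (value, active_date)

def daily_upload(snapshot: dict, active_date: str) -> None:
    """Upsert today's snapshot; nothing is ever evicted."""
    for key, value in snapshot.items():
        store[key] = (value, active_date)

daily_upload({"consumer_1|store_9": 0.42}, "2024-09-01")
daily_upload({}, "2024-09-02")  # pair had no recent engagement, so the key is absent

# Online serving still fetches the 09-01 value (a cached residual), while the
# offline -1d join for 09-02 would treat the feature as missing and impute it.
print(store["consumer_1|store_9"])  # (0.42, '2024-09-01')
```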

To gain further insight, we conduct a deep dive into the most important newly added features to better understand the feature serving dynamics.

    Challenges

Both of these cases are difficult to address perfectly with offline feature joining logic, because:

    • For Feature Staleness, the SLA varies across features due to differences in pipeline implementations, processed data volume, availability of compute resources, feature uploading velocity, etc.
    • For Cached Residuals, it is unknown how many of the existing feature values in the Online Feature Store are obsolete, which makes it hard to tell which entities suffer from this issue the most.

    Key insights

    • Feature staleness – The current -1d offset for the offline feature join is a very aggressive choice. Analysis of online logging shows that most feature values used for a given prediction came from the consumer's engagement 2-3d earlier, indicating a long SLA from data generation to feature availability.
    • Cached residuals – We assume that any feature values older than 4 days are likely long-lived historical values remaining in the Feature Store. This results in a lower missing rate during online serving and is a key factor contributing to feature disparity.
    • Ubiquity: Both Feature Staleness and Cached Residuals impact most features, though the severity of these issues varies depending on the feature.
    • AUC Gaps: By generating simulated data with longer feature active_date offsets (from 1d to 3d/4d), we observed a reduced AUC gap, validating the impact of these issues.

    Cached residuals

    Features most impacted by Cached Residuals are likely to have these characteristics:

    • High cardinality (such as cross features at consumer levels) and volatile values (e.g., values change frequently from day to day).
    • Aggregated over short time windows (e.g., over the past 1 day or 7 days), making them more susceptible to outdated values being served from the cache.

For the 10 most important new features, we list their most commonly observed staleness and the percentage of cached residuals fetched during online serving (a sketch of how these statistics can be estimated follows Table 3):

| Feature Name | Feature Aggregation Level | Feature Aggregation Time Window | Feature Staleness | Feature Missing Rate | % of Cached Residuals |
| Feature 1 | Consumer level | Past 1 year | -3d/-4d | 2.77% | 1.15% |
| Feature 2 | <Consumer, Store> level | Past 3 months | -3d | 76.20% | 45.6% |
| Feature 3 | <Consumer, Store> level | Past half-year | -3d | 70.19% | 23.6% |
| Feature 4 | Consumer level | Past 1 year | -3d/-4d | 2.77% | 6.07% |
| Feature 5 | Consumer level | Past 3 months | -3d | 45.18% | 4.56% |
| Feature 6 | Consumer level | Past 3 months | -3d | 76.19% | 31.0% |
| Feature 7 | Consumer level | Past 1 month | -2d | 76.19% | 31.0% |
| Feature 8 | Store level | Past 3 months | -3d | 0.92% | 34.9% |
| Feature 9 | Consumer level | Past 3 months | -2d | 45.18% | 4.55% |
| Feature 10 | Store level | Past 1 day | -3d | 23.50% | 24.7% |
Table 3: Cached residuals of the 10 most important added features
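These statistics can be approximated from online feature logging and daily offline snapshots. The sketch below shows one way to estimate them; `served_rows` (entity_id, request_date, value) and the `snapshots` dict keyed by snapshot date are hypothetical inputs, and the 4-day residual threshold follows the assumption stated above.

```python
from datetime import timedelta

def estimate_staleness(served_row, snapshots, max_lag_days=7):
    """Smallest lag (in days) whose offline snapshot matches the served value,
    or None if nothing in the window matches (a likely cached residual)."""
    for lag in range(1, max_lag_days + 1):
        snap = snapshots.get(served_row.request_date - timedelta(days=lag), {})
        if snap.get(served_row.entity_id) == served_row.value:
            return lag
    return None

def cached_residual_rate(served_rows, snapshots, residual_lag=4):
    """Fraction of served values older than `residual_lag` days (or unmatched)."""
    lags = [estimate_staleness(r, snapshots) for r in served_rows]
    stale = sum(1 for lag in lags if lag is None or lag > residual_lag)
    return stale / len(served_rows)
```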

    Online vs offline feature distribution

    For served features with a higher concentration of cached residuals, we observed a noticeably lower missing rate online compared to offline. This discrepancy is reflected in the misalignment of feature distributions between online and offline, as seen in Fig-2. In the examples below, missing values for both features are imputed as 0, leading to pronounced peaks around 0 in the distributions.

    Figure 2: Feature Distribution Online (red) vs Offline (blue)
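A sketch of the comparison behind Figure 2, assuming two hypothetical arrays holding the same feature as logged online and as produced by the offline join (missing values already imputed as 0 in both); a large KS statistic flags the kind of misalignment shown above.

```python
import numpy as np
from scipy.stats import ks_2samp

def compare_distributions(online_values: np.ndarray, offline_values: np.ndarray) -> dict:
    """Quantify online vs. offline drift for a single feature."""
    stat, p_value = ks_2samp(online_values, offline_values)
    return {
        "online_missing_rate": float(np.mean(online_values == 0)),   # imputed zeros
        "offline_missing_rate": float(np.mean(offline_values == 0)),
        "ks_statistic": float(stat),
        "p_value": float(p_value),
    }
```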

    Feature staleness

    Feature-to-feature variations

The actual staleness varies across features due to differences in pipeline implementations, processed data volume, availability of compute resources, and feature uploading velocity. Below are two examples that illustrate such variation.

    • Feature 1: -3d to -4d delay
    • Feature 10: -1d delay
    Figure 3: Feature Staleness

    * Disclaimer: Figure 3 may underestimate feature staleness, as small day-to-day feature value changes make it hard to pinpoint the exact feature uploading date. This data should be taken with a grain of salt.

    Day-to-day feature value change

For most features, fewer than 10% of feature values differ from the previous day, while features with short aggregation time windows show more than a 35% difference (a sketch of this computation follows Table 4).

| Feature Name | Feature Aggregation Level | Feature Aggregation Time Window | % of Entities Change** | % of Feature Value Mismatch*** |
| Feature 1 | Consumer level | Past 1 year | 0.0258% | 9.69% |
| Feature 2 | <Consumer, Store> level | Past 3 months | 1.04% | 5.2% |
| Feature 3 | <Consumer, Store> level | Past half-year | 3.36% | 11.3% |
| Feature 4 | Consumer level | Past 1 year | 0.0258% | 9.68% |
| Feature 5 | Consumer level | Past 3 months | 1.0% | 8.37% |
| Feature 6 | Consumer level | Past 3 months | 1.04% | 4.66% |
| Feature 7 | Consumer level | Past 1 month | 1.04% | 4.66% |
| Feature 8 | Store level | Past 3 months | 0.09% | 4.38% |
| Feature 9 | Consumer level | Past 3 months | 1.04% | 4.78% |
| Feature 10 | Store level | Past 1 day | 0.436% | 35.7% |
Table 4: Day-to-day feature value changes of the 10 most important added features
** The percentage of entity_ids that did not appear in the previous day's data.
*** The percentage of feature values that differ from the previous day.
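A sketch of how the Table 4 statistics could be computed, assuming two hypothetical dicts mapping entity_id to feature value for consecutive snapshot dates.

```python
def day_over_day_change(today: dict, yesterday: dict) -> dict:
    """Compute the two percentages reported in Table 4 for one feature."""
    new_entities = [k for k in today if k not in yesterday]        # ** entities change
    shared = [k for k in today if k in yesterday]
    mismatched = [k for k in shared if today[k] != yesterday[k]]   # *** value mismatch
    return {
        "pct_entities_changed": 100.0 * len(new_entities) / len(today),
        "pct_value_mismatch": 100.0 * len(mismatched) / len(shared),
    }
```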

    Validate hypothesis and close the loop

To summarize the above investigation and close the hypothesis validation loop, we build new training and evaluation datasets and run evaluations. For the features inherited from the current model, we continued to use the real-time serving logged values in all the training sets.

    • Rebuild 4 datasets by joining impression data with feature values offset by minus 1/2/3/4 days, for both training and evaluation, and train 4 models respectively.
    • Evaluate the 4 models on 2 datasets (whose date ranges differ), as sketched after this list:
      • Offline Evaluation dataset – the eval dataset joined with the same feature offset dates as training.
      • Shadow Log dataset – all feature values are logged values from online real-time model serving.
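A minimal sketch of this procedure, reusing the `build_training_set` helper sketched earlier; `train_model`, the evaluation DataFrames, and `feature_cols` are illustrative placeholders rather than our production training stack.

```python
from sklearn.metrics import roc_auc_score

def close_the_loop(impressions, daily_features, offline_eval_sets,
                   shadow_log_set, feature_cols, train_model):
    """Train one model per feature offset and score it on both evaluation sets."""
    results = {}
    for offset in (1, 2, 3, 4):
        model = train_model(build_training_set(impressions, daily_features, offset))
        eval_df = offline_eval_sets[offset]  # joined with the same offset as training
        results[offset] = {
            "offline_auc": roc_auc_score(eval_df["label"],
                                         model.predict(eval_df[feature_cols])),
            "shadow_auc": roc_auc_score(shadow_log_set["label"],
                                        model.predict(shadow_log_set[feature_cols])),
        }
    return results
```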

    Figure 4: AUC Relative Changes on models trained by -1/2/3/4 day feature offsets

Tracing offline AUC against feature offset days (i.e., staleness) in Fig-4 above suggests that model performance degrades as feature freshness decreases, highlighting the importance of timely feature updates in maintaining optimal model accuracy.

The most significant AUC drop occurs when moving from a 1-day delay to a 2-day delay. We use offline AUC rather than online AUC as the benchmark in order to rule out the impact of feature disparity.

    Proposed solutions

Short-term: Generate evaluation sets with different feature offsets (e.g., -2d, -3d, -4d) and select the offset whose offline AUC is closest to the production AUC. Use this offset to create the training data and build the model.
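Continuing the sketch above, the short-term rule amounts to picking the offset whose offline evaluation AUC best matches production; `results` follows the shape returned by `close_the_loop`, and `production_auc` is an assumed measured value.

```python
def pick_offset(results: dict, production_auc: float) -> int:
    """Select the feature offset whose offline AUC best matches production."""
    return min(results, key=lambda d: abs(results[d]["offline_auc"] - production_auc))
```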

    Long-term: Enable online logging for new features. However, there’s a clear trade-off between development speed and data accuracy, which needs careful consideration during the project planning stage.

| Solution | Pros | Cons |
| Short-term | Reduces AUC discrepancy immediately | Does not address cached residuals |
| Long-term | Effectively resolves both cached residuals and feature staleness; improves model generalization | Trades off development speed against data accuracy; requires system stability improvements to support feature logging at larger traffic volumes |
Table 5: Comparison between short-term and long-term solutions

    Experiment result and conclusion

By adopting the short-term solution proposed in this blog, we reduced the online-offline AUC gap from 4.3% to 0.76% for our latest Restaurant Discovery Ads ranking deep learning model iteration. Combined with other feature improvements, this iteration achieved the largest business gain among Ads ranking model iterations this year.

    This investigation not only resolved immediate performance gaps but also highlighted the importance of feature alignment in real-time systems. The methodology developed here can serve as a blueprint for addressing similar challenges across other domains. Moving forward, adopting robust logging systems and scaling feature pipelines will ensure that our models continue to drive impactful business outcomes.

    Other thoughts

    Why was the online vs offline gap not as significant before 2023?

    Scale of the Business: The ads business has grown 3-5x over the past year, leading to increased data volume and feature complexity, which has amplified the impact of feature staleness and cached residuals.

    Model Architecture: Previously, tree-based models were less sensitive to feature disparities because they bucketize values into leaf nodes, meaning small differences often have minimal impact. However, DNNs, being parametric models, are much more sensitive to precise feature values, where even slight deviations can affect the model’s output. This shift to DNNs has made the online vs. offline gap more significant.
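A toy illustration (not our production models) of this sensitivity difference: a single-split "tree" returns the same score for a slightly stale and a fresh feature value when both land in the same bucket, while a single sigmoid unit shifts its score for any perturbation.

```python
import math

def tree_score(x: float) -> float:
    return 0.9 if x >= 0.3 else 0.1  # one split: values are bucketized

def sigmoid_score(x: float, w: float = 8.0, b: float = -2.4) -> float:
    return 1.0 / (1.0 + math.exp(-(w * x + b)))  # one parametric unit

for x in (0.48, 0.52):  # stale vs. fresh value of the same feature
    print(x, tree_score(x), round(sigmoid_score(x), 3))
# The tree output is identical for both; the sigmoid output moves with the drift.
```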

    Acknowledgments

    Special thanks to Hebo Yang, Mark Yang, Ferras Hamad, Vasily Vlasov and Arvind Kalyan from Machine Learning Platform for all the collaborations on Ads User Engagement Prediction Modeling. 

Interested in solving these kinds of problems? Come join us!
