    Microsoft Released SuperBench: A Groundbreaking Proactive Validation System to Enhance Cloud AI Infrastructure Reliability and Mitigate Hidden Performance Degradations

    August 20, 2024

    Cloud AI infrastructure is vital to modern technology, providing the backbone for various AI workloads and services. Ensuring the reliability of these infrastructures is crucial, as any failure can lead to widespread disruption, particularly in large-scale distributed systems where AI workloads are synchronized across numerous nodes. This synchronization means that a failure in one node can have cascading effects, magnifying the impact and causing significant downtime or performance degradation. The complexity and scale of these systems make it essential to have robust mechanisms in place to maintain their smooth operation and minimize incidents that could affect the quality of service provided to users.

    One of the primary challenges in maintaining cloud AI infrastructure is addressing hidden degradations that are masked by hardware redundancies. These subtle failures, often termed “gray failures,” do not cause immediate, catastrophic problems but gradually degrade performance over time. They are particularly problematic because conventional monitoring tools, which are typically designed to identify more apparent binary failure states, do not easily detect them. The insidious nature of gray failures also complicates root cause analysis, making it difficult for cloud providers to identify and rectify the underlying problems before they escalate into more significant issues that could impact the entire system.

    Cloud providers have traditionally relied on hardware redundancies to ensure system reliability. Redundant components, such as extra GPU compute units or over-provisioned networking links, are intended to act as fail-safes. However, these redundancies can inadvertently introduce their own set of problems: as redundant components are lost over time, the system keeps running while its performance gradually degrades. For example, in Azure A100 clusters, where InfiniBand top-of-rack (ToR) switches have multiple redundant uplinks, the loss of some of these links can lead to throughput regression, particularly under certain traffic patterns. This kind of gradual degradation often goes unnoticed until it significantly impacts AI workloads, at which point it is much more challenging to address.
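    To make the uplink example concrete, here is a minimal sketch of why losing redundant links can stay invisible under light traffic yet cause a regression under heavy traffic. The link count, per-link bandwidth, and traffic figures are illustrative assumptions, not values from the article or from Azure, and the model ignores routing and congestion effects.

```python
def achievable_gbps(total_uplinks: int, failed_uplinks: int,
                    link_gbps: float, offered_gbps: float) -> float:
    """Throughput a ToR switch can carry: offered load capped by healthy uplink capacity."""
    healthy = total_uplinks - failed_uplinks
    if healthy < 0:
        raise ValueError("failed_uplinks cannot exceed total_uplinks")
    return min(offered_gbps, healthy * link_gbps)


def throughput_regression(total_uplinks: int, failed_uplinks: int,
                          link_gbps: float, offered_gbps: float) -> float:
    """Fraction of throughput lost relative to a switch with all uplinks healthy."""
    before = achievable_gbps(total_uplinks, 0, link_gbps, offered_gbps)
    after = achievable_gbps(total_uplinks, failed_uplinks, link_gbps, offered_gbps)
    return 1.0 - after / before


# Hypothetical switch: 8 uplinks at 200 Gbps each, 2 of them lost.
light = throughput_regression(8, 2, 200.0, offered_gbps=800.0)   # load fits in remaining links
heavy = throughput_regression(8, 2, 200.0, offered_gbps=1600.0)  # load needs every link
print(f"light traffic regression: {light:.0%}, heavy traffic regression: {heavy:.0%}")
```

    Under the light pattern the remaining links absorb the load, so a binary up/down monitor sees nothing wrong; only the heavy pattern exposes the lost capacity.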

    A team of researchers from Microsoft Research and Microsoft introduced SuperBench, a proactive validation system designed to enhance cloud AI infrastructure’s reliability by addressing the hidden degradation problem. SuperBench performs a comprehensive evaluation of hardware components under realistic AI workloads. The system includes two main components: a Validator, which learns benchmark criteria to identify defective components, and a Selector, which optimizes the timing and scope of the validation process to ensure it is both effective and efficient. SuperBench can run diverse benchmarks representing most real AI workloads, allowing it to detect subtle performance regressions that might otherwise go unnoticed.

    The technology behind SuperBench is sophisticated and tailored to address the unique challenges cloud AI infrastructures pose. The Validator component of SuperBench conducts a series of benchmarks on specified nodes, learning to distinguish between normal and defective performance by analyzing the cumulative distribution of benchmark results. This approach ensures that even slight deviations in performance, which could indicate a potential problem, are detected early. Meanwhile, the Selector component balances the trade-off between validation time and the possible impact of incidents. Using a probability model to predict the likelihood of incidents, the Selector determines the optimal time to run specific benchmarks. This ensures that validation is performed when it is most likely to prevent issues.
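    The two ideas above can be illustrated with a simplified sketch. It is not SuperBench's actual implementation: the Validator part compares each node's empirical CDF of benchmark results against a reference sample using a maximum-distance statistic, and the Selector part uses an assumed exponential incident model to run a benchmark only when its expected benefit outweighs its duration. All names, thresholds, rates, and coverage values are hypothetical.

```python
import math
from bisect import bisect_right
from dataclasses import dataclass


def ecdf(sample):
    """Return the empirical CDF of a sample as a function of x."""
    ordered = sorted(sample)
    return lambda x: bisect_right(ordered, x) / len(ordered)


def cdf_distance(reference, candidate):
    """Largest gap between the two empirical CDFs (a Kolmogorov-Smirnov style statistic)."""
    ref_cdf, cand_cdf = ecdf(reference), ecdf(candidate)
    points = set(reference) | set(candidate)
    return max(abs(ref_cdf(x) - cand_cdf(x)) for x in points)


def find_defective(reference, node_results, max_distance=0.5):
    """Flag nodes whose result distribution deviates from the reference.

    max_distance is an assumed threshold; note that this simple distance
    also flags nodes that deviate by being faster than the reference.
    """
    return [node for node, results in node_results.items()
            if cdf_distance(reference, results) > max_distance]


@dataclass(frozen=True)
class Benchmark:
    name: str
    duration_hours: float  # validation time the benchmark costs
    coverage: float        # assumed fraction of incidents it would catch


def incident_probability(hours_since_validation, incidents_per_hour):
    """Assumed exponential model: chance of at least one incident so far."""
    return 1.0 - math.exp(-incidents_per_hour * hours_since_validation)


def select_benchmarks(benchmarks, hours_since_validation,
                      incidents_per_hour, incident_cost_hours):
    """Keep benchmarks whose expected saved hours exceed their own duration."""
    p = incident_probability(hours_since_validation, incidents_per_hour)
    return [b.name for b in benchmarks
            if p * b.coverage * incident_cost_hours > b.duration_hours]


# Hypothetical throughput samples (higher is better).
reference = [100, 101, 99, 100, 102, 98]
nodes = {"node-a": [100, 99, 101], "node-b": [90, 91, 89]}
print(find_defective(reference, nodes))

suite = [Benchmark("gemm", 0.1, 0.3),
         Benchmark("allreduce", 0.5, 0.4),
         Benchmark("e2e-train", 2.0, 0.2)]
print(select_benchmarks(suite, 200, 0.001, 24))
```

    In this toy setup, validating shortly after a previous run selects nothing, while a node that has gone a long time without validation gets the cheap, high-coverage benchmarks first.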

    The effectiveness of SuperBench is demonstrated by its deployment in Azure’s production environment, where it has been used to validate hundreds of thousands of GPUs. Through rigorous testing, SuperBench has been shown to increase the mean time between incidents (MTBI) by up to 22.61 times. By reducing the time required for validation and focusing on the most critical components, SuperBench has decreased the cost of validation time by 92.07% while simultaneously increasing user GPU hours by 4.81 times. These impressive results highlight the system’s ability to detect and prevent performance issues before they impact end-to-end workloads.
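    For readers unfamiliar with the metric, the sketch below shows how an MTBI improvement factor of this kind would be computed. The operating hours and incident counts are made-up numbers chosen for easy arithmetic; they are not the data behind the reported 22.61 times figure.

```python
def mtbi(operating_hours: float, incidents: int) -> float:
    """Mean time between incidents: operating time divided by incident count."""
    if incidents <= 0:
        raise ValueError("MTBI is undefined without at least one incident")
    return operating_hours / incidents


def improvement_factor(mtbi_before: float, mtbi_after: float) -> float:
    """How many times longer the system runs between incidents."""
    return mtbi_after / mtbi_before


# Hypothetical fleet: same operating hours, fewer incidents after validation.
before = mtbi(10_000, 50)
after = mtbi(10_000, 5)
print(f"MTBI before: {before} h, after: {after} h, "
      f"improvement: {improvement_factor(before, after)}x")
```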

    In conclusion, SuperBench focuses on the early detection and resolution of hidden degradations, offering a robust solution to the complex challenge of keeping large-scale AI services running continuously and reliably. Its ability to identify subtle performance regressions and optimize the validation process makes it a valuable tool for cloud service providers looking to enhance the reliability of their AI infrastructures, so that AI workloads can be executed with minimal disruption and maximum efficiency.

    Check out the Paper. All credit for this research goes to the researchers of this project.


    The post Microsoft Released SuperBench: A Groundbreaking Proactive Validation System to Enhance Cloud AI Infrastructure Reliability and Mitigate Hidden Performance Degradations appeared first on MarkTechPost.
