    UB-Mesh: A Cost-Efficient, Scalable Network Architecture for Large-Scale LLM Training

    April 3, 2025

    As LLMs scale, their compute and bandwidth demands grow sharply, straining AI training infrastructure. Following scaling laws, LLMs improve comprehension, reasoning, and generation as parameters and datasets expand, which in turn requires more capable computing systems. Large-scale AI clusters now require tens of thousands of GPUs or NPUs: the Llama 3 training run, for example, used 16K GPUs and took 54 days. With AI data centers deploying over 100K GPUs, scalable infrastructure is essential. Interconnect bandwidth requirements also surpass 3.2 Tbps per node, far exceeding those of traditional CPU-based systems. The rising cost of symmetrical Clos network architectures makes cheaper alternatives critical, as does reducing operational expenses such as energy and maintenance. High availability is a further concern: massive training clusters experience frequent hardware failures, so the network design must be fault tolerant.

    Addressing these challenges requires rethinking AI data center architecture. First, network topologies should align with the structured traffic patterns of LLM training, which differ from traditional workloads. Tensor parallelism, responsible for most data transfers, operates within small clusters, while data parallelism involves minimal but long-range communication. Second, computing and networking systems must be co-optimized, so that parallelism strategies and resource distribution avoid both congestion and underutilization. Lastly, AI clusters must feature self-healing mechanisms for fault tolerance, automatically rerouting traffic or activating backup NPUs when failures occur. These three principles (localized network architectures, topology-aware computation, and self-healing systems) are essential for building efficient, resilient AI training infrastructure.

    Huawei researchers introduced UB-Mesh, an AI data center network architecture designed for scalability, efficiency, and reliability. Unlike traditional symmetrical networks, UB-Mesh employs a hierarchically localized nD-FullMesh topology that favors short-range interconnects and so reduces dependence on switches. Its UB-Mesh-Pod, based on a 4D-FullMesh design, integrates specialized hardware and a Unified Bus (UB) technique for flexible bandwidth allocation. The All-Path Routing (APR) mechanism improves traffic management, while a 64+1 backup system provides fault tolerance. Compared to Clos networks, UB-Mesh reduces switch usage by 98% and optical module reliance by 93%, achieving 2.04× higher cost efficiency with minimal performance trade-offs in LLM training.
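The article does not spell out how the 64+1 backup system works, so the following is only a minimal sketch of the general idea: a rack keeps one spare NPU, and when an active NPU fails, its logical rank is remapped to the spare. The class name `RackWithSpare`, its methods, and the physical ID numbering are all hypothetical, not taken from the paper.

```python
class RackWithSpare:
    """Hypothetical model of a rack with 64 active NPUs and 1 spare."""

    def __init__(self, active=64, spares=1):
        # Logical rank -> physical NPU id. Initially rank i runs on NPU i.
        self.rank_to_npu = {rank: rank for rank in range(active)}
        # Spare NPUs get the physical ids after the active ones (assumption).
        self.free_spares = list(range(active, active + spares))

    def fail(self, physical_id):
        """Remap the rank hosted on a failed NPU to a spare; return that rank."""
        for rank, npu in self.rank_to_npu.items():
            if npu == physical_id:
                if not self.free_spares:
                    raise RuntimeError("no spare NPU left in this rack")
                self.rank_to_npu[rank] = self.free_spares.pop(0)
                return rank
        raise KeyError(f"NPU {physical_id} is not hosting any rank")


rack = RackWithSpare()
moved_rank = rack.fail(5)
print(moved_rank, rack.rank_to_npu[moved_rank])  # 5 64
```

In this toy model the job keeps its 64 logical ranks after one failure, and a second failure in the same rack raises an error because the single spare is already in use.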

    UB-Mesh is a high-dimensional full-mesh interconnect architecture designed to enhance efficiency in large-scale AI training. It employs an nD-FullMesh topology, minimizing reliance on costly switches and optical modules by maximizing direct electrical connections. The system is built on modular hardware components linked through a UB interconnect, streamlining communication across CPUs, NPUs, and switches. A 2D full-mesh structure connects 64 NPUs within a rack, extending to a 4D full-mesh at the Pod level. For scalability, a SuperPod structure integrates multiple Pods using a hybrid Clos topology, balancing performance, flexibility, and cost-efficiency in AI data centers.
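To make the nD-FullMesh idea concrete, here is a small sketch that treats each NPU as a coordinate tuple and connects two NPUs directly whenever they differ in exactly one coordinate, so every dimension forms its own full mesh. The 8×8 arrangement of the 64 NPUs in a rack is an assumption made for illustration, and the function names are hypothetical.

```python
from itertools import product


def full_mesh_neighbors(node, dims):
    """Return every node that differs from `node` in exactly one coordinate."""
    neighbors = []
    for axis, size in enumerate(dims):
        for value in range(size):
            if value != node[axis]:
                neighbors.append(node[:axis] + (value,) + node[axis + 1:])
    return neighbors


def count_links(dims):
    """Count undirected direct links in an nD full mesh of shape `dims`."""
    nodes = product(*(range(size) for size in dims))
    # Each link is seen once from each endpoint, so halve the total.
    return sum(len(full_mesh_neighbors(node, dims)) for node in nodes) // 2


rack_dims = (8, 8)  # assumed layout of the 64 NPUs in one rack
print(len(full_mesh_neighbors((0, 0), rack_dims)))  # 14 direct peers per NPU
print(count_links(rack_dims))  # 448 direct links in the rack
```

Under this model each NPU reaches 7 peers along each of the two dimensions without passing through a switch, which is the property the topology relies on to cut switch and optical module counts.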

    To make UB-Mesh efficient in large-scale AI training, the researchers employ topology-aware strategies for collective communication and parallelization. For AllReduce, a Multi-Ring algorithm minimizes congestion by mapping paths efficiently and using otherwise idle links to increase bandwidth. For all-to-all communication, a multi-path approach raises data transmission rates, while hierarchical methods optimize bandwidth for broadcast and reduce operations. The study also refines parallelization through a systematic search that prioritizes high-bandwidth configurations. Comparisons with the Clos architecture show that UB-Mesh maintains competitive performance while significantly reducing hardware costs, making it a cost-effective alternative for large-scale model training.
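The Multi-Ring algorithm itself is not detailed in the article. As background, the sketch below simulates a plain single-ring sum AllReduce (reduce-scatter followed by all-gather), which is the building block that a multi-ring scheme would run on several rings at once. It is not the paper's algorithm, and the function name `ring_allreduce` is hypothetical.

```python
def ring_allreduce(buffers):
    """Simulate a sum AllReduce over one logical ring.

    `buffers` is a list of n lists, one per rank, each holding n chunks.
    Returns the per-rank buffers after the collective completes.
    """
    n = len(buffers)
    data = [list(buf) for buf in buffers]

    # Reduce-scatter: in step s, rank r sends chunk (r - s) % n to rank r + 1,
    # which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, (idx, value) in enumerate(sends):
            data[(r + 1) % n][idx] += value

    # Now rank r holds the fully reduced chunk (r + 1) % n.
    # All-gather: pass the finished chunks around the ring, overwriting.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (idx, value) in enumerate(sends):
            data[(r + 1) % n][idx] = value

    return data


ranks = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
print(ring_allreduce(ranks)[0])  # [1111, 2222, 3333, 4444]
```

Each rank sends only one chunk per step to its ring neighbor, so a single ring uses one outgoing link per NPU. In a full mesh many other links would sit idle during this exchange, which is the slack a multi-ring mapping is meant to exploit.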

    The UB IO controller incorporates a specialized co-processor, the Collective Communication Unit (CCU), to offload collective communication tasks. The CCU manages data transfers, inter-NPU transmissions, and in-line data reduction using an on-chip SRAM buffer, which minimizes redundant memory copies and reduces HBM bandwidth consumption. It also improves compute-communication overlap. Additionally, UB-Mesh efficiently supports massive-expert MoE models by leveraging hierarchical all-to-all optimization and load/store-based data transfer. In conclusion, the study introduces UB-Mesh, an nD-FullMesh network architecture for LLM training that offers cost-efficient, high-performance networking with 95%+ linearity, 7.2% improved availability, and 2.04× better cost efficiency than Clos networks.


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post UB-Mesh: A Cost-Efficient, Scalable Network Architecture for Large-Scale LLM Training appeared first on MarkTechPost.
