Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization

Sparse large language models (LLMs) based on the Mixture of Experts (MoE) framework have gained traction for their ability to scale efficiently by activating only a subset of parameters per token. This dynamic sparsity allows MoE models to retain high representational capacity while limiting computation per token. However, with their increasing complexity and model size approaching trillions of parameters, training them efficiently requires algorithmic innovation and a tightly integrated hardware-software optimization. These challenges are especially relevant when deploying models on non-standard AI accelerators like Ascend NPUs, which require specific architectural alignment to deliver optimal performance.

A major technical challenge lies in the inefficient utilization of hardware resources while training sparse LLMs. Since only a portion of parameters are active for each token, workloads across devices become unbalanced, leading to synchronization delays and underused processing power. This imbalance also affects memory utilization as different experts process different numbers of tokens, sometimes exceeding capacity. These inefficiencies are compounded at a large scale, such as across thousands of AI chips, where communication and memory management bottlenecks significantly hinder throughput. The inability to fully harness the computational promise of sparsity in practice restricts the deployment of such models on hardware systems like Ascend NPUs.

Several strategies have been proposed to tackle these challenges. These include auxiliary losses to balance token distribution across experts and drop-and-pad strategies that limit expert overload by discarding tokens exceeding capacity. However, these techniques either reduce model performance or introduce inefficiencies in memory and computation. Other efforts include heuristic expert placement and traditional communication patterns like All-to-All dispatching, but these often fail to scale well or maintain high throughput. Moreover, standard memory-saving techniques like recomputation are usually coarse-grained, targeting whole layers instead of specific operations, leading to increased runtime without proportional memory savings.

Researchers from the Pangu team at Huawei Cloud introduced a highly structured and optimized training approach for large MoE models tailored to Ascend NPUs. They developed Pangu Ultra MoE, a sparse LLM with 718 billion parameters, focusing on aligning model architecture and system design with the capabilities of the Ascend hardware. Their approach begins with a simulation-based model configuration process that evaluates thousands of architecture variants using metrics grounded in actual hardware behavior. These simulations inform design decisions before any physical training is undertaken, thus saving substantial computational resources and enabling informed tuning of model hyperparameters.

The simulation method analyzes combinations of parameters such as the number of layers, hidden size, and expert count using a five-dimensional parallelism strategy that includes Pipeline Parallelism, Tensor Parallelism, Expert Parallelism, Data Parallelism, and Context Parallelism. The final model configuration adopted by Huawei included 256 experts, a hidden size 7680, and 61 transformer layers. To further optimize performance, researchers integrated an Adaptive Pipe Overlap mechanism to mask communication costs and used hierarchical All-to-All communication to reduce inter-node data transfer. They employed fine-grained recomputation, such as recomputing only key-value vectors in attention modules, and introduced tensor swapping to offload activation memory to host devices dynamically.

Pangu Ultra MoE achieved a Model Flops Utilization (MFU) of 30.0% and processed tokens at a rate of 1.46 million per second using 6,000 Ascend NPUs. The baseline MFU was 18.9% with 0.61 million tokens per second on 4,000 NPUs. The researchers also introduced dynamic expert placement strategies, improving device-level load balance and achieving a relative 10% MFU improvement. The model performed competitively on benchmark evaluations, attaining 81.3% on AIME2024, 97.4% on MATH500, 94.8% on CLUEWSC, and 91.5% on MMLU. In the healthcare domain, it outperformed DeepSeek R1 by scoring 87.1% on MedQA and 80.8% on MedMCQA, confirming its strength in domain-specific applications.

This study illustrates how the Pangu team at Huawei effectively tackled the core difficulties of training massive MoE models on specialized hardware. Their systematic architecture search, efficient communication techniques, and tailored memory optimizations represent a strong framework for scalable AI training. The work demonstrates practical ways to unlock the performance potential of sparse models and sets a direction for future system-aware AI design.

Check out Paper here. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit.

Here’s a brief overview of what we’re building at Marktechpost:

ML News Community – r/machinelearningnews (92k+ members)
Newsletter– airesearchinsights.com/(30k+ subscribers)
miniCON AI Events – minicon.marktechpost.com
AI Reports & Magazines – magazine.marktechpost.com
AI Dev & Research News – marktechpost.com (1M+ monthly readers)

The post Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization appeared first on MarkTechPost.

Source: Read MoreÂ

The Power Of The Intl API: A Definitive Guide To Browser-Native Internationalization

This week in AI dev tools: GPT-5, Claude Opus 4.1, and more (August 8, 2025)

Elastic simplifies log analytics for SREs and developers with launch of Log Essentials

OpenAI launches GPT-5

I compared the best headphones from Apple, Sony, Bose, and Sonos: Here’s how the AirPods Max wins

I changed these 6 settings on my iPad to significantly improve its battery life

DistroWatch Weekly, Issue 1134

3 portable power stations I travel everywhere with (and how they differ)

Next.js PWA offline capability with Service Worker, no extra package

Next.js PWA offline capability with Service Worker, no extra package

spatie/laravel-flare

Establishing Consistent Data Foundations with Laravel’s Database Population System

Windows 11 Copilot gets free access to GPT-5 Thinking, reduced rate limits than ChatGPT Free

Windows 11 Copilot gets free access to GPT-5 Thinking, reduced rate limits than ChatGPT Free

Best Architecture AI Rendering Platform: 6 Tools Tested

Microsoft won’t kill off Chromium Edge and PWAs on Windows 10 until October 2028

Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization

How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

AI Agent Trends of 2025: A Transformative Landscape

5 simple ways to start taking control of your online privacy today

Best Memorial Day deals 2025: Save on handpicked early tech deals live now

Right Invoicing App for iPhone: InvoiceTemple

Shiori is a simple bookmark manager

Amazon locked NVIDIA’s RTX 50-series GPUs behind a Prime membership — Here are all the cards you can buy right now

Microsoft Teams will fix meeting chats for presenters with this small change

How to Create Serverless AI Agents with Langbase Docs MCP Server in Minutes

CVE-2025-5447 – Linksys Wireless Router OS Command Injection Vulnerability

Huawei Introduces Pangu Ultra MoE: A 718B-Parameter Sparse Language Model Trained Efficiently on Ascend NPUs Using Simulation-Driven Architecture and System-Level Optimization

Related Posts