
    Muon Optimizer Significantly Accelerates Grokking in Transformers: Microsoft Researchers Explore Optimizer Influence on Delayed Generalization

    April 23, 2025

    Revisiting the Grokking Challenge

    In recent years, the phenomenon of grokking—where deep learning models exhibit a delayed yet sudden transition from memorization to generalization—has prompted renewed investigation into training dynamics. Initially observed in small algorithmic tasks like modular arithmetic, grokking reveals that models can reach near-perfect training accuracy while validation performance remains poor for a prolonged period. Eventually, and often abruptly, the model begins to generalize. Understanding what governs this transition is important not just for interpretability, but also for optimizing training efficiency in deep networks. Prior studies have highlighted the role of weight decay and regularization. However, the specific influence of optimizers on this process has been underexplored.

    Investigating Optimizer Effects on Grokking

    This AI paper from Microsoft examines the impact of optimizer choice on grokking behavior. Specifically, it contrasts the performance of the widely adopted AdamW optimizer with Muon, a newer optimization algorithm that incorporates spectral norm constraints and second-order information. The study investigates whether these features enable Muon to expedite the generalization phase.

    The experiments span seven algorithmic tasks—primarily modular arithmetic operations and parity classification—using a modern Transformer architecture. Each task is designed to reliably exhibit grokking under appropriate training conditions. The research also includes a comparative analysis of softmax variants (standard softmax, stablemax, and sparsemax) to evaluate whether output normalization plays a secondary role in modulating training dynamics. However, the core investigation centers on the optimizer.

    Architectural and Optimization Design

    The underlying model architecture adopts standard Transformer components, implemented in PyTorch. It includes multi-head self-attention, rotary positional embeddings (RoPE), RMS normalization, SiLU activations, and dropout-based regularization. Input tokens—numerical values or operators—are encoded through simple identity embeddings.
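Two of the components named above, RMS normalization and the SiLU activation, are compact enough to write out. The following is a minimal plain-Python sketch for illustration only; the function names, the `eps` value, and the example vector are assumptions, and the paper's PyTorch implementation is not reproduced here.

```python
import math

def rms_norm(x, gain=None, eps=1e-6):
    """Rescale a vector so its root-mean-square is about 1, then apply a per-element gain."""
    if gain is None:
        gain = [1.0] * len(x)  # assumed default: no learned rescaling
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * g for v, g in zip(x, gain)]

def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

# Hypothetical hidden vector, normalized and then passed through the activation.
hidden = [3.0, -4.0, 0.5]
normed = rms_norm(hidden)
activated = [silu(v) for v in normed]
```

Unlike layer normalization, RMS normalization does not subtract the mean; it only rescales.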

    The key distinction lies in the optimizer behavior:

    • AdamW, a baseline in contemporary deep learning workflows, uses adaptive learning rates with decoupled weight decay.
    • Muon, in contrast, applies orthogonalized gradients, enforces spectral norm constraints to stabilize training, and approximates second-order curvature for more informative updates.
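To make the contrast concrete, here is a rough plain-Python sketch of the two update styles. `adamw_step` follows the standard AdamW rule for a single scalar parameter. `orthogonalize` uses the classical cubic Newton-Schulz iteration as a stand-in for Muon's orthogonalization step; Muon implementations use a tuned variant that is not reproduced here, and all names and default values below are assumptions.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter p with gradient g (t is the 1-based step)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    p = p * (1 - lr * weight_decay)          # decoupled weight decay
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=15):
    """Approximate the nearest semi-orthogonal matrix to g (rows <= columns assumed).

    Classical cubic Newton-Schulz iteration: X <- 1.5 X - 0.5 (X X^T) X,
    started from g scaled by its Frobenius norm so every singular value is at most 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(rx, ry)] for rx, ry in zip(x, xxt_x)]
    return x

# Hypothetical 2x3 gradient matrix; after orthogonalization its rows are orthonormal.
grad = [[2.0, 0.0, 1.0], [0.0, 1.0, 3.0]]
update = orthogonalize(grad)
```

Every singular value of the orthogonalized matrix is close to 1, so the update has a spectral norm of about 1 regardless of how large or small the raw gradient was. AdamW instead rescales each coordinate independently.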

    These mechanisms are intended to promote broader exploration during optimization, mitigate instability (e.g., “softmax collapse”), and synchronize learning progress across layers. Muon’s ability to regulate update magnitude in accordance with layer dimensions is particularly relevant in avoiding inefficient memorization pathways.

    Three softmax configurations—Softmax, Stablemax, and Sparsemax—are included to assess whether numerical stability or sparsity of the output distribution influences grokking. This helps ensure that the observed effects stem primarily from optimizer dynamics rather than output activation nuances.
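For reference, the three output functions can be sketched in a few lines of plain Python. The article only names them, so the definitions below are assumptions taken from the common formulations in the literature: stablemax replaces the exponential with a piecewise function that grows linearly, and sparsemax is the Euclidean projection onto the probability simplex. The paper's exact variants may differ.

```python
import math

def softmax(z):
    """Standard softmax, shifted by the maximum logit for numerical safety."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def stablemax(z):
    """Softmax-like normalization using s(x) = x + 1 for x >= 0, else 1 / (1 - x)."""
    s = [v + 1.0 if v >= 0 else 1.0 / (1.0 - v) for v in z]
    total = sum(s)
    return [v / total for v in s]

def sparsemax(z):
    """Project the logits onto the simplex; small logits receive exactly zero mass."""
    cumsum = 0.0
    tau = 0.0
    for k, v in enumerate(sorted(z, reverse=True), start=1):
        cumsum += v
        if 1.0 + k * v > cumsum:      # the k-th largest logit is still in the support
            tau = (cumsum - 1.0) / k  # threshold for the current support size
    return [max(v - tau, 0.0) for v in z]

logits = [2.0, 1.0, 0.1]  # hypothetical logits
dense = softmax(logits)
stable = stablemax(logits)
sparse = sparsemax(logits)
```

All three return a distribution that sums to 1, but only sparsemax can assign exactly zero probability to a class.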

    Empirical Evaluation and Results

    Each optimizer-softmax-task combination is evaluated across multiple random seeds so that the results do not hinge on a single initialization. Grokking is operationally defined as the first epoch at which validation accuracy surpasses 95% after training accuracy has stabilized.
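The grokking criterion can be expressed as a small helper over per-epoch accuracy logs. This is a sketch under stated assumptions: the article does not say how training-accuracy stabilization is detected, so the code uses a simple proxy (training accuracy has reached a hypothetical `train_threshold` of 0.99), and the function name and example curves are made up for illustration.

```python
def grokking_epoch(train_acc, val_acc, train_threshold=0.99, val_threshold=0.95):
    """Return the first 1-based epoch where validation accuracy exceeds val_threshold
    once training accuracy has reached train_threshold; None if that never happens."""
    train_stable = False
    for epoch, (tr, va) in enumerate(zip(train_acc, val_acc), start=1):
        if tr >= train_threshold:
            train_stable = True  # assumed proxy for "training accuracy has stabilized"
        if train_stable and va > val_threshold:
            return epoch
    return None

# Made-up curves: training accuracy saturates early, validation accuracy jumps later.
train_curve = [0.50, 0.99, 1.00, 1.00, 1.00]
val_curve = [0.10, 0.20, 0.30, 0.96, 0.99]
epoch = grokking_epoch(train_curve, val_curve)
```

Running the helper once per seed and collecting the returned epochs yields the kind of distribution that can be compared between optimizers.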

    The results indicate a consistent advantage for Muon. On average, Muon reaches the grokking threshold in 102.89 epochs, compared to 153.09 epochs for AdamW, roughly a third fewer. The difference is statistically significant (t = 5.0175, p ≈ 6.33e−8). Muon also shows a tighter distribution of grokking epochs across all conditions, suggesting more predictable training trajectories.
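The size of the gap, and the general shape of a two-sample comparison, can be sketched as follows. The relative reduction uses the two means reported above. The article does not say which t-test was used, so Welch's unequal-variance statistic is an assumption, and the per-run epoch lists are made-up placeholders rather than data from the paper.

```python
import math
from statistics import mean, variance

def relative_reduction(baseline, candidate):
    """Fraction by which candidate is smaller than baseline."""
    return (baseline - candidate) / baseline

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Means reported in the article: AdamW 153.09 epochs, Muon 102.89 epochs.
reduction = relative_reduction(153.09, 102.89)

# Made-up per-run grokking epochs, for illustration only.
adamw_epochs = [150, 160, 148, 155]
muon_epochs = [100, 108, 98, 104]
t_stat = welch_t(adamw_epochs, muon_epochs)
```

A positive statistic here means the first sample (AdamW) took more epochs on average than the second (Muon).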

    All experiments were run on NVIDIA H100 GPUs using a unified codebase and standardized configurations. Tasks include modular addition, multiplication, division, exponentiation, GCD, and a 10-bit parity task. Dataset sizes ranged from 1,024 to 9,409 examples, with training-validation splits adjusted per task to maintain consistency.
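Datasets of this kind are small enough to enumerate exhaustively. The sketch below builds a modular addition table and a parity table in plain Python. The modulus 97 is an assumption, chosen only because 97 × 97 = 9,409 matches the largest reported dataset size; the 50/50 split, the seed, and all function names are likewise illustrative rather than taken from the paper.

```python
import random
from itertools import product

def modular_addition_dataset(p):
    """All pairs (a, b) with label (a + b) mod p."""
    return [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

def parity_dataset(n_bits):
    """All n-bit strings with label 1 if the number of set bits is odd, else 0."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=n_bits)]

def split(examples, train_fraction=0.5, seed=0):
    """Shuffle deterministically and cut into training and validation sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

addition = modular_addition_dataset(97)   # assumed modulus
parity = parity_dataset(10)               # 10-bit parity, as named in the article
train_set, val_set = split(addition)
```

With these choices the two tables contain 9,409 and 1,024 examples respectively, which matches the range of dataset sizes quoted above.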

    Conclusion

    The findings provide strong evidence that optimizer geometry significantly influences the emergence of generalization in overparameterized models. By steering the optimization path through second-order-aware updates and spectral norm constraints, Muon appears to facilitate a more direct route toward discovering the underlying data structure, bypassing prolonged overfitting phases.

    This study underscores the broader need to consider optimization strategy as a first-class factor in neural training design. While prior work emphasized data and regularization, these results suggest that optimizer architecture itself can play a pivotal role in shaping training dynamics.



    The post Muon Optimizer Significantly Accelerates Grokking in Transformers: Microsoft Researchers Explore Optimizer Influence on Delayed Generalization appeared first on MarkTechPost.
