    Generalization of Gradient Descent in Over-Parameterized ReLU Networks: Insights from Minima Stability and Large Learning Rates

    June 16, 2024

    Neural networks trained with gradient descent work remarkably well in overparameterized settings with random weight initialization, often reaching globally optimal solutions despite the non-convexity of the problem. Surprisingly, these zero-training-error solutions often do not overfit, a phenomenon known as “benign overfitting.” For ReLU networks, however, interpolating solutions can overfit, and when labels are noisy the best solutions usually do not interpolate the data at all. In practice, training is often stopped before full interpolation to avoid unstable regions and spiky, non-robust solutions.

    Researchers from UC Santa Barbara, Technion, and UC San Diego study the generalization of two-layer ReLU neural networks in 1D nonparametric regression with noisy labels. They present a new theory showing that gradient descent with a fixed learning rate converges to local minima that represent smooth, sparse linear-spline functions. These non-interpolating solutions avoid overfitting and achieve near-optimal mean squared error (MSE) rates. The analysis highlights that large learning rates induce implicit sparsity and that ReLU networks can generalize well even without explicit regularization or early stopping, moving the theory beyond traditional kernel and interpolation frameworks.
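
    To make this setup concrete, the following minimal sketch (not the authors’ code) trains a two-layer ReLU network on noisy 1D regression data with plain full-batch gradient descent at a fixed, relatively large learning rate and no explicit regularization; the target function, width, step size, and iteration count are illustrative assumptions.

    ```python
    # Minimal sketch (not the authors' code) of the regime described above:
    # a two-layer ReLU network fit to noisy 1D data with plain full-batch
    # gradient descent at a fixed, relatively large learning rate and no
    # explicit regularization. All hyperparameters are illustrative
    # assumptions; if the loss diverges, reduce the step size.
    import torch

    torch.manual_seed(0)

    n, width, lr, steps = 64, 100, 0.05, 20_000         # assumed hyperparameters
    x = torch.rand(n, 1) * 2 - 1                         # inputs on [-1, 1]
    y = torch.sin(3 * x) + 0.3 * torch.randn(n, 1)       # noisy labels

    model = torch.nn.Sequential(
        torch.nn.Linear(1, width),
        torch.nn.ReLU(),
        torch.nn.Linear(width, 1),
    )

    params = list(model.parameters())
    for step in range(steps):
        loss = torch.mean((model(x) - y) ** 2)           # full-batch MSE
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr * g                              # plain gradient descent
        # No weight decay and no early stopping: any regularization effect is
        # the implicit bias of the large step size discussed in the paper.

    with torch.no_grad():
        print(f"final training MSE: {torch.mean((model(x) - y) ** 2).item():.4f}")
    ```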

    Most prior work on overparameterized neural networks studies generalization in the interpolation regime and benign overfitting, and typically relies on explicit regularization or early stopping to handle noisy labels. Recent findings indicate, however, that gradient descent with a large learning rate can reach sparse, smooth functions that generalize well even without explicit regularization. This departs from traditional interpolation-based theories and shows that gradient descent induces an implicit bias resembling L1 regularization. The study also connects to the hypothesis that “flat local minima generalize better” and provides insight into achieving optimal rates in nonparametric regression without weight decay.

    The study then lays out the setup and notation for analyzing generalization in two-layer ReLU neural networks trained by gradient descent on a regression dataset with noisy labels. A central concept is the stable local minimum: a twice-differentiable minimum that lies within a prescribed distance of the global minimum and from which gradient descent with the given step size does not escape. The analysis also considers the “Edge of Stability” regime, in which the largest eigenvalue of the loss Hessian sits near the critical threshold 2/η determined by the learning rate η. For nonparametric regression, the target function is assumed to belong to a bounded variation class. The analysis shows that, in noisy settings, gradient descent cannot find stable interpolating solutions and instead settles on smoother, non-interpolating functions.
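
    This stability notion can be probed numerically. The sketch below, reusing the illustrative model and data from the previous snippet, estimates the largest eigenvalue of the training-loss Hessian via power iteration on Hessian-vector products and compares it with the classical gradient-descent stability threshold 2/lr; a minimum is linearly stable only if its sharpness stays below this threshold, and the Edge of Stability corresponds to sharpness hovering near it.

    ```python
    # Sketch of the stability check behind the Edge-of-Stability discussion:
    # estimate the largest Hessian eigenvalue of the training loss by power
    # iteration on Hessian-vector products and compare it with the classical
    # gradient-descent stability threshold 2 / lr. Reuses the illustrative
    # `model`, `x`, `y`, and `lr` from the training sketch above.
    import torch

    def top_hessian_eigenvalue(model, x, y, iters=50):
        params = [p for p in model.parameters() if p.requires_grad]
        loss = torch.mean((model(x) - y) ** 2)
        grads = torch.autograd.grad(loss, params, create_graph=True)
        v = [torch.randn_like(p) for p in params]        # random start direction
        for _ in range(iters):
            norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
            v = [vi / norm for vi in v]
            # Hessian-vector product: differentiate <grad(loss), v> again.
            dot = sum((g * vi).sum() for g, vi in zip(grads, v))
            hv = torch.autograd.grad(dot, params, retain_graph=True)
            eig = sum((h * vi).sum() for h, vi in zip(hv, v))  # Rayleigh quotient
            v = [h.detach() for h in hv]
        return eig.item()

    sharpness = top_hessian_eigenvalue(model, x, y)
    print(f"lambda_max ~ {sharpness:.2f}  vs  2/lr = {2 / lr:.2f}")
    # A minimum is linearly stable for step size lr only if lambda_max <= 2/lr;
    # training at the Edge of Stability hovers near this threshold.
    ```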

    The main results characterize stable solutions of gradient descent (GD) on ReLU networks along three lines. First, they describe the implicit bias of stable solutions in function space under large learning rates, showing that such solutions are inherently smoother and simpler. Second, they derive generalization bounds for these solutions in distribution-free and nonparametric regression settings, showing that they avoid overfitting. Third, they show that GD achieves optimal rates for estimating bounded-variation functions over appropriate sub-intervals, confirming that large-learning-rate GD solutions generalize well even in noisy settings.

    In conclusion, the study examines how gradient descent-trained two-layer ReLU neural networks generalize through the lens of minima stability and the Edge-of-Stability phenomenon. Focusing on univariate inputs with noisy labels, it shows that gradient descent with a typical learning rate cannot interpolate the data. Local smoothness of the training loss implies a first-order total variation constraint on the network’s function, which yields a vanishing generalization gap in the strict interior of the data support. These stable solutions also achieve near-optimal rates for estimating first-order bounded-variation functions under a mild assumption. Simulations validate the findings, showing that large-learning-rate training induces sparse linear spline fits.
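
    One way to inspect that structure (again assuming the illustrative model trained above) is to sample the learned function on a dense grid, measure the total variation of its first derivative, and count the kinks where the slope changes appreciably; the grid size and knot threshold below are arbitrary choices.

    ```python
    # Illustrative check of the "sparse linear spline" picture, assuming the
    # model from the training sketch above: sample the learned 1D function on
    # a dense grid, compute the total variation of its first derivative, and
    # count effective knots.
    import torch

    grid = torch.linspace(-1, 1, 2001).unsqueeze(1)
    with torch.no_grad():
        f = model(grid).squeeze(1)

    slopes = (f[1:] - f[:-1]) / (grid[1:, 0] - grid[:-1, 0])   # piecewise slopes
    slope_jumps = (slopes[1:] - slopes[:-1]).abs()

    first_order_tv = slope_jumps.sum().item()                  # TV of f'
    knots = (slope_jumps > 1e-2).sum().item()                  # effective kinks

    print(f"TV(f') ~ {first_order_tv:.2f}, effective knots: {knots}")
    # A smooth, non-interpolating solution should use far fewer effective knots
    # than hidden units, consistent with a sparse linear-spline fit.
    ```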

    Check out the Paper. All credit for this research goes to the researchers of this project. One of the authors summarized the work on Twitter:

    Excited to share our latest work that shows “Large Step Size Training Cannot Overfit” in univariate ReLU nets https://t.co/eFHXlJE9yT
    For the first time, we understand how *flatness*, *edge-of-stability* and *large stepsize* imply (near-optimal) generalization. 1/

    — Yu-Xiang Wang (@yuxiangw_cs) June 13, 2024

    The post Generalization of Gradient Descent in Over-Parameterized ReLU Networks: Insights from Minima Stability and Large Learning Rates appeared first on MarkTechPost.
