    LLMs Can Now Learn without Labels: Researchers from Tsinghua University and Shanghai AI Lab Introduce Test-Time Reinforcement Learning (TTRL) to Enable Self-Evolving Language Models Using Unlabeled Data

    April 23, 2025

    Despite significant advances in reasoning capabilities through reinforcement learning (RL), most large language models (LLMs) remain fundamentally dependent on supervised data pipelines. RL frameworks such as RLHF have pushed model alignment and instruction-following performance but rely heavily on human feedback and labeled datasets. As LLMs are increasingly applied in dynamic environments—ranging from educational settings to scientific workflows—they are required to generalize beyond curated training data.

    However, existing models often exhibit performance gaps when confronted with distribution shifts or novel reasoning tasks. While techniques like Test-Time Scaling (TTS) and Test-Time Training (TTT) have been proposed to mitigate this, the absence of reliable reward signals during inference poses a core challenge for deploying RL in unsupervised settings.

    Test-Time Reinforcement Learning (TTRL): Leveraging Model Priors for Self-Adaptation

    Researchers from Tsinghua University and Shanghai AI Lab introduced Test-Time Reinforcement Learning (TTRL), a training framework that applies RL during inference using only unlabeled test data. It leverages the intrinsic priors of pre-trained language models to estimate pseudo-rewards through majority voting across sampled outputs.

    Instead of relying on explicit labels, TTRL constructs reward functions by aggregating multiple model-generated responses to a given query. A consensus answer, obtained via majority voting, is treated as a pseudo-label. Model responses that align with this pseudo-label are positively reinforced. This formulation transforms test-time inference into an adaptive, self-supervised learning process, allowing LLMs to improve over time without additional supervision.
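
    As a concrete illustration, the consensus step can be sketched in a few lines of Python. The helper below is a minimal, hypothetical sketch (not taken from the paper's code) that turns a batch of sampled answers into a pseudo-label and binary rewards:

        from collections import Counter

        def majority_vote_reward(sampled_answers):
            """Estimate a pseudo-label by majority vote and score each sample.

            sampled_answers: final answers extracted from the sampled responses.
            Returns (pseudo_label, rewards), where rewards[i] is 1.0 if sample i
            matches the consensus answer and 0.0 otherwise.
            """
            counts = Counter(sampled_answers)
            pseudo_label, _ = counts.most_common(1)[0]  # most frequent answer
            rewards = [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]
            return pseudo_label, rewards

    For example, majority_vote_reward(["42", "41", "42"]) would return ("42", [1.0, 0.0, 1.0]).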

    TTRL follows a two-stage approach:

    • Label Estimation via Majority Voting: For each prompt, the model samples multiple outputs. The most frequent prediction is treated as the estimated label.
    • Reward Assignment and Policy Optimization: A binary reward is assigned based on whether each sampled response matches the estimated label. The model is updated using gradient-based RL algorithms (e.g., PPO or GRPO) to maximize agreement with the pseudo-labels.

    This approach is notable for its simplicity and compatibility with standard RL methods. The reward function, though approximate, provides sufficient learning signal when aggregated over multiple samples. Experimental setups used temperature-controlled sampling (typically temperature = 1.0), with 64 samples for voting and 16 subsampled responses for training updates. No ground-truth labels are involved at any stage.
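
    To make the two stages concrete, here is a rough sketch of a single test-time update step under those settings, reusing the majority_vote_reward helper sketched above. It assumes a model object exposing generate and policy_update methods and a task-specific extract_final_answer parser; these names are illustrative placeholders rather than the authors' implementation, and the policy update stands in for PPO or GRPO.

        import random

        N_VOTE = 64        # responses sampled per prompt for majority voting
        N_TRAIN = 16       # responses subsampled for the policy update
        TEMPERATURE = 1.0  # sampling temperature reported in the experiments

        def ttrl_step(model, prompt):
            # Stage 1: sample responses and estimate the label by majority vote.
            responses = [model.generate(prompt, temperature=TEMPERATURE)
                         for _ in range(N_VOTE)]
            answers = [extract_final_answer(r) for r in responses]  # task-specific parsing
            pseudo_label, _ = majority_vote_reward(answers)

            # Stage 2: assign binary rewards against the pseudo-label and take
            # one RL step (PPO or GRPO in the paper) on a subsample of responses.
            batch = random.sample(list(zip(responses, answers)), N_TRAIN)
            rewards = [1.0 if answer == pseudo_label else 0.0 for _, answer in batch]
            model.policy_update(prompt=prompt,
                                responses=[resp for resp, _ in batch],
                                rewards=rewards)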

    Empirical Findings across Mathematical Reasoning Tasks

    TTRL was evaluated on three mathematical benchmarks: AIME 2024, AMC, and MATH-500. The results are consistent across both smaller and larger models:

    • For Qwen2.5-Math-7B, pass@1 accuracy on AIME 2024 increased from 16.7% to 43.3%, a relative improvement of 159.3%, without any labeled data.
    • On average, across the three benchmarks, the same model achieved a relative gain of 84.1%.
    • Notably, even a smaller model, Qwen2.5-Math-1.5B, improved from 33.0% to 80.0% on MATH-500.

    These gains demonstrate that TTRL supports model improvement even in the absence of supervised training signals. Moreover, TTRL often outperforms the upper bound implied by its own training signal—i.e., the accuracy of the majority-voted predictions. This suggests a self-reinforcing learning loop that can extract richer supervision from noisy consensus signals.

    Additional analyses showed that TTRL generalizes beyond the dataset it was applied to. When trained on one benchmark and evaluated on others, performance improvements persisted. This cross-task transfer indicates that TTRL does not lead to narrow overfitting but supports broader generalization.

    Conclusion: Toward Self-Adaptive and Label-Free Learning

    TTRL represents a novel shift in how reinforcement learning can be applied to LLMs in real-world settings. By reusing the model’s own generations as a proxy for supervision, it removes the need for expensive human annotations while enabling continual adaptation. The approach scales naturally with model size, is compatible with different RL algorithms, and shows promising robustness across tasks of varying difficulty.

    While this study focuses on mathematical reasoning, the underlying ideas—self-estimated supervision, test-time adaptation, and reinforcement learning without labels—may generalize to other domains. As language models increasingly encounter tasks beyond their pre-training distribution, frameworks like TTRL offer a scalable path forward.

    Further exploration is needed to understand the theoretical convergence properties of TTRL and to evaluate its applicability in interactive or multi-agent scenarios. Nonetheless, TTRL provides a technically sound and computationally efficient foundation for enabling LLMs to evolve continuously from their own outputs.


    Check out the Paper and GitHub Page.
