
    This AI Paper Investigates Test-Time Scaling of English-Centric RLMs for Enhanced Multilingual Reasoning and Domain Generalization

    May 14, 2025

    Reasoning language models, or RLMs, are increasingly used to simulate step-by-step problem-solving by generating long, structured reasoning chains. These models break complex questions into simpler parts and build logical steps toward an answer. This chain-of-thought (CoT) approach has proven effective at improving output quality, especially in mathematical and logical tasks. Yet although many modern large models have multilingual capabilities, research and training have remained largely centered on English, leaving a gap in understanding how well these reasoning skills transfer to other languages.
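
    To make the CoT setup concrete, here is a minimal sketch of eliciting a reasoning chain from an instruction-tuned model with Hugging Face transformers. The model name and prompt wording are illustrative assumptions, not details from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any chat-tuned model works for this illustration.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{
    "role": "user",
    "content": ("A train covers 120 km in 1.5 hours, then 80 km in 1 hour. "
                "What is its average speed? Think step by step before answering."),
}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# The "think step by step" cue elicits a structured reasoning chain
# before the final answer.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```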

    One major challenge is that most RLMs are fine-tuned on English data, which limits their ability to reason effectively in other languages. This becomes especially problematic for low-resource languages that have limited training examples. The models may default to English thinking patterns, producing lower-quality outputs when prompted in another language. Furthermore, differences in language structure can cause reasoning errors, particularly when a model trained in one language is expected to infer logic in another without adequate linguistic alignment.

    Current techniques employ zero-shot or few-shot prompting strategies to manage these limitations, often using English as a pivot language. Some efforts instead present prompts in the same language as the query to preserve linguistic consistency. However, small models gain little from these strategies due to limited capacity, and even large models show inconsistent performance when reasoning in low-resource languages. Despite multilingual pretraining, the gap between the training language and the reasoning language continues to hinder accurate multilingual reasoning.
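
    As a concrete illustration of the two strategies above, the sketch below contrasts an English-pivot prompt with a linguistically consistent, in-language prompt. The wording is an assumption for illustration; the paper's actual templates are not reproduced here.

```python
# Illustrative prompt builders for the two strategies; the wording is an
# assumption, not the paper's templates.

def english_pivot(question: str) -> str:
    # English as pivot: reasoning happens in English even for non-English input.
    return (f"Question (any language): {question}\n"
            "Reason through this step by step in English, then state the answer.")

def in_language(question: str, language: str) -> str:
    # Linguistically consistent prompting: the instruction matches the query
    # language. In practice this line would itself be written in `language`;
    # English is used here only as a placeholder.
    return f"{question}\nReason step by step and answer in {language}."
```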

    A research team from Brown University and MBZUAI evaluated how increasing test-time computation, particularly through extended reasoning chains, affects the multilingual reasoning abilities of English-centric RLMs. They used s1 models, built on the Qwen2.5-Instruct architecture and fine-tuned on 1,000 English STEM reasoning samples. These models were tested across languages using benchmarks such as MGSM and Global-MMLU to answer four core questions: the effectiveness of crosslingual test-time scaling, language-mixing behavior, performance under language forcing, and cross-domain generalization.
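
    The s1 line of work scales test-time computation with "budget forcing": generation is capped at a fixed thinking-token budget, and if the model tries to stop early, a continuation cue such as "Wait" is appended so the chain keeps growing. The sketch below is a simplified version of that loop; the EOS-based stop detection and generation parameters are assumptions, not the authors' code.

```python
import torch

def budget_forced_generate(model, tokenizer, input_ids, budget=8000, cue="Wait"):
    """Simplified sketch of s1-style budget forcing. Real implementations
    track the end-of-thinking delimiter; plain EOS is used here for brevity."""
    ids, used = input_ids, 0
    while used < budget:
        out = model.generate(ids, max_new_tokens=budget - used)
        used += out.shape[-1] - ids.shape[-1]
        if used >= budget or out[0, -1].item() != tokenizer.eos_token_id:
            return out  # budget spent, or the model stopped for another reason
        # Early stop: drop the EOS token and nudge the model to keep reasoning.
        cue_ids = tokenizer(cue, return_tensors="pt",
                            add_special_tokens=False).input_ids.to(out.device)
        ids = torch.cat([out[:, :-1], cue_ids], dim=-1)
    return ids
```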

    In-depth experiments showed that models with more parameters significantly benefited from increased test-time thinking tokens. The 14B s1 model, when scaled to 8,000 thinking tokens, achieved an average accuracy of 81% across non-English languages in MGSM. It outperformed models like Qwen2.5-14B-Instruct by +23.1% in French and +41.6% in Swahili. Even though the model was trained only in English, its performance surpassed that of larger models such as DeepSeek’s R1-Distill-Qwen-32B in several high-resource languages. The study also found that reasoning in high-resource languages like Chinese and English is more efficient, requiring fewer tokens and delivering better results than in low-resource languages like Swahili or Telugu.

    A key observation was the “quote-and-think” behavior, where the model quoted non-English phrases from prompts and reasoned in English. This consistent pattern across languages like Japanese and Russian suggested that the model used its multilingual understanding to interpret non-English input without direct translation. Language-forcing experiments further confirmed that forcing reasoning in high-resource languages yielded better results, while strict reasoning in low-resource languages led to significant accuracy drops and computational inefficiencies.
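
    Language forcing in these experiments means pinning the reasoning language through explicit instruction. A minimal sketch, assuming a chat-style template (the authors' exact prompts are not reproduced here):

```python
# Sketch of language forcing: pin the reasoning language with an explicit
# instruction. The template wording is an assumption, not the paper's prompt.
def language_forced_messages(question: str, reasoning_lang: str) -> list[dict]:
    return [
        {"role": "system",
         "content": f"Perform all step-by-step reasoning strictly in "
                    f"{reasoning_lang}, then give your final answer."},
        {"role": "user", "content": question},
    ]

# e.g., force high-resource English reasoning for a low-resource-language query:
msgs = language_forced_messages("<MGSM question in Swahili>", "English")
```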

    Despite strong results in STEM-related tasks, performance gains did not transfer to domains like cultural commonsense or humanities. In benchmarks like FORK, increasing thinking tokens sometimes reduced performance, indicating overthinking. The study concludes that while test-time scaling enhances multilingual reasoning in high-resource languages, it does not generalize effectively to out-of-domain tasks or low-resource languages, indicating the need for further research on balanced multilingual training and domain adaptation.


    Check out the Paper. All credit for this research goes to the researchers of this project.
