
    Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data

    May 19, 2025

    Recent work has shown that reinforcement learning (RL) can significantly enhance the reasoning abilities of LLMs. Building on this progress, the study aims to improve Audio LLMs, models that process audio and text to perform tasks such as question answering. MMAU is a widely used benchmark for evaluating these models, featuring multiple-choice questions on sounds, speech, and music, some of which require external knowledge. A prior approach, R1-AQA, used GRPO (Group Relative Policy Optimization) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on MMAU. Inspired by this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, further improving performance. They also introduced a method to automatically generate audio QA data, which improved results further.

    Compared to methods like SARI, which uses a more complex mix of supervised fine-tuning and RL with structured reasoning, the authors’ approach is simpler, relying solely on RL without explicit reasoning steps. They also conducted experiments with text-only inputs to investigate the role of GRPO in performance gains. Surprisingly, fine-tuning the models using just text data yielded nearly the same improvements as training with audio and text. This finding suggests that GRPO primarily enhances the model’s reasoning ability through text, significantly contributing to its improved performance in audio QA tasks. 

    Researchers from MIT CSAIL, Goethe University, IBM Research, and others introduce Omni-R1, a version of the multimodal LLM Qwen2.5-Omni fine-tuned with the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 sets new state-of-the-art results on the MMAU benchmark across all audio categories. Surprisingly, much of the improvement stems from enhanced text-based reasoning rather than audio input: fine-tuning with text-only data also led to notable performance gains. Additionally, the team generated large-scale audio QA datasets using ChatGPT, further boosting accuracy. Their work highlights the significant impact of text reasoning on audio LLM performance, and the authors state that they will release all resources publicly.

    The Omni-R1 model fine-tunes Qwen2.5-Omni using the GRPO reinforcement learning method with a simple prompt format that asks the model to select an answer directly, which keeps training memory-efficient on 48GB GPUs. GRPO avoids a value function by comparing a group of sampled outputs for the same question, using a reward based solely on answer correctness. To expand the training data, the researchers took audio captions from Qwen2-Audio and prompted ChatGPT to generate new question-answer pairs. This method produced two datasets, AVQA-GPT and VGGS-GPT, covering 40k and 182k audio clips, respectively. Training on these automatically generated datasets improved performance, with VGGS-GPT helping Omni-R1 achieve state-of-the-art accuracy on the MMAU benchmark.
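
The group comparison described above can be illustrated with a minimal sketch. It assumes the commonly described GRPO formulation, in which each sampled answer's reward is normalized by the mean and standard deviation of its group. The function names, the group size of four, and the small `eps` stabilizer are hypothetical choices for illustration, not details taken from the Omni-R1 paper.

```python
from statistics import mean, pstdev


def correctness_reward(predicted: str, gold: str) -> float:
    """Reward is 1.0 for a correct multiple-choice answer, else 0.0."""
    return 1.0 if predicted.strip().upper() == gold.strip().upper() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Score each sample relative to its own group.

    No learned value function is needed: the group mean acts as the
    baseline, and the group standard deviation rescales the signal.
    `eps` (an assumption) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four answers sampled for one question whose
# correct choice is "A".
sampled_answers = ["A", "B", "A", "C"]
rewards = [correctness_reward(ans, "A") for ans in sampled_answers]
advantages = group_relative_advantages(rewards)

for ans, adv in zip(sampled_answers, advantages):
    print(f"answer={ans} advantage={adv:+.3f}")
```

Correct answers receive a positive advantage and incorrect ones a negative advantage, so the policy update favors the sampled outputs that beat their group's average. When every answer in a group earns the same reward, all advantages are zero and that group contributes no learning signal.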

    The researchers fine-tuned Qwen2.5-Omni using GRPO on the AVQA, AVQA-GPT, and VGGS-GPT datasets. Results show notable performance gains, with training on VGGS-GPT giving the best average score of 71.3% on MMAU Test-mini. Qwen2.5-Omni outperformed baselines, including SARI, and showed strong reasoning even without audio, suggesting robust text-based understanding. GRPO fine-tuning improved Qwen2-Audio more significantly due to its weaker initial text reasoning. Surprisingly, fine-tuning without audio boosted performance, and text-only datasets like ARC-Easy yielded comparable results. Improvements mainly stem from enhanced text reasoning, though audio-based fine-tuning remains slightly superior for optimal performance.

    In conclusion, Omni-R1 is an Audio LLM developed by fine-tuning Qwen2.5-Omni using the GRPO reinforcement learning method for enhanced audio question answering. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark across sounds, speech, music, and overall performance. Two new large-scale datasets, AVQA-GPT and VGGS-GPT, were created using automatically generated questions, further boosting model accuracy. Experiments show that GRPO mainly enhances text-based reasoning, significantly contributing to performance. Surprisingly, fine-tuning with only text (without audio) improved audio-based performance, highlighting the value of strong base language understanding. These findings offer cost-effective strategies for developing audio-capable language models. 


    Check out the Paper. All credit for this research goes to the researchers of this project.

    The post Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data appeared first on MarkTechPost.
