Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      May 19, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      May 19, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      May 19, 2025

      How To Prevent WordPress SQL Injection Attacks

      May 19, 2025

      Computex

      May 19, 2025

      DOOM: The Dark Ages gets Path Tracing update in June, bringing better visuals for PC players

      May 19, 2025

      Early Memorial Day deals are LIVE on Windows PCs, gaming accessories, and more — 6 hand-picked discounts on our favorites

      May 19, 2025

      Microsoft open sources the Windows Subsystem for Linux — invites developers to help more seamlessly integrate Linux with Windows

      May 19, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      How JavaScript’s at() method makes array indexing easier

      May 19, 2025
      Recent

      How JavaScript’s at() method makes array indexing easier

      May 19, 2025

      Motherhood and Career Balance in Tech: Stories from Perficient LATAM

      May 19, 2025

      ES6: Set Vs Array- What and When?

      May 19, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Computex

      May 19, 2025
      Recent

      Computex

      May 19, 2025

      DOOM: The Dark Ages gets Path Tracing update in June, bringing better visuals for PC players

      May 19, 2025

      Early Memorial Day deals are LIVE on Windows PCs, gaming accessories, and more — 6 hand-picked discounts on our favorites

      May 19, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data

    Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data

    May 19, 2025

    Recent developments have shown that RL can significantly enhance the reasoning abilities of LLMs. Building on this progress, the study aims to improve Audio LLMs—models that process audio and text to perform tasks like question answering. The MMAU benchmark is a widely used dataset designed to evaluate these models, featuring multiple-choice questions on sounds, speech, and music, some of which require external knowledge. A prior approach, R1-AQA, used GRPO (Group Relative Policy Optimization) to fine-tune the Qwen2-Audio model on the AVQA dataset, achieving state-of-the-art (SOTA) results on MMAU. Inspired by this, the authors applied GRPO to fine-tune Qwen2.5-Omni-7B, a newer multimodal model, further improving performance. Additionally, they introduced a method to automatically generate audio QA data, leading to even better outcomes.

    Compared to methods like SARI, which uses a more complex mix of supervised fine-tuning and RL with structured reasoning, the authors’ approach is simpler, relying solely on RL without explicit reasoning steps. They also conducted experiments with text-only inputs to investigate the role of GRPO in performance gains. Surprisingly, fine-tuning the models using just text data yielded nearly the same improvements as training with audio and text. This finding suggests that GRPO primarily enhances the model’s reasoning ability through text, significantly contributing to its improved performance in audio QA tasks. 

    Researchers from MIT CSAIL, Goethe University, IBM Research, and others introduce Omni-R1, a fine-tuned version of the multi-modal LLM Qwen2.5-Omni using the GRPO reinforcement learning method. Trained on the AVQA dataset, Omni-R1 sets new state-of-the-art results on the MMAU benchmark across all audio categories. Surprisingly, much of the improvement stems from enhanced text-based reasoning rather than audio input. Fine-tuning with text-only data also led to notable performance gains. Additionally, the team generated large-scale audio QA datasets using ChatGPT, further boosting accuracy. Their work highlights the significant impact of text reasoning in audio LLM performance and promises the public release of all resources. 

    The Omni-R1 model fine-tunes Qwen2.5-Omni using the GRPO reinforcement learning method with a simple prompt format that allows direct answer selection, making it memory-efficient for 48GB GPUs. GRPO avoids a value function by comparing grouped outputs using a reward based solely on answer correctness. Researchers used audio captions from Qwen-2 Audio to expand training data and prompted ChatGPT to generate new question-answer pairs. This method produced two datasets—AVQA-GPT and VGGS-GPT—covering 40k and 182k audios, respectively. Training on these automatically generated datasets improved performance, with VGGS-GPT helping Omni-R1 achieve state-of-the-art accuracy on the MMAU benchmark. 

    The researchers fine-tuned Qwen2.5-Omni using GRPO on AVQA, AVQA-GPT, and VGGS-GPT datasets. Results show notable performance gains, with the best average score of 71.3% on the MAU Test-mini from VGGS-GPT. Qwen2.5-Omni outperformed baselines, including SARI, and showed strong reasoning even without audio, suggesting robust text-based understanding. GRPO fine-tuning improved Qwen2-Audio more significantly due to its weaker initial text reasoning. Surprisingly, fine-tuning without audio boosted performance, while text-only datasets like ARC-Easy yielded comparable results. Improvements mainly stem from enhanced text reasoning, though audio-based fine-tuning remains slightly superior for optimal performance.

    In conclusion, Omni-R1 is an Audio LLM developed by fine-tuning Qwen2.5-Omni using the GRPO reinforcement learning method for enhanced audio question answering. Omni-R1 achieves new state-of-the-art results on the MMAU benchmark across sounds, speech, music, and overall performance. Two new large-scale datasets, AVQA-GPT and VGGS-GPT, were created using automatically generated questions, further boosting model accuracy. Experiments show that GRPO mainly enhances text-based reasoning, significantly contributing to performance. Surprisingly, fine-tuning with only text (without audio) improved audio-based performance, highlighting the value of strong base language understanding. These findings offer cost-effective strategies for developing audio-capable language models. 


    Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit.

    The post Omni-R1: Advancing Audio Question Answering with Text-Driven Reinforcement Learning and Auto-Generated Data appeared first on MarkTechPost.

    Source: Read More 

    Hostinger
    Facebook Twitter Reddit Email Copy Link
    Previous ArticleCVE-2025-3078 – “Xerox Printer Passback Vulnerability”
    Next Article This AI Paper from Microsoft Introduces a DiskANN-Integrated System: A Cost-Effective and Low-Latency Vector Search Using Azure Cosmos DB

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    May 19, 2025
    Machine Learning

    HERE Technologies boosts developer productivity with new generative AI-powered coding assistant

    May 19, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    This excellent designer imagined what Windows Phone would look like in 2024, and it makes me sad

    Development

    New HIPAA Security Rule Would Strengthen Healthcare Cybersecurity

    Development

    Google AI Introduces Patchscopes: A Machine Learning Approach that Trains LLMs to Provide Natural Language Explanations of Their Hidden Representations

    Development

    Social Media Lying to You – The Dark Algorithm Conspiracy Revealed!

    Artificial Intelligence

    Highlights

    Spring Sale BOGO deal: Get 2 PNY laptop SSDs for the price of 1 at Amazon

    March 28, 2025

    Need to upgrade your laptop’s storage, or looking to set up a dual-drive gaming laptop?…

    Rilasciato Archinstall 3.0.3: l’installer di Arch Linux migliora partizionamento e bootloader

    March 31, 2025

    The Unexpected Roles of Web Designers

    February 5, 2025

    How to fetch data from PostgreSQL using PHP

    November 7, 2024
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.