Integrating reinforcement learning (RL) with large language models boosts LLM performance on specialty tasks, such as robotics control and interactive language tasks, that require sequential decision-making. Offline RL is one such technique in the spotlight today: it learns from static datasets without any further interaction with the environment. Yet despite its utility in single-turn scenarios, offline RL loses ground in multi-turn sequential applications. In practice, policy gradient methods are usually applied to LLMs and VLMs instead, sidestepping the complexity of value-based RL while achieving similar accuracy. This is puzzling: a technique that guides small models so well should, in principle, benefit even more from the massive data and adaptability of LLMs, yet it falls short.
Research has revealed that the answer to this riddle lies in the basic building blocks. Offline RL underperforms on LLMs because of a mismatch between the two training objectives: language models are trained to predict token likelihoods, whereas Q-learning trains the model to predict action values. During fine-tuning, offline RL therefore repurposes the learned likelihoods and their underlying representations for a different objective, and this manipulation discards information the pre-trained model holds about language, vision, and even sequence structure. Before turning to the latest research that proposes a remedy, the toy sketch below makes the objective mismatch concrete.
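The following is a minimal PyTorch sketch, purely illustrative (all shapes, tensor values, and names are assumptions, not anything from the paper): it contrasts the next-token cross-entropy an LM is pre-trained on with the TD regression loss that standard offline Q-learning would impose on the same logits.

```python
# Illustrative contrast of the two training objectives on toy tensors.
import torch
import torch.nn.functional as F

vocab_size, batch, gamma = 32, 4, 0.99

# Pretrained LM head output: logits over the token vocabulary.
logits = torch.randn(batch, vocab_size)
next_tokens = torch.randint(0, vocab_size, (batch,))

# (1) Language-model pre-training objective: next-token cross-entropy.
lm_loss = F.cross_entropy(logits, next_tokens)

# (2) Standard offline Q-learning objective: regress Q(s, a) onto the
#     Bellman backup target r + gamma * max_a' Q_target(s', a').
q_values = logits                           # value head reusing the same output width
target_q = torch.randn(batch, vocab_size)   # stand-in for a frozen target network
rewards = torch.zeros(batch)                # sparse rewards, typical for language tasks
backup = rewards + gamma * target_q.max(dim=-1).values
td_loss = F.mse_loss(q_values.gather(-1, next_tokens[:, None]).squeeze(-1), backup)

print(f"cross-entropy: {lm_loss.item():.3f}  TD regression: {td_loss.item():.3f}")
```

The point of the contrast is that the second loss pulls the same logits toward regression targets with a very different scale and meaning than token probabilities, which is where the representation loss described above comes from.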
Researchers from UC Berkeley present, in their paper "Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning," a new algorithm that unlocks the potential of RL without diminishing the abilities of the language model. The authors add weights to the traditional supervised fine-tuning objective so that the learned probabilities conservatively estimate the value function rather than the behavior policy. Concretely, they transform the maximum-likelihood loss into a weighted cross-entropy loss whose weights are obtained from the Bellman recurrence. This modification lets them avoid unstable regression objectives while preserving the maximum-likelihood structure of pre-training, and the resulting method competes head-to-head with state-of-the-art approaches.
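To make the idea concrete, here is a minimal sketch of a Bellman-weighted cross-entropy loss of the kind described. The exact weighting, the clamping, and the use of a frozen target model's probabilities are illustrative assumptions, not the authors' precise formulation.

```python
# Schematic Bellman-weighted cross-entropy (illustrative, not the paper's exact loss).
import torch
import torch.nn.functional as F

gamma = 0.99

def weighted_ce_loss(logits, actions, rewards, next_logits_target):
    """logits: (B, V) current model; actions: (B,) chosen token ids;
    rewards: (B,); next_logits_target: (B, V) frozen target model at the next state."""
    log_probs = F.log_softmax(logits, dim=-1)
    with torch.no_grad():
        # Bellman-style weight: reward plus discounted max of the target model's
        # next-step probabilities, treated as conservative Q estimates.
        next_probs = F.softmax(next_logits_target, dim=-1)
        w = rewards + gamma * next_probs.max(dim=-1).values
        w = w.clamp(0.0, 1.0)  # keep weights in a probability-like range (assumption)
    chosen_log_prob = log_probs.gather(-1, actions[:, None]).squeeze(-1)
    return -(w * chosen_log_prob).mean()

# Toy usage with random tensors.
B, V = 4, 32
logits = torch.randn(B, V, requires_grad=True)
loss = weighted_ce_loss(logits, torch.randint(0, V, (B,)), torch.zeros(B), torch.randn(B, V))
loss.backward()
print(round(loss.item(), 3))
```

Because the loss is still a cross-entropy over the LM's own token distribution, it stays close to the pre-training objective rather than switching to a regression target.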
Q-SFT takes a distinctive route. Instead of training value functions the conventional way, by fitting Q-values to their Bellman backup targets with a regression loss, the authors fine-tune directly on the probabilities learned during pre-training, using the proposed loss to ensure the Q-values are still captured. Q-SFT thus learns Q-values for multi-turn RL problems via supervised learning, without reinitializing weights or attaching new heads to represent Q-values; the learned probabilities can be initialized directly from the logits of a pre-trained LLM or VLM. This also positions Q-SFT favorably against other supervised-learning-based RL algorithms such as filtered behavior cloning and return-conditioned supervised learning.
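A minimal sketch of the inference side of this idea: because token probabilities double as conservative Q estimates, acting reduces to an argmax over the existing LM head, with no separate value head. The tiny stand-in network and candidate-action setup below are illustrative assumptions.

```python
# Acting by reading Q estimates off the LM head's probabilities (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 32, 16

# Stand-in for a pre-trained LM head; in practice this would be the LLM/VLM
# whose logits initialize the learned probabilities.
lm_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, vocab_size))

def act(state_features, candidate_action_ids):
    """Pick the candidate action whose token probability (≈ Q-value) is highest."""
    with torch.no_grad():
        probs = F.softmax(lm_head(state_features), dim=-1)  # probabilities serve as Q estimates
        q_estimates = probs[candidate_action_ids]
    return candidate_action_ids[q_estimates.argmax()].item()

state = torch.randn(hidden)
candidates = torch.tensor([3, 7, 11])  # token ids of valid actions in this state (assumed)
print("chosen action token:", act(state, candidates))
```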
Q-SFT combines aspects of Q-learning and supervised fine-tuning, so the authors tested it against the state of the art from both camps to see whether the combined approach could match each individually. To assess Q-SFT on multi-turn offline RL tasks, they assembled several benchmarks in which a language model must make sequential decisions. The first set came from the LMRL benchmark, testing sequential decision-making on Chess, Wordle, and Twenty Questions; Q-SFT outperformed both prompting and SFT among LLM baselines and Implicit Language Q-Learning among RL baselines on all three games. In the next set of tasks, the LLM acted as an agent performing interactive, tool-using web tasks, such as purchasing products on WebShop, where Q-SFT again achieved the highest relative score. To test its effectiveness with vision-language models, the authors evaluated on ALFWorld, a complex text-based environment with image observations in which the model carries out a variety of complicated tasks; Q-SFT came out ahead on 4 of the 6 task types and performed head-to-head with the baselines on the remaining two. The final task was robotic manipulation, where Q-SFT performed on par with the state of the art.
Conclusion: Q-SFT improves on conventional offline RL Q-learning systems by learning Q-values as probabilities, through an objective similar to supervised fine-tuning. On large language models, Q-SFT outperformed strong supervised and value-based RL baselines, and it matched the state of the art on vision and robotics tasks when integrated with VLMs and robotics transformers.
Check out the paper. All credit for this research goes to the researchers of this project.