Large Language Models (LLMs) have demonstrated remarkable abilities in tackling various reasoning tasks expressed in natural language, including math word problems, code generation, and planning. However, as the complexity of reasoning tasks increases, even the most advanced LLMs struggle with errors, hallucinations, and inconsistencies due to their auto-regressive nature. This challenge is particularly evident in tasks requiring multiple reasoning steps, where LLMs’ “System 1” thinking (fast and instinctive, but less accurate) falls short. More deliberative, logical “System 2” thinking becomes crucial for solving complex reasoning problems accurately and consistently.
Several approaches attempt to overcome these challenges. Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) aim to align LLM outputs with human expectations, and Direct Preference Optimization (DPO) and Aligner methods have been developed for the same purpose. To equip LLMs with planning capabilities, Tree-of-Thoughts (ToT), A* search, and Monte Carlo Tree Search (MCTS) have been applied. For math reasoning and code generation, techniques such as prompt engineering, fine-tuning on task-specific corpora, and training reward models have been explored. However, these methods often require extensive expertise, significant computational resources, or task-specific modifications, which limits their generalizability and efficiency.
Researchers from Skywork AI and Nanyang Technological University present Q*, a robust framework designed to enhance the multi-step reasoning capabilities of LLMs through deliberative planning. This approach formalizes LLM reasoning as a Markov Decision Process (MDP), where the state combines the input prompt and previous reasoning steps, the action represents the next reasoning step, and the reward measures task success. Q* introduces general methods for estimating optimal Q-values of state-action pairs, including offline reinforcement learning, best sequence selection from rollouts, and completion using stronger LLMs. By framing multi-step reasoning as a heuristic search problem, Q* employs plug-and-play Q-value models as heuristic functions within an A* search framework, guiding LLMs to select the most promising next steps efficiently.
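As a rough, non-authoritative illustration of this MDP framing (not the authors’ implementation), the sketch below models a state as the prompt plus the reasoning steps produced so far, and labels the Q-value of a state-action pair with the best terminal reward among sampled rollouts, in the spirit of the “best sequence selection from rollouts” option; `sample_rollouts` and `terminal_reward` are hypothetical helpers introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningState:
    """A state in the reasoning MDP: the input prompt plus the steps taken so far."""
    prompt: str
    steps: tuple = ()

    def extend(self, step: str) -> "ReasoningState":
        """Taking an action (appending the next reasoning step) yields a new state."""
        return ReasoningState(self.prompt, self.steps + (step,))


def q_label_from_rollouts(state, step, sample_rollouts, terminal_reward):
    """Label the Q-value of (state, step) with the best terminal reward among
    sampled completions that continue from it (hypothetical helper functions)."""
    completions = sample_rollouts(state.extend(step))   # list of completed step sequences
    return max(terminal_reward(state.prompt, c) for c in completions)
```

Such labels can then supervise a standalone Q-value model, which is what allows it to be plugged into the search without task-specific changes.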
The Q* framework formalizes multi-step reasoning as a heuristic search problem solved with A* search. Each state is assigned an f-value, computed as a weighted sum of its aggregated utility and a heuristic value: the aggregated utility comes from a process-based reward function, while the heuristic is an estimate of the state’s optimal Q-value. Q* offers three ways to estimate optimal Q-values, namely offline reinforcement learning, learning from rollouts, and approximation using stronger LLMs, all of which learn from training data without task-specific modifications. The deliberative planning loop maintains two sets of states, unvisited and visited. At each iteration, it selects the unvisited state with the highest f-value, expands it using the LLM policy, and updates both sets. This continues until a terminal state (a complete trajectory) is reached, at which point the answer is extracted from the final state.
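The following minimal sketch mirrors that loop under stated assumptions: `policy_llm`, `q_model`, and `process_reward` are hypothetical callables standing in for the LLM policy, the plug-and-play Q-value model, and the process-based reward, and the terminal check and weighting `lam` are illustrative conventions rather than the paper’s exact interface.

```python
import heapq
import itertools


def q_star_search(prompt, policy_llm, q_model, process_reward,
                  lam=1.0, max_expansions=100, top_k=4):
    """A*-style deliberation loop in the spirit of the description above (a sketch,
    not the paper's code). Each state is the prompt plus a partial chain of steps;
    its f-value is a weighted sum of the aggregated process reward and a learned
    Q-value estimate used as the heuristic."""
    tie = itertools.count()                  # tie-breaker for equal f-values in the heap
    unvisited = [(0.0, next(tie), ())]       # min-heap of (-f, tie, steps); start has no steps
    visited = set()

    for _ in range(max_expansions):
        if not unvisited:
            break
        _, _, steps = heapq.heappop(unvisited)              # state with the highest f-value
        if steps and steps[-1].lstrip().startswith("Final answer"):  # assumed terminal convention
            return steps                                     # complete trajectory reached
        visited.add(steps)
        # Expansion: ask the LLM policy for top-k candidate next reasoning steps.
        for step in policy_llm(prompt, steps, top_k):
            new_steps = steps + (step,)
            if new_steps in visited:
                continue
            # Aggregated utility: sum of process rewards over the partial trajectory.
            g = sum(process_reward(prompt, new_steps[:i + 1]) for i in range(len(new_steps)))
            h = q_model(prompt, new_steps)                   # heuristic: estimated optimal Q-value
            f = g + lam * h
            heapq.heappush(unvisited, (-f, next(tie), new_steps))  # negate f for the min-heap
    return None                                              # budget exhausted, no terminal state
```

Because the heuristic is a separately trained Q-value model rather than a task-specific utility, the same loop can, in principle, be reused across math and code tasks by swapping in the appropriate reward and policy.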
Q* demonstrated significant performance improvements across various reasoning tasks. On the GSM8K dataset, it enhanced Llama-2-7b to achieve 80.8% accuracy, surpassing ChatGPT-turbo. For the MATH dataset, Q* improved Llama-2-7b and DeepSeekMath-7b, reaching 55.4% accuracy, outperforming models like Gemini Ultra (4-shot). In code generation, Q* boosted CodeQwen1.5-7b-Chat to 77.0% accuracy on the MBPP dataset. These results consistently show Q*’s effectiveness in enhancing LLM performance across math reasoning and code generation tasks, outperforming traditional methods and some closed-source models.
Q* emerges as an effective method to overcome the challenge of multi-step reasoning in LLMs by introducing a robust deliberation framework. This approach enhances LLMs’ ability to solve complex problems that require in-depth, logical thinking beyond simple auto-regressive token generation. Unlike previous methods that rely on task-specific utility functions, Q* uses a versatile Q-value model trained solely on ground-truth data, making it easily adaptable to various reasoning tasks without modification. The framework employs plug-and-play Q-value models as heuristic functions, guiding LLMs effectively without task-specific fine-tuning and thus preserving performance across diverse tasks. Q*’s efficiency stems from considering only a single step ahead at each expansion, in contrast to more computationally intensive methods such as MCTS. Extensive experiments on math reasoning and code generation demonstrate Q*’s superior performance, highlighting its potential to significantly improve LLMs’ complex problem-solving capabilities.
Check out the Paper. All credit for this research goes to the researchers of this project.