Microsoft Researchers Propose Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models

Large language models (LLMs) excel at language comprehension and reasoning tasks, but their spatial reasoning abilities, a vital aspect of human cognition, have received little attention. Humans demonstrate remarkable skill in mental imagery, a process termed the Mind’s Eye, which enables them to imagine the unseen world. This capability remains relatively unexplored in LLMs, highlighting a gap in their understanding of spatial concepts and in their ability to replicate human-like imagination.

Previous studies have documented the achievements of LLMs in language tasks but left their spatial reasoning abilities underexplored. While human cognition relies on spatial reasoning to interact with the environment, LLMs depend primarily on verbal reasoning. Humans augment spatial awareness through mental imagery, which supports tasks such as navigation and mental simulation, a concept studied extensively across neuroscience, philosophy, and cognitive science.

Microsoft researchers propose Visualization-of-Thought (VoT) prompting, which lets an LLM generate and manipulate mental images, similar to the human mind’s eye, for spatial reasoning. With VoT prompting, LLMs use a visuospatial sketchpad to visualise their reasoning steps, and each visualisation informs the subsequent steps. VoT is a zero-shot prompting method: it relies on LLMs’ capability to acquire mental images from text-based visual art, instead of on few-shot demonstrations or text-to-image techniques with CLIP.
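To make the zero-shot idea concrete, here is a minimal sketch of how such a prompt could be assembled for a text-grid task. The helper names `build_vot_prompt` and `build_plain_prompt`, the grid symbols, and the task text are hypothetical, and the instruction string is only modelled on the kind of wording VoT uses; treat its exact wording as an assumption. No demonstrations are included, which is what makes the prompt zero-shot.

```python
from typing import Sequence

# Assumed wording for the visualisation instruction (not verified against the paper).
VOT_INSTRUCTION = "Visualize the state after each reasoning step."


def build_plain_prompt(task_description: str, grid_rows: Sequence[str]) -> str:
    """Baseline prompt: task text plus a text-rendered grid, no visualisation request."""
    grid_text = "\n".join(grid_rows)
    return f"{task_description}\n\n{grid_text}"


def build_vot_prompt(task_description: str, grid_rows: Sequence[str]) -> str:
    """Zero-shot VoT-style prompt: the baseline prompt plus one appended instruction."""
    return f"{build_plain_prompt(task_description, grid_rows)}\n\n{VOT_INSTRUCTION}"


if __name__ == "__main__":
    # Hypothetical map: S = start, G = goal, # = wall, . = road.
    print(build_vot_prompt("Navigate from S to G.", ["S.#", "..G"]))
```

The only difference between the two prompts is the appended instruction, so any change in the model’s behaviour can be attributed to the request to visualise.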

VoT prompts LLMs to generate a visualisation after each reasoning step, forming interleaved reasoning traces. The visuospatial sketchpad tracks the visual state, which is represented by the partial solution at each step. This mechanism grounds LLMs’ reasoning in the visual context, improving their spatial reasoning in tasks such as navigation and tiling.
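The shape of such an interleaved trace can be illustrated outside any model. The sketch below is a hypothetical simulator, not the authors’ code: it applies navigation moves to a small text grid and renders the visual state after every step, which is the kind of step-then-picture output VoT asks the LLM to produce. The grid symbols (`#` wall, `.` road, `G` goal, `A` agent) and the function names are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Row/column offsets for each move name.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def render(grid: Sequence[str], pos: Tuple[int, int]) -> str:
    """Draw the grid as text with the agent marked 'A' at pos."""
    rows = []
    for r, row in enumerate(grid):
        cells = list(row)
        if r == pos[0]:
            cells[pos[1]] = "A"
        rows.append("".join(cells))
    return "\n".join(rows)


def interleaved_trace(grid: Sequence[str], start: Tuple[int, int],
                      moves: Sequence[str]) -> List[str]:
    """Return one entry per step: the verbal step followed by the rendered state."""
    pos = start
    steps = [f"Start\n{render(grid, pos)}"]
    for i, move in enumerate(moves, 1):
        dr, dc = MOVES[move]
        r, c = pos[0] + dr, pos[1] + dc
        inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
        if not inside or grid[r][c] == "#":
            raise ValueError(f"step {i}: move {move!r} leaves the road")
        pos = (r, c)
        steps.append(f"Step {i}: move {move}\n{render(grid, pos)}")
    return steps


if __name__ == "__main__":
    for entry in interleaved_trace(["..#", "#.G"], (0, 0), ["right", "down", "right"]):
        print(entry, end="\n\n")
```

Each entry pairs a reasoning step with the partial solution it leads to, so an invalid move is caught at the step where it happens instead of at the end of the plan.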

GPT-4 VoT surpasses the other settings across all tasks and metrics, indicating the effectiveness of visual state tracking. The performance gaps between settings are significant: in the natural language navigation task, GPT-4 VoT outperforms GPT-4 w/o VoT by 27%. Notably, GPT-4 CoT lags behind GPT-4V CoT in visual tasks, suggesting the advantage of grounding LLMs with a 2D grid for spatial reasoning.

The key contributions of this research are the following:

The paper explores LLMs’ mental imagery for spatial reasoning, analysing its nature and constraints while delving into its origin from code pre-training.

It introduces two unique tasks, “visual navigation” and “visual tiling,” accompanied by synthetic datasets. These offer diverse sensory inputs for LLMs and varying complexity levels, thereby providing a robust testbed for spatial reasoning research.

The researchers propose VoT prompting, which effectively elicits LLMs’ mental imagery for spatial reasoning, showcasing superior performance compared to other prompting methods and existing multimodal large language models (MLLMs). This capability resembles the human mind’s eye process, implying its potential applicability in enhancing MLLMs.
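As an illustration of what a synthetic navigation instance with adjustable complexity might look like, the sketch below carves a random self-avoiding path into a square text grid and returns both the map and the moves that solve it. This is not the paper’s dataset generator: the function name `make_navigation_map`, the symbols (`S` start, `G` goal, `.` road, `#` wall), and the use of path length as the complexity knob are all assumptions made for this example.

```python
import random
from typing import List, Tuple

DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_navigation_map(size: int, path_len: int, seed: int) -> Tuple[List[str], List[str]]:
    """Return (grid_rows, moves) for a random path of path_len moves on a size x size grid."""
    if path_len < 1:
        raise ValueError("path_len must be at least 1")
    rng = random.Random(seed)  # seeded so the same arguments give the same map
    for _ in range(1000):  # retry when a walk traps itself before reaching path_len
        pos = (rng.randrange(size), rng.randrange(size))
        visited = [pos]
        moves: List[str] = []
        while len(moves) < path_len:
            options = []
            for name, (dr, dc) in DELTAS.items():
                nxt = (pos[0] + dr, pos[1] + dc)
                if 0 <= nxt[0] < size and 0 <= nxt[1] < size and nxt not in visited:
                    options.append((name, nxt))
            if not options:
                break
            name, pos = rng.choice(options)
            moves.append(name)
            visited.append(pos)
        if len(moves) == path_len:
            grid = [["#"] * size for _ in range(size)]
            for r, c in visited:
                grid[r][c] = "."
            grid[visited[0][0]][visited[0][1]] = "S"
            grid[pos[0]][pos[1]] = "G"
            return ["".join(row) for row in grid], moves
    raise ValueError("could not place a path of that length")


if __name__ == "__main__":
    rows, solution = make_navigation_map(size=5, path_len=6, seed=0)
    print("\n".join(rows))
    print(solution)
```

Raising `path_len` yields longer routes with more turns, which is one simple way to vary difficulty while keeping a known ground-truth answer for scoring.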

In conclusion, the research introduces VoT, which mirrors the human cognitive function of visualising mental images. VoT enables LLMs to excel in multi-hop spatial reasoning tasks, surpassing MLLMs in visual tasks. Because this capability resembles the mind’s eye process, it also shows promise for MLLMs. The findings underscore VoT’s efficacy in enhancing spatial reasoning in LLMs and suggest its potential to advance multimodal language models.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post Microsoft Researchers Propose Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models appeared first on MarkTechPost.
