Whiteboard-of-Thought (WoT) Prompting: A Simple AI Approach to Enhance the Visual Reasoning Abilities of MLLMs Across Modalities

Large language models (LLMs) have transformed natural language processing (NLP) by demonstrating the effectiveness of increasing the number of parameters and training data for various reasoning tasks. One successful method, chain-of-thought (CoT) prompting, helps language models solve complex problems by breaking them into intermediate steps written as text before giving the final answer, focusing on tasks like arithmetic and symbolic reasoning. This poses an important question: can LLMs tackle tasks that humans solve using visual thinking? Research shows that even the best LLMs perform badly on tasks having visual and spatial reasoning.

To address these shortcomings, this paper discusses various existing approaches. The first approach is Intermediate reasoning for language models, in which the success of chain-of-thought (CoT) in arithmetic and symbolic reasoning tasks has attracted interest from the NLP community and beyond. The next approach is Tool usage and code augmentation. This approach is compared to using whiteboards, focusing on improving a language model with additional computation, in which a text buffer trained on Python execution traces is used. The last method is Visual and spatial reasoning in LLMs and MLLMs, where the limited success of these models on tasks requiring visual and spatial reasoning is noted. The ability of these models to connect knowledge from text to other areas, like vision, is still debated.

Researchers from Columbia University have proposed Whiteboard-of-Thought (WoT) prompting, a simple approach to enhance the visual reasoning abilities of MLLMs (multimodal large language models) across modalities. WoT prompting provides MLLMs a metaphorical â€˜whiteboardâ€™ where they can draw out reasoning steps as images and then return these images to the model for further processing. This method works without showing examples or special modules, using the modelsâ€™ existing ability to create code with libraries like Matplotlib and Turtle. This simple method achieves state-of-the-art results on four difficult natural language tasks that require visual and spatial reasoning.Â

The main aim of WoT is to give MLLMs the ability to create images and visually process them to answer queries better. Current MLLMs usually do not inherently possess the ability to produce outputs in the visual domain, so, researchers showed how to create visuals using a model that only generates texts. The images created for visual reasoning are minimal, abstract, and symbolic, and such visuals are developed using a natural process of code. Moreover, several scenarios were found where GPT-4o fails badly when using chain-of-thought, even achievingÂ 0% accuracy in some cases. In contrast, WoT can achieve up to 92% accuracy in the same scenarios.

The results of the experiments carried out by researchers show that LLMs using text perform best in a 2D grid setting but may perform badly in other types of geometries. The reason could be because of grid settings:

Being easier to represent as coordinates in text, especially in the form of a simple square.

Having more data available in this format online, such as tabular data, city grids, and 2D maze coding problems.

Humans often write about square grids in text, and grid cells, and use them to navigate physical spaces and map conceptual spaces. This poses interesting questions about how spatial understanding differs between humans and LLMs. The WoT performs consistently across various geometries, eliminating the dependencies on 2D-grid-specific textual knowledge and focusing on the general applications of the approach.Â Â Â

In conclusion, researchers from Columbia University have introduced WoT, a zero-shot method that enables visual reasoning across modalities in MLLMs. This is achieved by generating code that can create a visual, and then returning the visual back to the model for further reasoning. This paper shows WoTâ€™s capabilities across multiple tasks that need visual and spatial understanding, which have been difficult for current state-of-the-art models depending on text reasoning. However, WoT needs accurate vision systems, so future research should aim to improve state-of-the-art MLLMs to understand detailed geometric figures.Â

Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also,Â donâ€™t forget to follow us onÂ Twitter.Â

Join ourÂ Telegram Channel andÂ LinkedIn Group.

If you like our work, you will love ourÂ newsletter..

Donâ€™t Forget to join ourÂ 45k+ ML SubReddit

Create, edit, and augment tabular data with the first compound AI system, Gretel Navigator, now generallyÂ available! [Advertisement]

The post Whiteboard-of-Thought (WoT) Prompting: A Simple AI Approach to Enhance the Visual Reasoning Abilities of MLLMs Across Modalities appeared first on MarkTechPost.

Source: Read MoreÂ

Sunshine And March Vibes (2025 Wallpapers Edition)

The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

How To Fix Largest Contentful Paint Issues With Subpart Analysis

How To Prevent WordPress SQL Injection Attacks

I test a lot of AI coding tools, and this stunning new OpenAI release just saved me days of work

How to use your Android phone as a webcam when your laptop’s default won’t cut it

The 5 most customizable Linux desktop environments – when you want it your way

Gen AI use at work saps our motivation even as it boosts productivity, new research shows

Strategic Cloud Partner: Key to Business Success, Not Just Tech

Strategic Cloud Partner: Key to Business Success, Not Just Tech

Perficient’s “What If? So What?” Podcast Wins Gold at the 2025 Hermes Creative Awards

PIM for Azure Resources

Windows 11 24H2’s Settings now bundles FAQs section to tell you more about your system

Windows 11 24H2’s Settings now bundles FAQs section to tell you more about your system

You can now share an app/browser window with Copilot Vision to help you with different tasks

Microsoft will gradually retire SharePoint Alerts over the next two years

Whiteboard-of-Thought (WoT) Prompting: A Simple AI Approach to Enhance the Visual Reasoning Abilities of MLLMs Across Modalities

Nmap 7.96 Launches with Lightning-Fast DNS and 612 Scripts

CVE-2025-30419 – NI Circuit Design Suite SymbolEditor Out-of-Bounds Read Vulnerability

From caching to real-time analytics: Essential use cases for Amazon ElastiCache for Valkey

CVE-2025-32404 – RT-Labs P-Net OOB Write Vulnerability

2023: A Year of Groundbreaking Advances in AI and Computing

Rilasciata PorteuX 1.8: La Prima Distribuzione GNU/Linux con Xfce 4.20

CVE-2025-4487 – iSourcecode Gym Management System SQL Injection Vulnerability

Is it possible to make a jar file using Selenium java and Eclipse and run that file on any machine for the testing

CNCF Arm64 Pilot: Impact and Insights

Data is the new petroleum; companies need better pipelines — and better oil-spill clean-up methods

Whiteboard-of-Thought (WoT) Prompting: A Simple AI Approach to Enhance the Visual Reasoning Abilities of MLLMs Across Modalities

Related Posts