
    Singapore University of Technology and Design (SUTD) Explores Advancements and Challenges in Multimodal Reasoning for AI Models Through Puzzle-Based Evaluations and Algorithmic Problem-Solving Analysis

    February 8, 2025

Following the success of large language models (LLMs), current research has extended beyond text-based understanding to multimodal reasoning tasks. These tasks integrate vision and language, a capability considered essential for artificial general intelligence (AGI). Cognitive benchmarks such as PuzzleVQA and AlgoPuzzleVQA evaluate an AI system's ability to process abstract visual information and perform algorithmic reasoning. Even with recent advances, LLMs struggle with multimodal reasoning, particularly pattern recognition and spatial problem-solving, and high computational costs compound these challenges.

Prior evaluations relied on symbolic benchmarks such as ARC-AGI and visual assessments like Raven's Progressive Matrices. However, these do not adequately challenge AI's ability to process multimodal inputs. Recently, datasets like PuzzleVQA and AlgoPuzzleVQA have been introduced to assess abstract visual reasoning and algorithmic problem-solving. These datasets require models to integrate visual perception, logical deduction, and structured reasoning. While previous models, such as GPT-4-Turbo and GPT-4o, demonstrated improvements, they still faced limitations in abstract reasoning and multimodal interpretation.

Researchers from the Singapore University of Technology and Design (SUTD) introduced a systematic evaluation of OpenAI's GPT-[n] and o-[n] model series on multimodal puzzle-solving tasks. Their study examined how reasoning capabilities evolved across model generations, aiming to identify gaps in AI's perception, abstract reasoning, and problem-solving skills. The team compared the performance of models such as GPT-4-Turbo, GPT-4o, and o1 on the PuzzleVQA and AlgoPuzzleVQA datasets, which span abstract visual puzzles and algorithmic reasoning challenges.

    The researchers conducted a structured evaluation using two primary datasets: 

    1. PuzzleVQA: abstract visual reasoning puzzles that require models to recognize patterns in numbers, shapes, colors, and sizes.
    2. AlgoPuzzleVQA: algorithmic problem-solving tasks that demand logical deduction and computational reasoning.
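To make the setup concrete, a benchmark item of this kind pairs a rendered puzzle image with a question and, in the multiple-choice setting, a set of answer options. The sketch below is illustrative only; the field names and prompt format are hypothetical, not the datasets' actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PuzzleInstance:
    image_path: str                      # rendered abstract puzzle (shapes, colors, grids)
    question: str                        # e.g. "What is the missing value?"
    options: Optional[list[str]] = None  # present in multiple-choice mode, absent in open-ended
    answer: str = ""                     # ground-truth label

def to_prompt(item: PuzzleInstance) -> str:
    """Format one instance as a text prompt (the image is sent to the model separately)."""
    prompt = item.question
    if item.options:
        letters = "ABCD"
        choices = "\n".join(f"({letters[i]}) {opt}" for i, opt in enumerate(item.options))
        prompt += "\n" + choices
    return prompt
```

Dropping the `options` field turns the same instance into its open-ended variant, which is how the study measures the multiple-choice-to-open-ended accuracy gap.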

    The evaluation was carried out using both multiple-choice and open-ended question formats. The study employed zero-shot Chain-of-Thought (CoT) prompting for reasoning and analyzed the performance drop when switching from multiple-choice to open-ended responses. The models were also tested under conditions where visual-perception details and inductive-reasoning guidance were provided separately, to diagnose specific weaknesses.
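An evaluation loop along these lines can be sketched as follows. This is a minimal illustration of zero-shot CoT prompting, not the paper's released code; `query_model` stands in for whatever multimodal API is used, and the "Answer:" extraction convention is an assumption:

```python
def evaluate_zero_shot_cot(items, query_model):
    """Score a model on (image, question, gold_answer) triples.

    query_model(image, prompt) -> str is a stand-in for a multimodal LLM call.
    Zero-shot CoT: append a reasoning trigger, then parse the final answer.
    """
    correct = 0
    for image, question, gold in items:
        prompt = question + "\nLet's think step by step."
        reply = query_model(image, prompt)
        # Assumed convention: the model ends its reply with "Answer: <final answer>".
        predicted = reply.rsplit("Answer:", 1)[-1].strip()
        correct += predicted.lower() == gold.lower()
    return correct / len(items)
```

Diagnosing perception versus reasoning then amounts to rerunning the same loop with the prompt augmented by a textual description of the image (perception provided) or by the governing pattern rule (inductive reasoning provided).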

    The study observed steady improvements in reasoning capabilities across different model generations. GPT-4o showed better performance than GPT-4-Turbo, while o1 achieved the most notable advancements, particularly in algorithmic reasoning tasks. However, these gains came with a sharp increase in computational cost. Despite overall progress, AI models still struggled with tasks that required precise visual interpretation, such as recognizing missing shapes or deciphering abstract patterns. While o1 performed well in numerical reasoning, it had difficulty handling shape-based puzzles. The difference in accuracy between multiple-choice and open-ended tasks indicated a strong dependence on answer prompts. Also, perception remained a major challenge across all models, with accuracy improving significantly when explicit visual details were provided.

    In summary, the work's findings can be distilled into a few points:

    1. The study observed a significant upward trend in reasoning capabilities from GPT-4-Turbo to GPT-4o and o1. While GPT-4o showed moderate gains, the transition to o1 resulted in notable improvements but came at a 750x increase in computational cost compared to GPT-4o.  
    2. Across PuzzleVQA, o1 achieved an average accuracy of 79.2% in multiple-choice settings, surpassing GPT-4o’s 60.6% and GPT-4-Turbo’s 54.2%. However, in open-ended tasks, all models exhibited performance drops, with o1 scoring 66.3%, GPT-4o at 46.8%, and GPT-4-Turbo at 38.6%.  
    3. In AlgoPuzzleVQA, o1 substantially improved over previous models, particularly in puzzles requiring numerical and spatial deduction. o1 scored 55.3%, compared to GPT-4o’s 43.6% and GPT-4-Turbo’s 36.5% in multiple-choice tasks. However, its accuracy declined by 23.1% in open-ended tasks.  
    4. The study identified perception as the primary limitation across all models. Injecting explicit visual details improved accuracy by 22%–30%, indicating a reliance on external perception aids. Inductive reasoning guidance further boosted performance by 6%–19%, particularly in numerical and spatial pattern recognition.  
    5. o1 excelled in numerical reasoning but struggled with shape-based puzzles, showing a 4.5% drop compared to GPT-4o in shape recognition tasks. Also, it performed well in structured problem-solving but faced challenges in open-ended scenarios requiring independent deduction.
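The multiple-choice-to-open-ended drops implied by the PuzzleVQA figures above can be recomputed directly from the quoted accuracies:

```python
# Reported PuzzleVQA accuracies (%) as (multiple-choice, open-ended) pairs.
puzzlevqa = {
    "o1": (79.2, 66.3),
    "GPT-4o": (60.6, 46.8),
    "GPT-4-Turbo": (54.2, 38.6),
}

for model, (mc, open_ended) in puzzlevqa.items():
    drop = mc - open_ended
    print(f"{model}: {mc}% MC -> {open_ended}% open-ended (drop {drop:.1f} pts)")
```

The drop is 12.9 to 15.6 percentage points for every model, which is what supports the study's conclusion that all three rely heavily on the presence of answer options.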

    Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


    The post Singapore University of Technology and Design (SUTD) Explores Advancements and Challenges in Multimodal Reasoning for AI Models Through Puzzle-Based Evaluations and Algorithmic Problem-Solving Analysis appeared first on MarkTechPost.
