Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      May 31, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      May 31, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      May 31, 2025

      How To Prevent WordPress SQL Injection Attacks

      May 31, 2025

      Windows 11 version 25H2: Everything you need to know about Microsoft’s next OS release

      May 31, 2025

      Elden Ring Nightreign already has a duos Seamless Co-op mod from the creator of the beloved original, and it’ll be “expanded on in the future”

      May 31, 2025

      I love Elden Ring Nightreign’s weirdest boss — he bargains with you, heals you, and throws tantrums if you ruin his meditation

      May 31, 2025

      How to install SteamOS on ROG Ally and Legion Go Windows gaming handhelds

      May 31, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Oracle Fusion new Product Management Landing Page and AI (25B)

      May 31, 2025
      Recent

      Oracle Fusion new Product Management Landing Page and AI (25B)

      May 31, 2025

      Filament Is Now Running Natively on Mobile

      May 31, 2025

      How Remix is shaking things up

      May 30, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Windows 11 version 25H2: Everything you need to know about Microsoft’s next OS release

      May 31, 2025
      Recent

      Windows 11 version 25H2: Everything you need to know about Microsoft’s next OS release

      May 31, 2025

      Elden Ring Nightreign already has a duos Seamless Co-op mod from the creator of the beloved original, and it’ll be “expanded on in the future”

      May 31, 2025

      I love Elden Ring Nightreign’s weirdest boss — he bargains with you, heals you, and throws tantrums if you ruin his meditation

      May 31, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»Salesforce AI Introduces TACO: A New Family of Multimodal Action Models that Combine Reasoning with Real-World Actions to Solve Complex Visual Tasks

    Salesforce AI Introduces TACO: A New Family of Multimodal Action Models that Combine Reasoning with Real-World Actions to Solve Complex Visual Tasks

    January 13, 2025

    Developing effective multi-modal AI systems for real-world applications requires handling diverse tasks such as fine-grained recognition, visual grounding, reasoning, and multi-step problem-solving. Existing open-source multi-modal language models are found to be wanting in these areas, especially for tasks that involve external tools such as OCR or mathematical calculations. The abovementioned limitations can largely be attributed to single-step-oriented datasets that cannot provide a coherent framework for multiple steps of reasoning and logical chains of actions. Overcoming these will be indispensable for unlocking true potential in using multi-modal AI on complex levels.

    Current multi-modal models often rely on instruction tuning with direct-answer datasets or few-shot prompting approaches. The proprietary systems, like GPT-4, have demonstrated the ability to reason through CoTA chains effectively. At the same time, the open-source models face challenges from a lack of datasets and integration with tools. The earlier efforts, such as LLaVa-Plus and Visual Program Distillation, were also limited by small dataset sizes, poor-quality training data, and a focus on simple question-answering tasks, which limits them to more complex multi-modal problems that require more sophisticated reasoning and tool application.

    Researchers from the University of Washington and Salesforce Research have developed TACO, an innovative framework for training multi-modal action models using synthetic CoTA datasets. This work introduces several key advancements to address the limitations of prior methods. First, over 1.8 million traces have been generated using GPT-4 and programs in Python, while a subset of 293K examples was curated to present high quality after rigorous filtering techniques. These examples ensure the inclusion of diverse reasoning and action sequences critical for multi-modal learning. Second, TACO incorporates a robust set of 15 tools, including OCR, object localization, and mathematical solvers, enabling the models to handle complex tasks effectively. Third, advanced filtering and data-mixing techniques further optimize the dataset, emphasizing reasoning-action integration and fostering superior learning outcomes. This framework reinterprets multi-modal learning by enabling models to produce coherent multi-step reasoning while seamlessly integrating actions, thereby establishing a new benchmark for performance in intricate scenarios.

    The development of TACO involved training on a carefully curated CoTA dataset with 293K instances sourced from 31 different origins, including Visual Genome. These datasets contain a wide range of tasks such as mathematical reasoning, optical character recognition, and detailed visual understanding. It is highly heterogeneous, with the tools provided including object localization and language-based solvers that allow a wide range of reasoning and action tasks. The training architecture combined LLaMA3 as the linguistic basis with CLIP as the visual encoder thus establishing a strong multi-modal framework. Fine-tuning established hyperparameter tuning that focused on lowering learning rates and increasing the number of epochs for training to guarantee that the models could adequately solve complex multi-modal challenges.

    TACO’s performance across eight benchmarks demonstrates its substantial impact on advancing multi-modal reasoning capabilities. The system achieved an average accuracy improvement of 3.6% over instruction-tuned baselines, with gains as high as 15% on MMVet tasks that involve OCR and mathematical reasoning. Notably, the high-quality 293K CoTA dataset outperformed larger, less refined datasets, underscoring the importance of targeted data curation. Further performance improvements are achieved through adjustments in hyperparameter strategies, including tuning of vision encoders and optimization of learning rates. Table 2: Results show excellent performances by TACO compared to benchmarks, the latter was found to be exceptionally better in complex tasks involving integration of reasoning and action.

    TACO introduces a new methodology for multi-modal action modeling that effectively addresses the severe deficiencies of both reasoning and tool-based actions through high-quality synthetic datasets and innovative training methodologies. The research overcomes the constraints of traditional instruction-tuned models, and its developments are poised to change the face of real-world applications ranging from visual question answering to complex multi-step reasoning tasks.


    Check out the Paper, GitHub Page, and Project Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 65k+ ML SubReddit.

    🚨 FREE UPCOMING AI WEBINAR (JAN 15, 2025): Boost LLM Accuracy with Synthetic Data and Evaluation Intelligence–Join this webinar to gain actionable insights into boosting LLM model performance and accuracy while safeguarding data privacy.

    The post Salesforce AI Introduces TACO: A New Family of Multimodal Action Models that Combine Reasoning with Real-World Actions to Solve Complex Visual Tasks appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleF3OS – Debian-based Linux distribution
    Next Article Meta AI Introduces CLUE (Constitutional MLLM JUdgE): An AI Framework Designed to Address the Shortcomings of Traditional Image Safety Systems

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    May 31, 2025
    Machine Learning

    Cisco’s Latest AI Agents Report Details the Transformative Impact of Agentic AI on Customer Experience

    May 31, 2025
    Leave A Reply Cancel Reply

    Hostinger

    Continue Reading

    Microsoft 365’s best apps are about to get a speed boost — here’s when the rollout begins

    News & Updates

    How to compose JavaScript functions that take multiple parameters (the epic guide)

    Development

    “‘Enhanced’ is pushing it.” New STALKER remasters launch to ‘Mostly Negative’ Steam reviews as players bemoan blurry visuals, controversial changes

    News & Updates

    Unable to remove orphan mailbox from Outlook

    Operating Systems

    Highlights

    Artificial Intelligence

    What Is General Ledger Reconciliation?

    May 6, 2024

    General Ledger Reconciliation The General Ledger (GL) is a silent custodian of a company’s financial…

    Attackers Exploit Public .env Files to Breach Cloud and Social Media Accounts

    August 16, 2024

    Lenovo Legion Go S (SteamOS) vs Steam Deck: Which is the better Steam gaming handheld?

    January 7, 2025

    How to become a self-taught developer while supporting a family [Podcast #164]

    March 16, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.