
    ToolHop: A Novel Dataset Designed to Evaluate LLMs in Multi-Hop Tool Use Scenarios

    January 11, 2025

    Multi-hop queries have long been difficult for LLM agents because they require several reasoning steps and information from different sources. That makes them useful for probing a model’s comprehension, reasoning, and function-calling capabilities. With new large models appearing constantly, each claiming unparalleled capabilities, multi-hop tool use offers a realistic test: the model receives a complex query, must decompose it into atomic parts, and must solve it iteratively by invoking appropriate tools and using their results. Multi-hop tool evaluation has also become pivotal for advancing models toward generalized intelligence.
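To make the idea of decomposing a query into atomic parts concrete, here is a minimal Python sketch. It is not taken from ToolHop: the tool names (`get_director`, `get_birth_year`), the toy data, and the `$prev` placeholder convention are all assumptions made for illustration, and the plan is hard-coded, whereas in a real evaluation the model itself would choose each call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy tools. The names and the data are invented for illustration.
def get_director(film: str) -> str:
    return {"Example Film": "Jane Doe"}[film]

def get_birth_year(person: str) -> int:
    return {"Jane Doe": 1970}[person]

TOOLS: Dict[str, Callable[[str], object]] = {
    "get_director": get_director,
    "get_birth_year": get_birth_year,
}

# Query: "In what year was the director of 'Example Film' born?"
# Decomposed into two atomic hops; "$prev" means "use the previous result".
PLAN: List[Tuple[str, str]] = [
    ("get_director", "Example Film"),
    ("get_birth_year", "$prev"),
]

def run_plan(plan: List[Tuple[str, str]],
             tools: Dict[str, Callable[[str], object]]) -> Tuple[object, List[object]]:
    """Execute each hop in order, feeding each result into the next hop."""
    previous = None
    trace = []
    for tool_name, argument in plan:
        value = previous if argument == "$prev" else argument
        previous = tools[tool_name](value)
        trace.append(previous)
    return previous, trace

answer, trace = run_plan(PLAN, TOOLS)
print(answer, trace)
```

The second hop cannot run until the first has returned, which is the interdependence that a multi-hop benchmark is meant to exercise.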

    Existing work in this field falls short of offering a reliable evaluation method. The methods proposed so far rely on tool-driven data construction, in which queries are simulated for a given collection of tools. This approach cannot guarantee that the collected tools are interdependent, so it does not truly assess multi-hop reasoning. In addition, the absence of verifiable answers introduces model bias and evaluation errors. This article discusses recent research that presents a more reliable way to assess the multi-hop tool use capabilities of a large language model.

    Researchers from Fudan University and ByteDance presented ToolHop, a dataset designed explicitly for multi-hop tool evaluation, with 995 rigorously designed user queries and 3,912 associated tools. ToolHop aims to address the problems described above through diverse queries, locally executable tools, meaningful interdependencies, detailed feedback, and verifiable answers. The authors propose a novel query-driven data construction approach that expands a single multi-hop query into a comprehensive multi-hop tool use test case.

    The proposed novel scheme comprises three key stages: tool creation, document refinement, and code generation.

    Tool Creation: A preliminary set of tool documents is created from the user-provided multi-hop query. The query is resolved into atomic parts, each handled individually, so that the resulting documents stay interdependent and relevant. In this way, the documents capture the essence of the query and are structured so that similar queries can be generated, ensuring modularity and cohesion.

    Document Refinement: The prepared tool documents undergo comprehensive refinement to support the evaluation of models in complex multi-hop scenarios. New features such as result filtering and customizable output formats are introduced to expand functionality while preserving the original purpose of each tool. In parallel, the number of parameters is increased and their types are optimized.

    Code Generation: At this stage, locally executable functions are generated from the prepared tool documents. Tools are invoked externally through these functions, enabling seamless multi-turn interactions between the model and the tools.
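The three stages can be pictured with a small Python sketch. Everything in it is hypothetical: the layout of `TOOL_DOC`, the `output_format` parameter standing in for a refinement-stage feature, the `invoke` helper, and the data are assumptions for illustration, not ToolHop's actual documents or generated code.

```python
import json
from typing import Callable, Dict

# Hypothetical refined tool document (the schema layout is an assumption).
TOOL_DOC = {
    "name": "get_birth_year",
    "description": "Return the birth year of a person.",
    "parameters": {
        "person": {"type": "string", "required": True},
        # Added during refinement in this sketch: a customizable output format.
        "output_format": {"type": "string", "enum": ["int", "text"], "required": False},
    },
}

_BIRTH_YEARS = {"Jane Doe": 1970}  # invented data

def get_birth_year(person: str, output_format: str = "int"):
    """Locally executable function matching TOOL_DOC."""
    if person not in _BIRTH_YEARS:
        raise ValueError(f"unknown person: {person}")
    if output_format not in ("int", "text"):
        raise ValueError(f"unsupported output_format: {output_format}")
    year = _BIRTH_YEARS[person]
    return year if output_format == "int" else f"{person} was born in {year}"

REGISTRY: Dict[str, Callable] = {TOOL_DOC["name"]: get_birth_year}

def invoke(call_json: str) -> str:
    """Run a model-issued call and return the result or an error message."""
    try:
        call = json.loads(call_json)
        result = REGISTRY[call["name"]](**call["arguments"])
        return str(result)
    except Exception as exc:  # surfaced to the model as feedback
        return f"Error: {type(exc).__name__}: {exc}"

print(invoke('{"name": "get_birth_year", "arguments": {"person": "Jane Doe"}}'))
```

Returning the error text instead of raising lets a model see what went wrong and retry on the next turn, which is one plausible way to provide the detailed feedback the article mentions.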

    The research team implemented the approach with queries drawn from the MoreHopQA dataset. To validate ToolHop, they carried out a rigorous five-dimensional analysis. They then used ToolHop to evaluate fourteen LLMs from five families, including both open-source and closed-source models. The evaluation was designed to assess answer correctness and invocation errors. The authors observed that using tools increased the models’ performance by up to 12% on average and by up to 23% for GPT models. Even with that gain, the best-performing model achieved only 49.04% answer correctness. Also, despite using tools in response to multi-hop queries, models hallucinated around 10% of the time.
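As a rough illustration of the two quantities mentioned above, the sketch below computes an answer-correctness percentage and an invocation-error percentage over a handful of made-up records. The `Record` fields, the normalized exact-match rule, and the numbers are assumptions; the paper's actual scoring procedure is not described in this article.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    predicted: str     # model's final answer
    gold: str          # verifiable reference answer
    tool_calls: int    # tool invocations attempted
    failed_calls: int  # invocations that raised an error

def normalize(text: str) -> str:
    """Assumed matching rule: ignore case and surrounding whitespace."""
    return text.strip().lower()

def answer_correctness(records: List[Record]) -> float:
    """Percentage of records whose answer matches the reference."""
    if not records:
        return 0.0
    correct = sum(normalize(r.predicted) == normalize(r.gold) for r in records)
    return 100 * correct / len(records)

def invocation_error_rate(records: List[Record]) -> float:
    """Percentage of all tool calls that failed."""
    total = sum(r.tool_calls for r in records)
    failed = sum(r.failed_calls for r in records)
    return 100 * failed / total if total else 0.0

# Invented records, not results from the paper.
records = [
    Record("1970", "1970", tool_calls=3, failed_calls=0),
    Record(" Paris ", "paris", tool_calls=2, failed_calls=1),
    Record("42", "41", tool_calls=3, failed_calls=0),
    Record("", "blue", tool_calls=2, failed_calls=0),
]
print(answer_correctness(records), invocation_error_rate(records))
```

Having a verifiable reference answer for every query is what makes a simple comparison like this possible without relying on another model as a judge.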

    Conclusion: 

    This paper presents a comprehensive dataset for evaluating how LLMs solve multi-hop queries, built from specially designed queries and tools. The main finding from the experiments is that, while tools significantly improve the ability of LLMs to solve complex multi-hop queries, their multi-hop tool use capabilities still leave considerable room for improvement.



    The post ToolHop: A Novel Dataset Designed to Evaluate LLMs in Multi-Hop Tool Use Scenarios appeared first on MarkTechPost.
