
    Bytedance AI Research Releases FullStack Bench and SandboxFusion: Comprehensive Benchmarking Tools for Evaluating LLMs in Real-World Programming Scenarios

    December 8, 2024

    Code intelligence has advanced rapidly, driven by progress in large language models (LLMs). These models are increasingly used for automated programming tasks such as code generation, debugging, and testing. Because they work across many languages and domains, LLMs have become important tools in software development, data science, and computational problem-solving, and they are changing how complex programming tasks are approached and executed.

    A key gap in the current landscape is the lack of comprehensive benchmarks that reflect real-world programming demands. Existing evaluation datasets such as HumanEval, MBPP, and DS-1000 are often narrowly focused on specific domains, like advanced algorithms or machine learning, and fail to capture the diversity of full-stack programming. They also offer limited coverage of the multilingual and cross-domain capabilities that real-world software development requires. This gap is a major obstacle to measuring and advancing LLM performance.

    Researchers from ByteDance Seed and M-A-P have introduced FullStack Bench, a benchmark that evaluates LLMs across 11 distinct application domains and 16 programming languages. The domains covered include data analysis, desktop and web development, machine learning, and multimedia. They also developed SandboxFusion, a unified execution environment that automates code execution and evaluation in multiple languages. Together, these tools aim to provide a holistic framework for testing LLMs in real-world scenarios and to overcome the limitations of existing benchmarks.

    The FullStack Bench dataset contains 3,374 problems, each accompanied by unit test cases, reference solutions, and a difficulty label (easy, medium, or hard). Problems were curated using a combination of human expertise and LLM-assisted processes to ensure diversity and quality in question design. SandboxFusion runs FullStack Bench problems in secure, isolated execution environments that accommodate the requirements of different programming languages and their dependencies. It supports 23 programming languages and can also be used to benchmark LLMs on datasets beyond FullStack Bench, including popular benchmarks like HumanEval and MBPP.
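To make the idea of test-based evaluation concrete, here is a minimal sketch of how a benchmark item with a unit test might be run in a separate process. The `Problem` fields, the `run_candidate` helper, and the sample problem are hypothetical illustrations; they are not FullStack Bench's actual schema or SandboxFusion's API. A plain subprocess with a timeout also gives far weaker isolation than a real sandbox.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Problem:
    """Hypothetical benchmark item; field names are illustrative only."""
    prompt: str
    unit_test: str    # assertion code run against the candidate solution
    difficulty: str   # "easy", "medium", or "hard"


def run_candidate(problem: Problem, candidate_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate code passes the problem's unit test.

    The candidate and the test are written to a temporary file and run in a
    fresh Python interpreter, so a crash or an infinite loop in the candidate
    cannot take down the evaluator. This is NOT a security sandbox.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + problem.unit_test + "\n")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat a hung candidate as a failure.
            return False
        # A failed assertion or any uncaught exception gives a non-zero exit code.
        return result.returncode == 0


# Hypothetical sample problem, invented for this sketch.
problem = Problem(
    prompt="Write add(a, b) that returns the sum of two integers.",
    unit_test="assert add(2, 3) == 5\nassert add(-1, 1) == 0",
    difficulty="easy",
)
```

Calling `run_candidate(problem, "def add(a, b):\n    return a + b")` reports a pass, while a candidate that subtracts instead of adding fails the first assertion and is reported as a failure. A production sandbox would additionally restrict file system, network, and resource access and would handle each language's toolchain and dependencies.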

    The researchers conducted extensive experiments to evaluate the performance of various LLMs on FullStack Bench. Results revealed marked differences in performance across domains and programming languages. For example, while some models demonstrated strong basic programming and data analysis capabilities, others struggled with multimedia and operating system-related tasks. Pass@1, the primary evaluation metric, varied across domains, highlighting how difficult it is for models to adapt to diverse and complex programming tasks. The researchers report that SandboxFusion was a robust and efficient evaluation tool that supports a wider range of programming languages and dependencies than existing execution environments.
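For readers unfamiliar with the metric, the sketch below computes Pass@k using the unbiased estimator introduced alongside HumanEval: with n samples per problem of which c pass the tests, pass@k = 1 - C(n - c, k) / C(n, k), which reduces to c / n for k = 1. Whether FullStack Bench draws one sample or several per problem is not stated here, so the sampling setup, the `mean_pass_at_1` helper, and the problem ids in the usage note are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of samples that pass all
    tests, k: number of attempts allowed. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb returns 0 when n - c < k, so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average pass@1 over problems.

    `results` maps a problem id to the pass/fail outcome of each sample
    generated for that problem (an assumed data layout).
    """
    if not results:
        raise ValueError("no results to aggregate")
    scores = [
        pass_at_k(len(outcomes), sum(outcomes), 1)
        for outcomes in results.values()
    ]
    return sum(scores) / len(scores)
```

With a single sample per problem, `mean_pass_at_1` is simply the fraction of problems solved. For instance, `mean_pass_at_1({"p1": [True], "p2": [False], "p3": [True, False]})` averages per-problem scores of 1.0, 0.0, and 0.5. Grouping the `results` dictionary by domain or language before averaging would give the kind of per-domain breakdown the article describes.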

    The researchers also analyzed scaling behavior and found that increasing the parameter count generally improves model performance. However, they observed a performance decline for some models at higher scales. For example, the Qwen2.5-Coder series peaked at 14B parameters but showed a drop in performance at 32B and 72B. This finding underscores the importance of balancing model size and efficiency when optimizing LLM performance. The researchers also observed a positive correlation between compilation pass rates and test pass rates, emphasizing the need for precise, error-free code generation.

    Together, FullStack Bench and SandboxFusion represent a significant advance in evaluating LLMs. By addressing the limitations of existing benchmarks, these tools enable a more comprehensive assessment of LLM capabilities across diverse domains and programming languages. This research lays the groundwork for further innovation in code intelligence and emphasizes the importance of developing tools that accurately reflect real-world programming scenarios.


    Check out the Paper, FullStack Bench, and SandboxFusion. All credit for this research goes to the researchers of this project.


    The post Bytedance AI Research Releases FullStack Bench and SandboxFusion: Comprehensive Benchmarking Tools for Evaluating LLMs in Real-World Programming Scenarios appeared first on MarkTechPost.
