
    TikTok Researchers Introduce SWE-Perf: The First Benchmark for Repository-Level Code Performance Optimization

    July 22, 2025

    Introduction

    As large language models (LLMs) advance in software engineering tasks—ranging from code generation to bug fixing—performance optimization remains an elusive frontier, especially at the repository level. To bridge this gap, researchers from TikTok and collaborating institutions have introduced SWE-Perf—the first benchmark specifically designed to evaluate the ability of LLMs to optimize code performance in real-world repositories.

    Unlike prior benchmarks focused on correctness or function-level efficiency (e.g., SWE-Bench, Mercury, EFFIBench), SWE-Perf captures the complexity and contextual depth of repository-scale performance tuning. It provides a reproducible, quantitative foundation to study and improve the performance optimization capabilities of modern LLMs.

    Image source: https://arxiv.org/abs/2507.12415

    Why SWE-Perf Is Needed

    Real-world codebases are often large, modular, and intricately interdependent. Optimizing them for performance requires understanding of cross-file interactions, execution paths, and computational bottlenecks—challenges beyond the scope of isolated function-level datasets.

    LLMs today are largely evaluated on tasks like syntax correction or small function transformations. But in production environments, performance tuning across repositories can yield more substantial system-wide benefits. SWE-Perf is explicitly built to measure LLM capabilities in such settings.

    Image source: https://arxiv.org/abs/2507.12415

    Dataset Construction

    SWE-Perf is constructed from over 100,000 pull requests across high-profile GitHub repositories. The final dataset covers 9 repositories and includes the following (a hypothetical sketch of one instance follows the list):

    • 140 curated instances demonstrating measurable and stable performance improvements.
    • Complete codebases pre- and post-optimization.
    • Target functions categorized as oracle (file-level) or realistic (repo-level).
    • Unit tests and Docker environments for reproducible execution and performance measurement.
    • Expert-authored patches used as gold standards.
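
    For illustration only, here is a rough sketch of what a single SWE-Perf instance could contain, based on the description above. The class and field names are assumptions for readability, not the dataset's actual schema.

        from dataclasses import dataclass

        @dataclass
        class SWEPerfInstance:
            # All names below are illustrative, not the dataset's real schema.
            repo: str                      # source GitHub repository
            pre_patch_commit: str          # codebase before the optimization
            post_patch_commit: str         # codebase after the expert optimization
            oracle_targets: list[str]      # file-level target functions
            realistic_targets: list[str]   # repo-level target functions
            unit_tests: list[str]          # tests used for correctness and timing
            docker_image: str              # environment for reproducible execution
            expert_patch: str              # gold-standard, human-authored diff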

    To ensure validity, each unit test must:

    1. Pass before and after the patch.
    2. Show statistically significant runtime gains over 20 repetitions (Mann-Whitney U test, p < 0.1).

    Performance is measured as the minimum performance gain (δ), which isolates improvements attributable to the patch while filtering out measurement noise.
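
    As a minimal sketch of the filtering step above (not the official SWE-Perf harness), the runtime criterion could be checked with a one-sided Mann-Whitney U test over the 20 repeated measurements; the function name below is an assumption for illustration.

        from scipy.stats import mannwhitneyu

        REPETITIONS = 20
        P_THRESHOLD = 0.1

        def passes_runtime_filter(pre_runtimes, post_runtimes, alpha=P_THRESHOLD):
            # Keep the instance only if post-patch runtimes are significantly
            # lower than pre-patch runtimes (one-sided Mann-Whitney U test).
            assert len(pre_runtimes) == len(post_runtimes) == REPETITIONS
            _, p_value = mannwhitneyu(post_runtimes, pre_runtimes, alternative="less")
            return p_value < alpha

    Under such a check, a test that consistently runs in roughly 1.0 s before the patch and roughly 0.8 s after it across the 20 repetitions would be kept, while heavily overlapping runtime distributions would be filtered out as noise.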

    Benchmark Settings: Oracle vs. Realistic

    • Oracle Setting: The model receives only the target functions and corresponding files. This setting tests localized optimization skills.
    • Realistic Setting: The model is given an entire repository and must identify and optimize performance-critical paths autonomously, a closer analog to how human engineers work (a rough sketch of the two input settings follows).
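
    As a rough illustration (not the paper's prompt format), the two settings differ only in how much context the model receives; the function names below are hypothetical.

        def build_oracle_input(target_functions: list[str], containing_files: dict[str, str]) -> dict:
            # Oracle: the model sees only the target functions and the files
            # that contain them, so the optimization is already localized.
            return {"functions": target_functions, "files": containing_files}

        def build_realistic_input(repository_root: str) -> dict:
            # Realistic: the model sees the whole repository and must find the
            # performance-critical code on its own before optimizing it.
            return {"repository_root": repository_root}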

    Evaluation Metrics

    SWE-Perf defines a three-tier evaluation framework, reporting each metric independently:

    1. Apply: Can the model-generated patch be applied cleanly?
    2. Correctness: Does the patch preserve functional integrity (all unit tests pass)?
    3. Performance: Does the patch yield measurable runtime improvement?

    The metrics are not aggregated into a single score, allowing more nuanced evaluation of tradeoffs between syntactic correctness and performance gains.
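
    A minimal sketch of such a three-stage check, assuming a Git repository and a pytest test suite (not the benchmark's actual evaluation code; commands and paths are illustrative):

        import subprocess
        import time

        def evaluate_patch(repo_dir: str, patch_file: str) -> dict:
            result = {"apply": False, "correctness": False, "runtime_s": None}

            # 1. Apply: does the model-generated patch apply cleanly?
            check = subprocess.run(["git", "apply", "--check", patch_file], cwd=repo_dir)
            result["apply"] = check.returncode == 0
            if not result["apply"]:
                return result
            subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)

            # 2. Correctness: do all unit tests still pass after the patch?
            start = time.perf_counter()
            tests = subprocess.run(["pytest", "-q"], cwd=repo_dir)
            result["correctness"] = tests.returncode == 0

            # 3. Performance: wall-clock runtime of the suite (a single run here;
            #    the benchmark repeats runs and applies its statistical filter).
            result["runtime_s"] = time.perf_counter() - start
            return result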

    Experimental Results

    The benchmark evaluates several top-tier LLMs under both oracle and realistic settings:

    Model                     Setting     Performance (%)
    Claude-4-opus             Oracle      1.28
    GPT-4o                    Oracle      0.60
    Gemini-2.5-Pro            Oracle      1.48
    Claude-3.7 (Agentless)    Realistic   0.41
    Claude-3.7 (OpenHands)    Realistic   2.26
    Expert (Human Patch)      –           10.85

    Notably, even the best-performing LLM configurations fall significantly short of human-level performance. The agent-based method OpenHands, built on Claude-3.7-sonnet, outperforms other configurations in the realistic setting but still lags behind expert-crafted optimizations.

    Key Observations

    • Agent-based frameworks like OpenHands are better suited for complex, multi-step optimization, outperforming direct model prompts and pipeline-based approaches like Agentless.
    • Performance degrades as the number of target functions increases—LLMs struggle with broader optimization scopes.
    • LLMs exhibit limited scalability in long-runtime scenarios, where expert patches continue to deliver performance gains.
    • Patch analysis shows LLMs focus more on low-level code structures (e.g., imports, environment setup), while experts target high-level semantic abstractions for performance tuning.

    Conclusion

    SWE-Perf represents a pivotal step toward measuring and improving the performance optimization capabilities of LLMs in realistic software engineering workflows. It uncovers a significant capability gap between existing models and human experts, offering a strong foundation for future research in repository-scale performance tuning. As LLMs evolve, SWE-Perf can serve as a north star guiding them toward practical, production-ready software enhancement at scale.


    Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project.
