
    Salesforce AI Research Introduced CodeXEmbed (SFR-Embedding-Code): A Code Retrieval Model Family Achieving #1 Rank on CoIR Benchmark and Supporting 12 Programming Languages

    January 19, 2025

    Code retrieval has become essential to modern software development, giving developers efficient access to relevant code snippets and documentation. Unlike traditional text retrieval, which handles natural language queries effectively, code retrieval must address challenges unique to programming languages, such as structural variation, dependencies, and contextual relevance. With tools like GitHub Copilot gaining popularity, advanced code retrieval systems are increasingly vital for enhancing productivity and reducing errors.

    Existing retrieval models often struggle to capture programming-specific nuances such as syntax, control flow, and variable dependencies. These limitations hinder code summarization, debugging, and translation between languages. While text retrieval models have advanced significantly, they fall short of the specific requirements of code retrieval, underscoring the demand for specialized models that improve accuracy and efficiency across diverse programming tasks.

    Models like CodeBERT, CodeGPT, and UniXcoder have addressed aspects of code retrieval using pre-trained architectures, but their smaller sizes and task-specific focus limit their scalability and versatility. Although Voyage-Code introduced large-scale capabilities, its closed-source nature restricts broader adoption. This highlights the critical need for an open-source, scalable code retrieval system that generalizes across multiple tasks.

    Researchers at Salesforce AI Research introduced CodeXEmbed, a family of open-source embedding models designed specifically for code and text retrieval. The family comes in three sizes, 400 million (SFR-Embedding-Code-400M_R), 2 billion (SFR-Embedding-Code-2B_R), and 7 billion parameters, and addresses a wide range of programming languages and retrieval tasks. CodeXEmbed's training pipeline integrates 12 programming languages and transforms five distinct code retrieval categories into a unified framework. By supporting diverse tasks such as text-to-code, code-to-text, and hybrid retrieval, the models expand the boundaries of what retrieval systems can achieve, offering considerable flexibility and performance.
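
    As a rough sketch of how such an embedder might be called for text-to-code retrieval, the snippet below embeds a natural-language query and a code snippet and scores their similarity. The Hugging Face model id and the mean-pooling step are assumptions, not confirmed by the article, so the official model card should be consulted for the exact usage:

    ```python
    # Hedged sketch: embed a query and a code snippet, then score their
    # similarity. The model id and mean pooling are assumptions based on
    # the released SFR-Embedding-Code checkpoints; check the model card
    # for the exact pooling scheme and prompt format.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    MODEL = "Salesforce/SFR-Embedding-Code-400M_R"  # assumed HF model id
    tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL, trust_remote_code=True).eval()

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch).last_hidden_state  # (batch, seq, dim)
        # Mean-pool over non-padding tokens (assumed pooling scheme).
        mask = batch["attention_mask"].unsqueeze(-1)
        emb = (out * mask).sum(dim=1) / mask.sum(dim=1)
        return F.normalize(emb, dim=-1)

    query = embed(["reverse a linked list in place"])
    code = embed(["def reverse(head):\n    prev = None\n    while head:\n"
                  "        head.next, prev, head = prev, head, head.next\n"
                  "    return prev"])
    print((query @ code.T).item())  # cosine similarity; higher = better match
    ```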

    CodeXEmbed transforms code-related tasks into a unified query-and-answer framework, enabling versatility across scenarios. Text-to-code retrieval maps natural language queries to relevant code snippets, streamlining tasks like code generation and debugging. Code-to-text retrieval produces explanations and summaries of code, supporting documentation and knowledge sharing. Hybrid retrieval integrates text and code data to address complex queries that require both technical and descriptive insight. Training uses a contrastive loss that aligns each query with its matching answer while pushing away irrelevant candidates, and techniques such as low-rank adaptation and token pooling boost efficiency without sacrificing performance.
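
    The contrastive objective described above can be illustrated with a generic in-batch-negatives loss; this is a minimal sketch of the technique, not the paper's training code, and the temperature value is an arbitrary assumption:

    ```python
    # Generic contrastive (InfoNCE-style) loss for query-answer alignment.
    # Each query's positive is the answer at the same batch index; every
    # other answer in the batch serves as a negative. The temperature is
    # an assumed hyperparameter, not taken from the paper.
    import torch
    import torch.nn.functional as F

    def contrastive_loss(query_emb, answer_emb, temperature=0.05):
        q = F.normalize(query_emb, dim=-1)   # (B, D) query embeddings
        a = F.normalize(answer_emb, dim=-1)  # (B, D) answer embeddings
        logits = (q @ a.T) / temperature     # (B, B) similarity matrix
        labels = torch.arange(q.size(0), device=q.device)
        # Cross-entropy pulls each query toward its paired answer and
        # pushes it away from the other answers in the batch.
        return F.cross_entropy(logits, labels)

    loss = contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
    ```

    In practice, low-rank adaptation could be layered onto such a training loop with a library like Hugging Face's peft, and token pooling corresponds to collapsing per-token hidden states into a single vector, as in the pooling step of the earlier sketch.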

    CodeXEmbed has been evaluated across various benchmarks. On CoIR, a comprehensive code retrieval benchmark covering 10 subsets and over 2 million entries, the 7-billion-parameter model improved on the previous state-of-the-art Voyage-Code model by more than 20%. Notably, the 400-million- and 2-billion-parameter models also outperformed Voyage-Code, demonstrating the architecture's scalability across sizes. CodeXEmbed also excelled at text retrieval: the 7-billion-parameter model achieved an average score of 60 on BEIR, a suite of 15 datasets covering diverse retrieval tasks such as question answering and fact-checking.

    Beyond standalone retrieval, the models enhance end-to-end retrieval-augmented generation (RAG) systems. When applied to repository-level tasks such as code completion and issue resolution, the 7-billion-parameter model achieved notable results on benchmarks like RepoEval and SWE-Bench-Lite. On RepoEval, which focuses on repository-level code completion, retrieving contextually relevant snippets improved top-1 accuracy. On SWE-Bench-Lite, a curated dataset for GitHub issue resolution, CodeXEmbed outperformed traditional retrieval systems.
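
    As a rough illustration of how an embedder like this slots into a RAG pipeline for repository-level tasks, the sketch below retrieves the top-k snippets for a query and assembles them into a generation prompt. It reuses the hypothetical embed() helper from the first sketch, and the corpus contents are purely illustrative:

    ```python
    # Hedged RAG sketch: retrieve the most relevant snippets from a
    # repository, then hand them to any code LLM as context. Assumes the
    # embed() helper defined earlier, which returns normalized vectors.
    import torch

    corpus = [
        "def parse_config(path): ...",
        "class RetryPolicy: ...",
        "def connect_db(url, timeout=5): ...",
    ]
    corpus_emb = embed(corpus)  # (N, D)

    def retrieve(query, k=2):
        scores = (embed([query]) @ corpus_emb.T).squeeze(0)  # cosine scores
        top = torch.topk(scores, k=min(k, len(corpus))).indices.tolist()
        return [corpus[i] for i in top]

    snippets = retrieve("open a database connection with a timeout")
    prompt = "Context:\n" + "\n---\n".join(snippets) + "\n\nTask: explain the timeout handling."
    # `prompt` would then be passed to a generator model for completion,
    # summarization, or issue resolution.
    ```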

    Key takeaways from the research highlight the contributions and implications of CodeXEmbed in advancing code retrieval:

    1. The 7-billion parameter model achieved state-of-the-art performance, with over 20% improvement on the CoIR benchmark and competitive results on BEIR. It demonstrated versatility across code and text tasks.  
    2. The 400-million and 2-billion parameter models offer practical alternatives for environments where computational resources are limited.  
    3. The models address a broad spectrum of code-related applications by unifying 12 programming languages and five retrieval categories.  
    4. Unlike closed systems such as Voyage-Code, CodeXEmbed promotes community-driven research and innovation.  
    5. Integration with retrieval-augmented generation systems improves outcomes for tasks like code completion and issue resolution.  
    6. Using contrastive loss and token pooling optimizes retrieval accuracy and model adaptability.

    In conclusion, Salesforce’s introduction of the CodeXEmbed family advances code retrieval. These models demonstrate unmatched versatility and scalability by achieving state-of-the-art performance on the CoIR benchmark and excelling in text retrieval tasks. The multilingual and multi-task unified framework, supporting 12 programming languages, positions CodeXEmbed as a pivotal tool for developers and researchers. Its open-source accessibility encourages community-driven innovation while bridging the gap between natural language and code retrieval.


    Check out the Paper, 400M Model, and 2B Model. All credit for this research goes to the researchers of this project.


    Source: MarkTechPost
