
    Salesforce AI Research Introduced CodeXEmbed (SFR-Embedding-Code): A Code Retrieval Model Family Achieving #1 Rank on CoIR Benchmark and Supporting 12 Programming Languages

    January 19, 2025

    Code retrieval has become essential for developers in modern software development, enabling efficient access to relevant code snippets and documentation. Unlike traditional text retrieval, which effectively handles natural language queries, code retrieval must address unique challenges, such as programming languages’ structural variations, dependencies, and contextual relevance. With tools like GitHub Copilot gaining popularity, advanced code retrieval systems are increasingly vital for enhancing productivity and reducing errors.

    Existing retrieval models often struggle to capture programming-specific nuances like syntax, control flow, and variable dependencies. These limitations hinder problem-solving in code summarization, debugging, and translation between languages. While text retrieval models have seen significant advancements, they fail to meet the specific requirements of code retrieval, highlighting the demand for specialized models that improve accuracy and efficiency across diverse programming tasks.

    Models like CodeBERT, CodeGPT, and UniXcoder have addressed aspects of code retrieval using pre-trained architectures. Still, they are limited in scalability and versatility due to their smaller sizes and task-specific focus. Although Voyage-Code introduced large-scale capabilities, its closed-source nature restricts broader adoption. This highlights the critical need for an open-source, scalable code retrieval system that generalizes across multiple tasks.

    Researchers at Salesforce AI Research introduced CodeXEmbed, a family of open-source embedding models specifically designed for code and text retrieval. The family is released in three sizes, SFR-Embedding-Code-400M_R (400 million parameters), SFR-Embedding-Code-2B_R (2 billion parameters), and a 7-billion-parameter model, and addresses a wide range of programming languages and retrieval tasks. CodeXEmbed’s innovative training pipeline integrates 12 programming languages and transforms five distinct code retrieval categories into a unified framework. By supporting diverse tasks such as text-to-code, code-to-text, and hybrid retrievals, the model expands the boundaries of what retrieval systems can achieve, offering unprecedented flexibility and performance.
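
    The idea of folding distinct retrieval categories into one framework can be sketched as follows. This is a minimal illustration, not the paper's actual data pipeline; the task labels and record field names here are hypothetical:

    ```python
    # Each retrieval category is reduced to the same (query, answer) contract,
    # so a single retriever and training loop can cover all of them.
    # Task names and record fields are illustrative, not the paper's labels.
    def to_retrieval_pair(task: str, record: dict) -> tuple[str, str]:
        if task == "text-to-code":
            # Natural-language query -> matching code snippet
            return record["nl_query"], record["code"]
        if task == "code-to-text":
            # Code snippet -> its explanation or docstring
            return record["code"], record["docstring"]
        if task == "hybrid":
            # Mixed text + code query -> matching code snippet
            return record["nl_query"] + "\n" + record["code_context"], record["code"]
        raise ValueError(f"unknown task: {task}")
    ```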

    CodeXEmbed employs an innovative approach that transforms code-related tasks into a unified query-and-answer framework, enabling versatility across various scenarios. Text-to-code retrieval maps natural language queries to relevant code snippets, streamlining tasks like code generation and debugging. Code-to-text retrieval generates explanations and summaries of code, enhancing documentation and knowledge sharing. Hybrid retrieval integrates text and code data, effectively addressing complex queries requiring technical and descriptive insights. The model’s training leverages contrastive loss to optimize query-answer alignment while reducing irrelevant data influence. Advanced techniques like low-rank adaptation and token pooling boost efficiency without sacrificing performance.
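
    The contrastive objective described above, which pulls each query toward its matching answer while pushing away in-batch negatives, can be sketched as an InfoNCE-style loss. This is a generic illustration of the technique, not the paper's exact formulation; the temperature value is an assumption:

    ```python
    import numpy as np

    def info_nce_loss(queries: np.ndarray, answers: np.ndarray,
                      temperature: float = 0.05) -> float:
        """Contrastive (InfoNCE) loss over a batch of L2-normalized
        query/answer embeddings, using in-batch negatives.
        Both inputs have shape (batch, dim); pair i is (queries[i], answers[i])."""
        logits = queries @ answers.T / temperature       # (batch, batch) similarities
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        # The matching answer for query i sits on the diagonal.
        return float(-np.mean(np.diag(log_probs)))
    ```

    Minimizing this loss drives each query's similarity with its paired answer above its similarity with every other answer in the batch.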

    CodeXEmbed has been evaluated across various benchmarks. On the CoIR benchmark, a comprehensive code retrieval evaluation dataset covering 10 subsets and over 2 million entries, the 7-billion parameter model achieved a performance improvement of more than 20% over the previous state-of-the-art Voyage-Code model. Notably, the 400-million and 2-billion parameter models also outperformed Voyage-Code, demonstrating the architecture’s scalability across different sizes. CodeXEmbed also excelled in text retrieval tasks, with the 7-billion parameter model achieving an average score of 60 on the BEIR benchmark, a suite of 15 datasets covering diverse retrieval tasks such as question answering and fact-checking.

    The models can retrieve code and enhance end-to-end retrieval-augmented generation (RAG) systems. For instance, when applied to repository-level tasks like code completion and issue resolution, the 7-billion parameter model achieved notable results on benchmarks like RepoEval and SWE-Bench-Lite. RepoEval, focusing on repository-level code completion, saw top-1 accuracy improvements when the model retrieved contextually relevant snippets. In SWE-Bench-Lite, a curated dataset for GitHub issue resolution, CodeXEmbed outperformed traditional retrieval systems.
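
    The retrieval half of such a RAG pipeline can be sketched as below. The trigram-hash embedder is a deliberately crude stand-in for a real embedding model such as a CodeXEmbed checkpoint, and the prompt format is an assumption; the generation step is omitted:

    ```python
    import numpy as np

    def embed(text: str, dim: int = 256) -> np.ndarray:
        # Stand-in embedder: hashed character trigrams, L2-normalized.
        # A real pipeline would call a code-retrieval embedding model here.
        v = np.zeros(dim)
        for i in range(len(text) - 2):
            v[hash(text[i:i + 3]) % dim] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    def retrieve_context(query: str, snippets: list[str], k: int = 1) -> list[str]:
        # Rank repository snippets by cosine similarity to the query.
        q = embed(query)
        order = np.argsort([-float(q @ embed(s)) for s in snippets])
        return [snippets[i] for i in order[:k]]

    def build_prompt(query: str, snippets: list[str]) -> str:
        # Retrieval-augmented generation: prepend the retrieved code
        # to the prompt handed to a code LLM.
        context = "\n\n".join(retrieve_context(query, snippets))
        return f"Context:\n{context}\n\nTask: {query}"
    ```

    Swapping the stand-in `embed` for a stronger retriever is what lifts downstream top-1 accuracy on repository-level tasks: the generator only sees whatever context the retriever surfaces.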

    Key takeaways from the research highlight the contributions and implications of CodeXEmbed in advancing code retrieval:

    1. The 7-billion parameter model achieved state-of-the-art performance, with over 20% improvement on the CoIR benchmark and competitive results on BEIR. It demonstrated versatility across code and text tasks.  
    2. The 400-million and 2-billion parameter models offer practical alternatives for environments where computational resources are limited.  
    3. The models address a broad spectrum of code-related applications by unifying 12 programming languages and five retrieval categories.  
    4. Unlike closed systems such as Voyage-Code, CodeXEmbed promotes community-driven research and innovation.  
    5. Integration with retrieval-augmented generation systems improves outcomes for tasks like code completion and issue resolution.  
    6. Using contrastive loss and token pooling optimizes retrieval accuracy and model adaptability.

    In conclusion, Salesforce’s introduction of the CodeXEmbed family advances code retrieval. These models demonstrate unmatched versatility and scalability by achieving state-of-the-art performance on the CoIR benchmark and excelling in text retrieval tasks. The multilingual and multi-task unified framework, supporting 12 programming languages, positions CodeXEmbed as a pivotal tool for developers and researchers. Its open-source accessibility encourages community-driven innovation while bridging the gap between natural language and code retrieval.


    Check out the Paper, 400M Model, and 2B Model. All credit for this research goes to the researchers of this project.


    The post Salesforce AI Research Introduced CodeXEmbed (SFR-Embedding-Code): A Code Retrieval Model Family Achieving #1 Rank on CoIR Benchmark and Supporting 12 Programming Languages appeared first on MarkTechPost.
