Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      June 1, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      June 1, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      June 1, 2025

      How To Prevent WordPress SQL Injection Attacks

      June 1, 2025

      My top 5 must-play PC games for the second half of 2025 — Will they live up to the hype?

      June 1, 2025

      A week of hell with my Windows 11 PC really makes me appreciate the simplicity of Google’s Chromebook laptops

      June 1, 2025

      Elden Ring Nightreign Night Aspect: How to beat Heolstor the Nightlord, the final boss

      June 1, 2025

      New Xbox games launching this week, from June 2 through June 8 — Zenless Zone Zero finally comes to Xbox

      June 1, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Student Record Android App using SQLite

      June 1, 2025
      Recent

      Student Record Android App using SQLite

      June 1, 2025

      When Array uses less memory than Uint8Array (in V8)

      June 1, 2025

      Laravel 12 Starter Kits: Definite Guide Which to Choose

      June 1, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      My top 5 must-play PC games for the second half of 2025 — Will they live up to the hype?

      June 1, 2025
      Recent

      My top 5 must-play PC games for the second half of 2025 — Will they live up to the hype?

      June 1, 2025

      A week of hell with my Windows 11 PC really makes me appreciate the simplicity of Google’s Chromebook laptops

      June 1, 2025

      Elden Ring Nightreign Night Aspect: How to beat Heolstor the Nightlord, the final boss

      June 1, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs

    Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs

    January 7, 2025

    In a time when global health faces persistent threats from emerging pandemics, the need for advanced biosurveillance and pathogen detection systems is increasingly evident. Traditional genomic analysis methods, while effective in isolated cases, often struggle to address the complexities of large-scale health monitoring. A significant challenge is identifying and understanding the genomic diversity in environments such as wastewater, which contains a rich mix of microbial and viral DNA and RNA. The rapid advancements in biological research have further emphasized the importance of scalable, accurate, and interpretable models to analyze vast amounts of metagenomic data, aiding in the prediction and mitigation of health crises.

    Researchers from the University of Southern California, Prime Intellect, and the Nucleic Acid Observatory have introduced METAGENE-1, a metagenomic foundation model. This 7-billion-parameter autoregressive transformer model is specifically designed to analyze metagenomic sequences. METAGENE-1 is trained on a dataset comprising over 1.5 trillion DNA and RNA base pairs derived from human wastewater samples, utilizing next-generation sequencing technologies and a tailored byte-pair encoding (BPE) tokenization strategy to capture the intricate genomic diversity present in these datasets. The model is open-sourced, encouraging collaboration and further advancements in the field.

    Technical Highlights and Benefits

    METAGENE-1’s architecture draws on modern transformer models, including GPT and Llama families. This decoder-only transformer uses a causal language modeling objective to predict the next token in a sequence based on preceding tokens. Its key features include:

    1. Dataset Diversity: The training data encompasses sequences from tens of thousands of species, representing the microbial and viral diversity found in human wastewater.
    2. Tokenization Strategy: The use of BPE tokenization enables the model to process novel nucleic acid sequences efficiently.
    3. Training Infrastructure: Advanced distributed training setups ensured stable training on large datasets despite hardware limitations.
    4. Applications: METAGENE-1 supports tasks like pathogen detection, anomaly detection, and species classification, making it valuable for metagenomic studies and public health research.

    These features enable METAGENE-1 to generate high-quality sequence embeddings and adapt to specific tasks, enhancing its utility in the genomic and public health domains.

    Results and Insights

    The capabilities of METAGENE-1 were assessed using multiple benchmarks, where it demonstrated notable performance. In a pathogen detection benchmark based on human wastewater samples, the model achieved an average Matthews correlation coefficient (MCC) of 92.96, significantly outperforming other models. Additionally, METAGENE-1 showed strong results in anomaly detection tasks, effectively distinguishing metagenomic sequences from other genomic data sources.

    In embedding-based genomic analyses, METAGENE-1 excelled on the Gene-MTEB benchmark, achieving a global average score of 0.59. This performance underscores its adaptability in both zero-shot and fine-tuning scenarios, reinforcing its value in handling complex and diverse metagenomic data.

    Conclusion

    METAGENE-1 represents a thoughtful integration of artificial intelligence and metagenomics. By leveraging transformer architectures, the model offers practical solutions for biosurveillance and pandemic preparedness. Its open-source release invites researchers to collaborate and innovate, advancing the field of genomic science. As challenges related to emerging pathogens and global pandemics continue, METAGENE-1 demonstrates how technology can play a crucial role in addressing public health concerns effectively and responsibly.


    Check out the Paper, Website, GitHub Page, and Model on Hugging Face. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 60k+ ML SubReddit.

    🚨 FREE UPCOMING AI WEBINAR (JAN 15, 2025): Boost LLM Accuracy with Synthetic Data and Evaluation Intelligence–Join this webinar to gain actionable insights into boosting LLM model performance and accuracy while safeguarding data privacy.

    The post Researchers from USC and Prime Intellect Released METAGENE-1: A 7B Parameter Autoregressive Transformer Model Trained on Over 1.5T DNA and RNA Base Pairs appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleThis AI Paper from Tel Aviv University Introduces GASLITE: A Gradient-Based Method to Expose Vulnerabilities in Dense Embedding-Based Text Retrieval Systems
    Next Article This AI paper from the Beijing Institute of Technology and Harvard Unveils TXpredict for Predicting Microbial Transcriptomes

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    June 1, 2025
    Machine Learning

    Enigmata’s Multi-Stage and Mix-Training Reinforcement Learning Recipe Drives Breakthrough Performance in LLM Puzzle Reasoning

    June 1, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    CVE-2025-48079 – Metagauss ProfileGrid Missing Authorization Vulnerability

    Common Vulnerabilities and Exposures (CVEs)

    The Oracle of Eldoria: Unveiling the Mysteries of the Twelfth Sense

    Artificial Intelligence

    This Amazon Echo Frames deal comes with an Echo Spot (and it’s cheaper than Meta Ray-Bans)

    News & Updates

    CVE-2025-4732 – TOTOLINK A3002R/A3002RU HTTP POST Request Handler Buffer Overflow

    Common Vulnerabilities and Exposures (CVEs)
    GetResponse

    Highlights

    Development

    Researchers from ETH Zurich, EPFL, and Microsoft Introduce QuaRot: A Machine Learning Method that Enables 4-bit Inference of LLMs by Removing the Outlier Features

    April 5, 2024

    Large language models (LLMs) have revolutionized various applications across industries by providing advanced natural language…

    Microsoft releases firmware updates for Surface Pro 11 and Surface 7

    June 20, 2024

    My favorite robot mower adds two more affordable ‘mini’ units for smaller yards

    January 5, 2025

    Richard Marko: Rethinking cybersecurity in the age of global challenges | Starmus Highlights

    December 7, 2024
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.