    Unlocking the Language of Proteins: How Large Language Models Are Revolutionizing Protein Sequence Understanding

    June 14, 2024

    Researchers have drawn parallels between protein sequences and natural language because both are sequential structures, and this parallel has driven advances in deep learning models for both fields. LLMs have excelled at NLP tasks, inspiring attempts to adapt them to protein understanding. This adaptation faces a challenge, however: existing datasets lack direct correlations between protein sequences and text descriptions, hindering effective training and evaluation of LLMs for protein comprehension. Despite advances in multimodal language models (MMLMs), the absence of comprehensive datasets integrating protein sequences with textual content limits the full utilization of these models in protein science.
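
    To make the analogy concrete, here is a minimal sketch of how a protein chain can be tokenized like a sentence. The sequence fragment is hypothetical, and character-level tokenization is just one common convention; real protein LLMs may use learned vocabularies instead.

    # Treat an amino-acid sequence as a "sentence" of residue tokens.
    # The fragment below is hypothetical, used only for illustration.
    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"

    # Character-level tokens, one per residue
    tokens = list(sequence)
    print(tokens[:5])  # ['M', 'K', 'T', 'A', 'Y']

    # Map residues to integer ids, as a language model's vocabulary would
    vocab = {aa: i for i, aa in enumerate(sorted(set(sequence)))}
    ids = [vocab[aa] for aa in tokens]
    print(ids[:5])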

    Researchers from several institutions, including Johns Hopkins and UNSW Sydney, have created ProteinLMDataset to enhance LLMs’ understanding of protein sequences. This dataset contains 17.46 billion tokens for self-supervised pretraining and 893K instructions for supervised fine-tuning. They also developed ProteinLMBench, the first benchmark with 944 manually verified multiple-choice questions for evaluating protein comprehension in LLMs. The dataset and benchmark aim to bridge the gap in protein-text data integration, enabling LLMs to understand protein sequences without extra encoders and to generate accurate protein knowledge using the novel Enzyme Chain of Thought (ECoT) approach.
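
    The article does not reproduce the instruction format, but a supervised fine-tuning record pairing a sequence with ECoT-style stepwise reasoning might plausibly look like the sketch below. The field names, sequence, and prompt wording are illustrative assumptions, not the paper's published format.

    # A hypothetical fine-tuning record: instruction + protein sequence input
    # + a stepwise, chain-of-thought style answer. All content is illustrative.
    import json

    record = {
        "instruction": "Describe the catalytic function of the following enzyme.",
        "input": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # hypothetical sequence
        "output": (
            "Step 1: Identify conserved motifs in the sequence. "
            "Step 2: Relate the motifs to a known enzyme family. "
            "Step 3: State the reaction the enzyme catalyzes."
        ),
    }
    print(json.dumps(record, indent=2))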

    The literature review highlights key limitations in existing NLP and protein sequence datasets and benchmarks. Chinese-English datasets lack comprehensive, multi-task, and multi-domain evaluations, and existing benchmarks are often geographically restricted and short on interpretability. Among protein sequence datasets, major resources like UniProtKB and RefSeq struggle to fully represent protein diversity and to annotate data accurately, with biases and errors introduced by community contributions and automated systems. Databases like KEGG and STRING, while comprehensive, are limited by biases, resource-intensive curation, and difficulties in integrating diverse data sources.

    The ProteinLMDataset is divided into self-supervised and supervised components. The self-supervised dataset includes Chinese-English scientific texts, protein sequence-English text pairs from PubMed and UniProtKB, and extensive entries from the PMC database, providing over 10 billion tokens. The supervised fine-tuning component consists of 893,000 instructions across seven segments, such as enzyme functionality and disease involvement, mainly sourced from UniProtKB. ProteinLMBench, the evaluation benchmark, contains 944 meticulously curated multiple-choice questions on protein properties and sequences. The collection pipeline applies filtering and tokenization to ensure comprehensive, high-quality representation for effective training and evaluation of LLMs in protein science.
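
    As an illustration of the kind of filtering such a pipeline implies, the sketch below drops sequence-text pairs with non-standard residues or overly short descriptions. The thresholds, field names, and example pairs are assumptions for the sketch, not the paper's actual pipeline.

    # Illustrative filtering pass over protein sequence / text pairs of the
    # kind the self-supervised split aggregates from UniProtKB and PubMed.
    VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids

    pairs = [
        {"sequence": "MKTAYIAKQRQISFVKSHFSRQ",
         "text": "Putative kinase involved in signaling."},
        {"sequence": "MXXINVALID", "text": "?"},  # should be filtered out
    ]

    def keep(pair, min_text_len=20):
        # Keep only standard residues and reasonably descriptive text
        seq_ok = set(pair["sequence"]) <= VALID_RESIDUES
        text_ok = len(pair["text"]) >= min_text_len
        return seq_ok and text_ok

    clean = [p for p in pairs if keep(p)]
    print(len(clean), "pairs kept")  # 1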

    The ProteinLMDataset and ProteinLMBench are designed for comprehensive protein sequence understanding. The dataset is diverse, with entries ranging from 21 to over 2 million characters, collected from multiple sources including Chinese-English text pairs, PubMed abstracts, and UniProtKB. The self-supervised data primarily consists of protein sequences and scientific texts, while the supervised fine-tuning dataset covers seven segments such as enzyme functionality and disease involvement, with token lengths from 65 to 70,500. ProteinLMBench includes 944 balanced multiple-choice questions to evaluate model performance. Rigorous safety checks and filtering ensure data quality and integrity. Experimental results show that combining self-supervised learning with fine-tuning improves model accuracy, underscoring the dataset's efficacy.
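
    Scoring a model against such a multiple-choice benchmark reduces to comparing predicted options with the verified answers and reporting accuracy. Below is a minimal sketch under that assumption; ask_model is a stand-in for any LLM call, and the sample question is invented for illustration.

    # Minimal accuracy loop over ProteinLMBench-style multiple-choice items.
    def ask_model(question, options):
        # Placeholder: a real evaluation would query the fine-tuned LLM here
        # and parse its chosen option index from the response.
        return 0

    benchmark = [
        {"question": "Which of these is a standard amino-acid residue code?",
         "options": ["X", "B", "M", "Z"], "answer": 2},
    ]

    correct = sum(
        ask_model(item["question"], item["options"]) == item["answer"]
        for item in benchmark
    )
    print(f"accuracy: {correct / len(benchmark):.2%}")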

    In conclusion, the ProteinLMDataset and ProteinLMBench provide a robust framework for training and evaluating language models on protein sequences and bilingual texts. By drawing on diverse sources and including Chinese-English text pairs, the dataset strengthens multilingual and cross-lingual understanding of protein characteristics. Experiments demonstrate significant accuracy gains from fine-tuning, especially when both the self-supervised and supervised datasets are used. This work bridges the gap in adapting LLMs for protein science and shows their potential to transform biological research and applications. Notably, the InternLM2-7B model, when trained on this dataset, surpasses GPT-4 on protein comprehension tasks.


    Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.

    The post Unlocking the Language of Proteins: How Large Language Models Are Revolutionizing Protein Sequence Understanding appeared first on MarkTechPost.
