
    NASA and IBM Researchers Introduce INDUS: A Suite of Domain-Specific Large Language Models (LLMs) for Advanced Scientific Research

    July 4, 2024

Large Language Models (LLMs), trained on vast amounts of data, have shown remarkable abilities in natural language generation and understanding. They are typically trained on general-purpose corpora comprising a diverse range of online text, such as Wikipedia and CommonCrawl. Although these universal models work well on a wide range of tasks, a distributional shift in vocabulary and context causes them to perform poorly in specialized domains.

In a recent study, a team of researchers from NASA and IBM collaborated to develop a model that could be applied to Earth sciences, astronomy, physics, astrophysics, heliophysics, planetary sciences, and biology, among other multidisciplinary subjects. Existing models such as SCIBERT, BIOBERT, and SCHOLARBERT each cover only some of these domains; no single model takes all of these related fields into account.

To bridge this gap, the team has developed INDUS, a set of encoder-based LLMs specialized in these particular sectors. Because INDUS is trained on carefully selected corpora from various sources, it covers the body of knowledge in these fields. The INDUS suite includes several types of models to address different needs:

    Encoder Model: This model is trained on domain-specific vocabulary and corpora to excel in tasks related to natural language understanding.

    Contrastive-Learning-Based General Text Embedding Model: This model uses a wide range of datasets from multiple sources to improve performance in information retrieval tasks.

    Smaller Model Versions: These versions are created using knowledge distillation techniques, making them suitable for applications requiring lower latency or limited computational resources.
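Encoder models of this kind are typically pretrained with a masked-language-modeling objective: a fraction of tokens is hidden, and the model must reconstruct them from context. The following is a minimal pure-Python sketch of that corruption step, not INDUS code; the example tokens are illustrative.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Randomly replace a fraction of tokens with [MASK], returning the
    corrupted sequence and the (position -> original token) targets the
    encoder must reconstruct -- the standard MLM pretraining objective."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok
            corrupted[i] = mask_token
    return corrupted, targets

original = ["solar", "wind", "drives", "magnetospheric", "storms"]
corrupted, targets = mask_tokens(original, mask_rate=0.4)
```

During pretraining, the encoder is optimized to predict each entry of `targets` from the surrounding unmasked context.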

The team has also produced three new scientific benchmark datasets to advance research in these interdisciplinary domains.

    CLIMATE-CHANGE NER: A climate change-related entity recognition dataset.

    NASA-QA: A dataset devoted to NASA-related topics used for extractive question answering.

    NASA-IR: A dataset focusing on NASA-related content used for information retrieval tasks.
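Named-entity-recognition benchmarks like CLIMATE-CHANGE NER are commonly scored with entity-level precision, recall, and F1 over predicted spans. The sketch below shows that standard metric on hypothetical data; it is an illustration of the usual evaluation convention, not the paper's evaluation code.

```python
def entity_f1(gold, pred):
    """Entity-level precision/recall/F1, where each entity is a
    (start, end, label) span tuple -- the usual NER benchmark metric."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # exact span + label matches
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Hypothetical annotations: one of two predicted entities is correct.
gold = [(0, 2, "GHG"), (5, 6, "ORG")]
pred = [(0, 2, "GHG"), (7, 8, "LOC")]
prec, rec, f1 = entity_f1(gold, pred)
```

Exact-match span scoring is strict: a prediction with the right label but a slightly wrong boundary counts as both a false positive and a false negative.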

    The team has summarized their primary contributions as follows.

    The byte-pair encoding (BPE) technique has been used to create INDUSBPE, a specialized tokenizer. Because it was built from a carefully selected scientific corpus, this tokenizer can handle the specialized terms and language used in fields like Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics. The INDUSBPE tokenizer improves the model’s comprehension and handling of domain-specific language.
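The core of BPE training is simple: repeatedly find the most frequent adjacent symbol pair in the corpus and fuse it into a new token, so frequent domain terms end up as single units. A minimal pure-Python sketch of that merge loop follows (the tiny heliophysics-flavored corpus is illustrative; the real INDUSBPE tokenizer is trained on a large scientific corpus).

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Learn byte-pair-encoding merges: repeatedly fuse the most frequent
    adjacent symbol pair, so frequent substrings become single tokens."""
    # Each word starts as a tuple of characters; counts weight the pairs.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

merges = bpe_merges(["heliophysics", "helium", "heliosphere"], 4)
```

On this toy corpus the shared prefix "helio"/"heli" is merged first, which is exactly why a tokenizer trained on scientific text keeps terms like "heliophysics" from shattering into many generic subwords.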

    Using the INDUSBPE tokenizer and the carefully selected scientific corpora, the team has pretrained a number of encoder-only LLMs. Sentence-embedding models have been created by fine-tuning these pretrained models with a contrastive learning objective, which helps in learning universal sentence embeddings. 
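A common form of the contrastive objective used for sentence embeddings is the in-batch InfoNCE loss: each anchor sentence should score its own positive higher than every other positive in the batch. The sketch below implements that loss in plain Python on toy 2-D "embeddings"; it illustrates the objective family, not the paper's exact training setup.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.1):
    """Contrastive (InfoNCE) loss: for each anchor, treat its own positive
    as the correct class among all positives in the batch."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        losses.append(log_denom - logits[i])   # -log softmax at index i
    return sum(losses) / len(losses)

# Toy embeddings: matched pairs point the same way, mismatched do not.
anchors   = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
loss = info_nce(anchors, positives)
```

Minimizing this loss pulls each sentence toward its paired sentence and pushes it away from the rest of the batch, which is what yields embeddings useful for retrieval.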

More efficient, smaller versions of these models have also been trained using knowledge-distillation techniques, which preserve strong performance even in resource-constrained scenarios.
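In classic knowledge distillation, the small student is trained to match the teacher's temperature-softened output distribution, usually via a KL-divergence term. A minimal sketch of that soft-label loss in plain Python (the logits are illustrative, not taken from INDUS):

```python
import math

def softmax(logits, temperature=1.0):
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions --
    the soft-label term of classic knowledge distillation."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return sum(p * math.log(p / q) for p, q in zip(t, s))

teacher      = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # close to the teacher's distribution
bad_student  = [0.5, 1.0, 4.0]   # disagrees with the teacher
```

The temperature softens both distributions so the student also learns the teacher's relative preferences among incorrect classes, not just its top prediction.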

Three new scientific benchmark datasets have been released to help expedite research in interdisciplinary disciplines. These include NASA-QA, an extractive question-answering task based on NASA-related themes; CLIMATE-CHANGE NER, an entity recognition task focused on entities connected to climate change; and NASA-IR, a dataset intended for information retrieval tasks within NASA-related content. These datasets offer rigorous standards for assessing model performance in these particular fields.

The experimental findings have shown that these models perform well on both the newly created benchmark tasks and existing domain-specific benchmarks, outperforming domain-specific encoders such as SCIBERT and general-purpose models such as RoBERTa.

In conclusion, INDUS is a significant advance in the field of Artificial Intelligence, giving professionals and researchers in various scientific domains a strong tool that improves their capacity to carry out accurate and effective Natural Language Processing tasks.

Check out the Paper and Blog. All credit for this research goes to the researchers of this project.

    The post NASA and IBM Researchers Introduce INDUS: A Suite of Domain-Specific Large Language Models (LLMs) for Advanced Scientific Research appeared first on MarkTechPost.

