Large Language Models (LLMs), trained on vast amounts of data, have shown remarkable abilities in natural language generation and understanding. They are typically trained on general-purpose corpora comprising a diverse range of online text, such as Wikipedia and CommonCrawl. Although these universal models work well on a wide range of tasks, the distributional shift in vocabulary and context causes them to perform poorly in specialized domains.
In a recent study, a team of researchers from NASA and IBM collaborated to develop a model applicable across Earth science, astronomy, physics, astrophysics, heliophysics, planetary sciences, biology, and other multidisciplinary subjects. Existing models such as SCIBERT, BIOBERT, and SCHOLARBERT cover only some of these domains; no single model fully accounts for all of these related fields.
To bridge this gap, the team developed INDUS, a suite of encoder-based LLMs specialized for these particular sectors. Because INDUS is trained on carefully curated corpora from various sources, it is designed to cover the body of knowledge in these fields. The INDUS suite includes several types of models to address different needs, which are as follows.
Encoder Model: This model is trained on domain-specific vocabulary and corpora to excel in tasks related to natural language understanding.
Contrastive-Learning-Based General Text Embedding Model: This model uses a wide range of datasets from multiple sources to improve performance in information retrieval tasks.
Smaller Model Versions: These versions are created using knowledge distillation techniques, making them suitable for applications requiring lower latency or limited computational resources.
The team has also produced three new scientific benchmark datasets to advance research in these interdisciplinary domains.
CLIMATE-CHANGE NER: A climate change-related entity recognition dataset.
NASA-QA: A dataset devoted to NASA-related topics used for extractive question answering (see the sketch after this list).
NASA-IR: A dataset focusing on NASA-related content used for information retrieval tasks.
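For readers unfamiliar with the task format, extractive question answering means the answer must be a span copied from the given context. Here is a minimal sketch using the Hugging Face transformers pipeline; the checkpoint is a public SQuAD-tuned stand-in, and the question and context are invented for illustration, not drawn from NASA-QA.

```python
from transformers import pipeline

# Public SQuAD-tuned checkpoint as a stand-in for models evaluated on NASA-QA;
# this is not the INDUS model itself.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

result = qa(
    question="What instrument does the mission carry?",
    context="The hypothetical mission carries a visible-light coronagraph "
            "to image the solar corona.",
)
print(result["answer"])  # the answer is a span copied verbatim from the context
```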
The team has summarized their primary contributions as follows.
The byte-pair encoding (BPE) technique has been used to create INDUSBPE, a specialized tokenizer. Because it was built from a carefully selected scientific corpus, this tokenizer can handle the specialized terms and language used in fields like Earth science, biology, physics, heliophysics, planetary sciences, and astrophysics. The INDUSBPE tokenizer improves the model’s comprehension and handling of domain-specific language.
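To illustrate the idea, here is a minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library. The corpus file name, vocabulary size, and special tokens below are placeholders, not details confirmed by the paper.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build an empty BPE tokenizer and train it on a scientific corpus.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=50000,  # placeholder; the paper's actual vocabulary size may differ
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)

# "science_corpus.txt" is a stand-in for the curated multi-domain corpus.
tokenizer.train(files=["science_corpus.txt"], trainer=trainer)
tokenizer.save("indusbpe-like.json")

# A domain term that a general-purpose vocabulary tends to over-fragment:
print(tokenizer.encode("heliophysics magnetosphere").tokens)
```

A tokenizer built this way keeps domain terms as fewer, more meaningful subword units than a general-purpose vocabulary would, which is the motivation behind INDUSBPE.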
Using the INDUSBPE tokenizer and the carefully curated scientific corpora, the team has pretrained a number of encoder-only LLMs. Sentence-embedding models have then been created by fine-tuning these pretrained encoders with a contrastive learning objective, which helps in learning universal sentence embeddings.
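The paper's exact training recipe is not reproduced here, but contrastive fine-tuning of this kind can be sketched with the sentence-transformers library. The base checkpoint and example pairs below are placeholders standing in for the INDUS encoder and its training data.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# "roberta-base" stands in for the pretrained INDUS encoder checkpoint.
model = SentenceTransformer("roberta-base")

# Toy (query, relevant passage) pairs; other pairs in the batch act as negatives.
train_examples = [
    InputExample(texts=["What drives the solar wind?",
                        "The solar wind is driven by the expansion of the hot corona."]),
    InputExample(texts=["Effects of permafrost thaw",
                        "Thawing permafrost releases stored carbon as CO2 and methane."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# An InfoNCE-style contrastive objective over in-batch negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```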
More efficient, smaller versions of these models have also been trained using knowledge-distillation techniques; they retain strong performance even in resource-constrained scenarios.
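As a rough illustration of knowledge distillation (the paper's specific distillation setup is not reproduced here), the classic soft-target loss in PyTorch trains a small student to match a large teacher's softened output distribution:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target KL loss: the student matches the teacher's softened outputs."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so gradients keep a comparable magnitude across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Toy check with random logits: batch of 4, 10-way output.
teacher = torch.randn(4, 10)
student = torch.randn(4, 10)
print(distillation_loss(student, teacher).item())
```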
Three new scientific benchmark datasets have been released to help accelerate research in these interdisciplinary fields. These include NASA-QA, an extractive question-answering task based on NASA-related themes; CLIMATE-CHANGE NER, an entity recognition task focused on entities connected to climate change; and NASA-IR, a dataset intended for information retrieval within NASA-related content. These datasets offer rigorous benchmarks for assessing model performance in these particular fields.
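Similarly, the embedding-based retrieval setting that NASA-IR targets can be sketched as a nearest-neighbor search over document embeddings. The model and documents below are illustrative placeholders, not the actual benchmark data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # public stand-in, not INDUS

docs = [
    "MODIS provides daily global imagery for land and ocean monitoring.",
    "The heliosphere is the bubble carved out by the solar wind.",
]
query = "Which satellite instrument images the whole Earth every day?"

doc_emb = model.encode(docs, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_emb)[0]   # cosine similarity to each document
print(docs[int(scores.argmax())])              # best-matching document
```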
The experimental findings show that these models perform well on both the newly created benchmark tasks and existing domain-specific benchmarks, outperforming domain-specific encoders like SCIBERT and general-purpose models like RoBERTa.
In conclusion, INDUS is a significant advance in the field of Artificial Intelligence, giving professionals and researchers in various scientific domains a strong tool that improves their capacity to carry out accurate and effective natural language processing tasks.
Check out the Paper and Blog. All credit for this research goes to the researchers of this project.