Since the release of BERT in 2018, encoder-only transformer models have been widely used in natural language processing (NLP) applications due to their efficiency in retrieval and classification tasks. However, these models face notable limitations in contemporary applications. Their sequence length, capped at 512 tokens, hampers their ability to handle long-context tasks effectively. Furthermore, their architecture, vocabulary, and computational efficiency have not kept pace with advancements in hardware and training methodologies. These shortcomings become especially apparent in retrieval-augmented generation (RAG) pipelines, where encoder-based models provide context for large language models (LLMs). Despite their critical role, these models often rely on outdated designs, limiting their capacity to meet evolving demands.
A team of researchers from LightOn, Answer.ai, Johns Hopkins University, NVIDIA, and Hugging Face has sought to address these challenges with the introduction of ModernBERT, an open family of encoder-only models. ModernBERT brings several architectural enhancements, extending the context length to 8,192 tokens, a sixteen-fold increase over the original BERT's 512. This increase enables it to perform well on long-context tasks. The integration of Flash Attention 2 and rotary positional embeddings (RoPE) enhances computational efficiency and positional understanding. Trained on 2 trillion tokens from diverse domains, including code, ModernBERT demonstrates improved performance across multiple tasks. It is available in two configurations: base (139M parameters) and large (395M parameters), offering options tailored to different needs while consistently outperforming models like RoBERTa and DeBERTa.
Technical Details and Benefits
ModernBERT incorporates several advancements in transformer design. Flash Attention enhances memory and computational efficiency, while alternating global-local attention mechanisms optimize long-context processing. RoPE embeddings improve positional understanding, ensuring effective performance across varied sequence lengths. The model also employs GeGLU activation functions and a deep, narrow architecture for a balanced trade-off between efficiency and capability. Stability during training is further ensured through pre-normalization blocks and the use of the StableAdamW optimizer with a trapezoidal learning rate schedule. These refinements make ModernBERT not only faster but also more resource-efficient, particularly for inference tasks on common GPUs.
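To make the RoPE idea concrete, here is a minimal pure-Python sketch of rotary positional embeddings. This is an illustration of the technique, not ModernBERT's actual implementation (which applies the rotation to batched query/key tensors inside the attention layers); the vectors and positions below are made up for demonstration.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary positional embedding to a single vector.

    Each consecutive pair (vec[i], vec[i+1]) is rotated by an angle
    position * base**(-i/d). Because rotations compose, the dot product
    between a rotated query and a rotated key depends only on their
    relative positions, not on the absolute offset.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE expects an even-dimensional vector"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query/key vectors: the attention score for positions (3, 7)
# equals the score for (103, 107), since both pairs are 4 apart.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
s1 = dot(rope_rotate(q, 3), rope_rotate(k, 7))
s2 = dot(rope_rotate(q, 103), rope_rotate(k, 107))
print(abs(s1 - s2) < 1e-9)  # prints True
```

This relative-position property is what lets RoPE-based models generalize attention across varied sequence lengths without learned absolute position vectors.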
Results and Insights
ModernBERT demonstrates strong performance across benchmarks. On the General Language Understanding Evaluation (GLUE) benchmark, it surpasses existing base models, including DeBERTaV3. In retrieval tasks like Dense Passage Retrieval (DPR) and ColBERT multi-vector retrieval, it achieves higher nDCG@10 scores compared to its peers. The model’s capabilities in long-context tasks are evident in the MLDR benchmark, where it outperforms older models and specialized long-context models such as GTE-en-MLM and NomicBERT. ModernBERT also excels in code-related tasks, including CodeSearchNet and StackOverflow-QA, benefiting from its code-aware tokenizer and diverse training data. Additionally, it processes significantly larger batch sizes than its predecessors, making it suitable for large-scale applications while maintaining memory efficiency.
Conclusion
ModernBERT represents a thoughtful evolution of encoder-only transformer models, integrating modern architectural improvements with robust training methodologies. Its extended context length and enhanced efficiency address the limitations of earlier models, making it a versatile tool for a variety of NLP applications, including semantic search, classification, and code retrieval. By modernizing the foundational BERT architecture, ModernBERT meets the demands of contemporary NLP tasks. Released under the Apache 2.0 license and hosted on Hugging Face, it provides an accessible and efficient solution for researchers and practitioners seeking to advance the state of the art in NLP.
Check out the Paper, Blog, and Model on Hugging Face. All credit for this research goes to the researchers of this project.
The post LightOn and Answer.ai Release ModernBERT: A New Model Series that is a Pareto Improvement over BERT with both Speed and Accuracy appeared first on MarkTechPost.