This AI Paper from SambaNova Presents a Machine Learning Method to Adapt Pretrained LLMs to New Languages

The rapid advancement of large language models has ushered in a new era of natural language processing capabilities. However, a significant challenge persists: most of these models are primarily trained on a limited set of widely spoken languages, leaving a vast linguistic diversity unexplored. This limitation not only restricts the accessibility of cutting-edge language technologies but also perpetuates a technological divide across linguistic communities.

Researchers at SambaNova address this challenge with SambaLingo, a method for adapting existing, high-performing language models to new languages. The approach leverages the strengths of pre-trained models while tailoring them to the characteristics of the target language.

Previous efforts to address this issue have primarily focused on training monolithic multilingual or language-specific models from scratch. However, these approaches face significant hurdles, including the curse of multilinguality, data scarcity, and the substantial computational resources required. Adapting English-centric models to new languages has emerged as a promising alternative, demonstrating the potential to outperform language-specific models pre-trained from scratch.

The SambaLingo methodology begins with the selection of a suitable base model that has already exhibited exceptional performance in its initial language. In this study, the researchers chose the open-source Llama2 7B model, renowned for its English language capabilities, as their starting point.

To effectively capture the linguistic nuances of the target language, the researchers expanded the model's vocabulary by adding non-overlapping tokens from the target language and initializing them using sub-word embeddings from the original tokenizer. This crucial step ensures that the model can accurately tokenize and represent the new language, paving the way for seamless adaptation.
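
The vocabulary expansion step can be illustrated with a minimal sketch. It assumes each new token's embedding is the average of the original embeddings of the sub-word pieces the original tokenizer splits it into; the article only says sub-word embeddings are used, so the averaging rule is an assumption. The helper names and the character-level toy tokenizer are hypothetical and are not taken from the SambaLingo code.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def init_new_token_embedding(
    new_token: str,
    old_tokenize: Callable[[str], List[int]],
    old_embeddings: Sequence[Sequence[float]],
) -> List[float]:
    """Average the original embeddings of the pieces that make up `new_token`.

    Averaging is an assumption made for this sketch.
    """
    piece_ids = old_tokenize(new_token)
    if not piece_ids:
        raise ValueError(f"original tokenizer produced no pieces for {new_token!r}")
    dim = len(old_embeddings[0])
    summed = [0.0] * dim
    for piece_id in piece_ids:
        for j, value in enumerate(old_embeddings[piece_id]):
            summed[j] += value
    return [s / len(piece_ids) for s in summed]


def expand_vocabulary(
    vocab: Dict[str, int],
    embeddings: Sequence[Sequence[float]],
    candidate_tokens: Sequence[str],
    old_tokenize: Callable[[str], List[int]],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Append only the tokens that are not already in the vocabulary."""
    new_vocab = dict(vocab)
    new_embeddings = [list(row) for row in embeddings]
    for token in candidate_tokens:
        if token in new_vocab:  # skip overlapping tokens
            continue
        new_embeddings.append(
            init_new_token_embedding(token, old_tokenize, embeddings)
        )
        new_vocab[token] = len(new_embeddings) - 1
    return new_vocab, new_embeddings


# Toy example: a two-symbol vocabulary with a character-level "tokenizer".
toy_vocab = {"a": 0, "b": 1}
toy_embeddings = [[1.0, 0.0], [0.0, 1.0]]


def char_tokenize(text: str) -> List[int]:
    return [toy_vocab[ch] for ch in text]


vocab, emb = expand_vocabulary(toy_vocab, toy_embeddings, ["ab", "a"], char_tokenize)
```

In the toy run, "a" is skipped because it already exists, while "ab" is appended with an embedding halfway between those of "a" and "b". Starting new tokens from existing representations rather than random values is the intuition the paragraph above describes.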

Next, the researchers employed a continual pre-training approach, feeding the model a carefully curated mixture of English and target language web data sourced from CulturaX. The data mixture followed a 1:3 ratio, biased towards the target language, to strike a delicate balance between preserving the model's existing knowledge and adapting it to the new linguistic landscape.
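
As a rough illustration of the 1:3 mixture, the sketch below draws each training document from the English pool with probability 1/4 and from the target-language pool with probability 3/4. Sampling per document (rather than per token) and the function name `mix_documents` are assumptions made for this example; the article does not describe how the ratio was enforced.

```python
import random
from typing import List, Sequence


def mix_documents(
    english: Sequence[str],
    target: Sequence[str],
    n_samples: int,
    english_parts: int = 1,
    target_parts: int = 3,
    seed: int = 0,
) -> List[str]:
    """Sample documents so English and target data appear in a 1:3 ratio on average."""
    rng = random.Random(seed)
    p_english = english_parts / (english_parts + target_parts)
    mixed = []
    for _ in range(n_samples):
        source = english if rng.random() < p_english else target
        mixed.append(rng.choice(source))
    return mixed


# Placeholder documents; real data would come from a corpus such as CulturaX.
english_docs = ["en-0", "en-1"]
target_docs = ["tgt-0", "tgt-1", "tgt-2"]
sample = mix_documents(english_docs, target_docs, n_samples=10_000)
english_share = sum(doc.startswith("en-") for doc in sample) / len(sample)
```

With enough samples, `english_share` settles near 0.25, which is what a 1:3 English-to-target ratio implies.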

To further enhance the model's alignment with human preferences, the researchers implemented a two-stage process: supervised fine-tuning (SFT) followed by direct preference optimization (DPO). During SFT, they used the ultrachat-200k dataset and its machine-translated version. For DPO, they used the ultrafeedback and cai-conversation-harmless datasets, blended at a 10:1 ratio of English to machine-translated data.
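
For readers unfamiliar with DPO, the following sketch computes the standard DPO loss for a single preference pair from sequence log-probabilities under the policy being trained and a frozen reference model. The value `beta=0.1` is an illustrative assumption, not a setting reported in the article, and the function is a plain-Python illustration rather than the training code used for SambaLingo.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value for illustration
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * margin)), where the margin is how much more the
    policy favours the chosen response than the reference model does.
    """
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); branch for numerical stability.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# A policy identical to the reference gives a loss of log(2).
baseline = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# Shifting probability toward the chosen response lowers the loss.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

The loss falls as the policy assigns relatively more probability to the preferred response than the reference model does, which is how preference data steers the model without a separate reward model.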

The researchers rigorously evaluated the SambaLingo models across various tasks and languages, including language modeling, translation, text classification, open-book and closed-book question answering, and various natural language understanding benchmarks as shown in Table 1. The models were tested on nine typologically diverse languages: Arabic, Thai, Turkish, Japanese, Hungarian, Russian, Bulgarian, Serbian, and Slovenian.

Across multiple benchmarks, the SambaLingo models consistently outperformed existing state-of-the-art models in these languages. For instance, on the perplexity benchmark, which measures language modeling performance, the SambaLingo models achieved lower perplexity scores than all existing baselines on a held-out set from their training data (Figure 1). Furthermore, when scaled to the larger Llama2 70B parameter scale, the SambaLingo models exhibited even better performance, surpassing their 7B counterparts across multiple benchmarks, despite being trained on fewer tokens.
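
Perplexity, the metric mentioned above, is the exponential of the average negative log-likelihood per token, so lower values mean the model finds the held-out text less surprising. The sketch below shows the arithmetic on made-up log-probabilities; the numbers are illustrative only and are not results from the paper. Note that perplexities computed with different tokenizers are not directly comparable without normalization.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log probabilities assigned to each token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity 4:
# it is as uncertain as a uniform choice among four options.
uniform_quarter = perplexity([math.log(0.25)] * 4)
```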

To validate the quality of the model's outputs and their alignment with human preferences, the researchers employed GPT-4 as an impartial judge, evaluating the model's responses to real user prompts. The results were promising, with SambaLingo consistently outperforming other models in the same languages, as judged by GPT-4's preferences and logical explanations.

In summary, the SambaLingo methodology represents a significant stride towards democratizing artificial intelligence across linguistic diversity. By leveraging the strengths of existing high-performing models and tailoring them to new linguistic landscapes, this approach offers a scalable and efficient solution to the challenge of language barriers. With its state-of-the-art performance and alignment with human preferences, SambaLingo paves the way for a future where the benefits of AI transcend linguistic boundaries, fostering inclusivity and accessibility for all.

Check out the Paper. All credit for this research goes to the researchers of this project.


The post This AI Paper from SambaNova Presents a Machine Learning Method to Adapt Pretrained LLMs to New Languages appeared first on MarkTechPost.
