State-of-the-art language models require vast amounts of text data for pretraining, often on the order of trillions of words, which poses a challenge for smaller languages that lack resources on that scale. Leveraging multilingual data is a logical solution, but it is commonly viewed as problematic due to the “curse of multilingualism.” Despite research exploring the benefits and drawbacks of multilingual training and efforts to improve models for smaller languages, most cutting-edge models are still trained primarily on large languages such as English. Multilingual training nonetheless holds real potential to improve models for smaller languages and to mitigate the data scarcity issue.
Researchers from the TurkuNLP Group at the University of Turku, Silo AI, the University of Helsinki, and CSC – IT Center for Science have developed Poro 34B, a 34-billion-parameter model trained on 1 trillion tokens of Finnish, English, and programming languages. They demonstrate that a multilingual training approach substantially improves on existing Finnish models while excelling at translation and remaining competitive on English and programming tasks. By leveraging insights such as limited multilingualism, matching scripts and language families, oversampling, and augmenting with programming language data, they mitigate the data limitations of Finnish and produce a state-of-the-art generative model, Poro 34B.
For pretraining Poro 34B, the dataset was preprocessed to eliminate low-quality and duplicate texts and to filter out toxic content. The Finnish data, sourced from web crawls, news, and Finnish literature, comprises a 32-billion-token corpus that is upsampled for four epochs. The English data, drawn from SlimPajama and Project Gutenberg, amounts to 542 billion tokens, representing over half of the total training tokens. Programming language data, sourced from the Starcoder corpus, is oversampled to roughly one-third of the pretraining tokens. A cross-lingual signal is introduced through English-Finnish translation pairs from the Tatoeba Challenge dataset, constituting under 1% of the pretraining tokens.
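As a rough sanity check on this mixture, the short sketch below tallies the approximate token shares using the figures quoted above; the code share is inferred as the remainder of the 1-trillion-token budget, so treat the exact numbers as illustrative rather than the authors’ precise recipe.

```python
# Rough tally of the Poro 34B pretraining mixture, using the figures quoted
# in the text; the code share is inferred as the remainder (illustrative only).
TOTAL_TOKENS = 1_000e9    # 1 trillion pretraining tokens

finnish = 32e9 * 4        # 32B-token Finnish corpus upsampled for four epochs
english = 542e9           # SlimPajama + Project Gutenberg
translation = 8e9         # Tatoeba English-Finnish pairs (<1% of tokens)
code = TOTAL_TOKENS - (finnish + english + translation)  # ~1/3, per the text

for name, tokens in [("Finnish", finnish), ("English", english),
                     ("code", code), ("translation pairs", translation)]:
    print(f"{name:>17}: {tokens / 1e9:5.0f}B tokens  ({tokens / TOTAL_TOKENS:5.1%})")
```

Running this gives roughly 128B Finnish (13%), 542B English (54%), 322B code (32%), and 8B translation tokens (under 1%), consistent with the proportions described in the paper.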
For tokenization, a custom byte-level BPE tokenizer with a 128K-token vocabulary was developed for Poro 34B, aiming for low fertility (tokens emitted per word) across Finnish, English, and code. Pretraining took the decoder-only model to 1 trillion tokens, well beyond the estimated compute-optimal amount. Training used a sequence length of 2048 tokens, a cosine learning rate schedule, and a Megatron-DeepSpeed fork adapted for AMD GPU compatibility, with a configuration of 128 nodes, activation checkpointing, and combined parallelism strategies. The compute cost, estimated at 448 MWh, was assessed for environmental impact in light of LUMI’s renewable energy supply; only GPU power consumption was factored into the emissions calculation.
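To make the tokenizer choice concrete, here is a minimal sketch of training a byte-level BPE tokenizer with a 128K vocabulary and measuring its fertility, using the Hugging Face tokenizers library. This mirrors the general approach described above rather than the authors’ exact tooling, and the corpus file names are placeholders.

```python
# Sketch: train a byte-level BPE tokenizer and measure its fertility
# (average tokens per whitespace-delimited word; lower is better).
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["finnish_sample.txt", "english_sample.txt", "code_sample.txt"],  # placeholder corpora
    vocab_size=128_000,  # the article cites a 128K-token vocabulary
)

def fertility(texts: list[str]) -> float:
    """Tokens emitted per whitespace-delimited word."""
    n_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

print("Finnish fertility:", fertility(["Hyvää huomenta, kuinka voit tänään?"]))
print("English fertility:", fertility(["Good morning, how are you today?"]))
```

Lower fertility means fewer tokens per word, which matters especially for morphologically rich languages like Finnish, where a poorly fitted vocabulary fragments words into many pieces.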
Evaluation of Poro 34B across multiple dimensions shows strong performance. The model achieves low character-level perplexity on English, Finnish, and code, indicating effective learning across all three. Across benchmarks it performs well, particularly on Finnish tasks, where it surpasses previous monolingual models, while its English proficiency remains comparable to models trained predominantly on English. In open-ended Finnish text generation it exhibits good coherence and grammatical correctness, and in translation it outperforms dedicated translation models and even Google Translate. These results underscore Poro 34B’s versatility and effectiveness across diverse language tasks.
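For readers who want to probe the model themselves, the following sketch computes a character-level perplexity on a text sample with the transformers library, defined here as exp(total token negative log-likelihood / number of characters). The Hugging Face repo id and this exact metric definition are assumptions for illustration, not the paper’s evaluation protocol.

```python
# Sketch: character-level perplexity of a causal LM on a text sample.
# Repo id and metric definition are assumptions, not the paper's protocol.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "LumiOpen/Poro-34B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

def char_perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean NLL per predicted token; rescale to total NLL,
    # then normalize by character count instead of token count.
    total_nll = out.loss.item() * (enc["input_ids"].shape[1] - 1)
    return math.exp(total_nll / len(text))

print(char_perplexity("Suomen kieli on rikas ja monimuotoinen."))
```

Normalizing by characters rather than tokens makes the number comparable across models with different tokenizers, which is why character-level perplexity is a natural choice when comparing multilingual and monolingual models.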
In summary, the researchers address the challenge of training large generative models for smaller languages by developing Poro 34B, a 34B-parameter model trained on 1T tokens of Finnish, English, and code, including 8B tokens of Finnish-English translation pairs. Thorough evaluation shows significant advances over existing models for Finnish, competitive performance on English and code tasks, and remarkable translation results. The study also highlights the limitations of benchmarks derived from translated tasks. Future work aims to study the effects of multilingual training more systematically, and the release of Poro 34B is intended to serve as a template for creating models for other smaller languages, facilitating further research and development.
Check out the Paper and HF Page. All credit for this research goes to the researchers of this project.