
    Poro 34B: A 34B Parameter AI Model Trained for 1T Tokens of Finnish, English, and Programming languages, Including 8B Tokens of Finnish-English Translation Pairs

    April 5, 2024

State-of-the-art language models require vast amounts of text for pretraining, often on the order of trillions of words, which poses a challenge for smaller languages that lack such extensive resources. Leveraging multilingual data is a logical solution, but it is commonly viewed as problematic because of the “curse of multilingualism.” Despite research exploring the benefits and drawbacks of multilingual training, and efforts to improve models for smaller languages, most cutting-edge models are still trained primarily on large languages such as English. There is, however, real potential to improve models for smaller languages through multilingual training, which could mitigate the data scarcity issue.

Researchers from the TurkuNLP Group, University of Turku, Silo AI, University of Helsinki, and CSC – IT Center for Science have developed Poro 34B, a 34-billion-parameter model trained on 1 trillion tokens of Finnish, English, and programming languages. They demonstrate that this multilingual training approach substantially surpasses existing Finnish models, while the model also excels at translation and remains competitive on English and programming tasks. By leveraging insights such as limited multilingualism, matching scripts and language families, oversampling, and augmenting with programming-language data, they mitigate data limitations and produce a state-of-the-art generative model in Poro 34B.

For pretraining Poro 34B, the dataset underwent preprocessing to eliminate low-quality and duplicate texts and to filter out toxic content. Finnish data, sourced from web crawls, news, and Finnish literature, comprises a 32-billion-token corpus, upsampled for four epochs. English data, derived from SlimPajama and Project Gutenberg, amounts to 542 billion tokens, representing over half of the total training tokens. Programming language data, sourced from the Starcoder corpus, is oversampled to represent approximately one-third of the pretraining tokens. A cross-lingual signal is introduced using English-Finnish translation pairs from the Tatoeba challenge dataset, constituting under 1% of the pretraining tokens.
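
As a back-of-the-envelope reconstruction of that mixture, the figures quoted above imply roughly the following split; this is an illustrative estimate, not the authors' released data-loading configuration.

```python
# Rough reconstruction of the Poro 34B pretraining mixture.
# Numbers are taken from the article; the breakdown is an estimate,
# not the authors' actual sampling configuration.
TOTAL_TOKENS = 1_000_000_000_000                 # 1T pretraining tokens

finnish_unique = 32e9                            # 32B-token Finnish corpus
finnish_tokens = finnish_unique * 4              # upsampled for four epochs -> ~128B
english_tokens = 542e9                           # SlimPajama + Project Gutenberg
translation_tokens = 8e9                         # Tatoeba English-Finnish pairs (<1%)
code_tokens = TOTAL_TOKENS - (finnish_tokens + english_tokens + translation_tokens)

for name, tokens in [("Finnish", finnish_tokens), ("English", english_tokens),
                     ("Code (Starcoder)", code_tokens),
                     ("Translation pairs", translation_tokens)]:
    print(f"{name:>18}: {tokens / 1e9:6.0f}B tokens ({tokens / TOTAL_TOKENS:5.1%})")
```

The remainder assigned to code comes out at roughly 320B tokens, consistent with the stated "approximately one-third" share.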

For tokenization, a custom byte-level BPE tokenizer with a 128K-token vocabulary was developed for Poro 34B, aiming for low fertility (subword tokens per word) across Finnish, English, and code. Pretraining ran the decoder-only model to 1 trillion tokens, well past the estimated compute-optimal point, in favor of efficiency at inference time. Training used a sequence length of 2048 tokens, a cosine learning rate schedule, and a Megatron-DeepSpeed fork for AMD GPU compatibility. The configuration ran on 128 nodes with activation checkpointing and parallelism strategies. Energy use, estimated at 448 MWh, was assessed for environmental impact, taking into account LUMI's renewable energy source; only GPU power consumption was factored into the emissions calculation.
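
For a concrete sense of what "fertility" measures, the sketch below trains a byte-level BPE tokenizer and computes tokens per whitespace-delimited word. It uses the Hugging Face tokenizers library purely as an illustration; the paper's actual tooling, training files, and special tokens are not specified here, so those details are assumptions.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Minimal sketch: train a byte-level BPE tokenizer with a 128K vocabulary.
# File paths and special tokens are placeholders, not the authors' configuration.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=128_000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>"],
)
tokenizer.train(files=["finnish.txt", "english.txt", "code.txt"], trainer=trainer)

def fertility(tok: Tokenizer, text: str) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    words = text.split()
    n_tokens = len(tok.encode(text).ids)
    return n_tokens / max(len(words), 1)

# Lower fertility on Finnish means fewer tokens per word, i.e. cheaper sequences.
print(fertility(tokenizer, "Tämä on lyhyt suomenkielinen esimerkkiteksti."))
```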

Evaluation of Poro 34B across multiple dimensions shows strong performance. The model achieves low character-level perplexity on English, Finnish, and code, indicating effective learning across all three. On various benchmarks it performs well, particularly on Finnish tasks, surpassing previous monolingual Finnish models, while its English proficiency remains comparable to models trained predominantly on English. In open-ended Finnish text generation it produces notably coherent and grammatically correct output, and in translation it outperforms dedicated translation models and even Google Translate. These results underscore Poro 34B's versatility and effectiveness across diverse language tasks.
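
Character-level perplexity normalizes a model's total negative log-likelihood by the number of characters rather than tokens, which keeps scores comparable across tokenizers and across languages with very different fertility. The snippet below is a minimal sketch of that computation using the transformers library; the checkpoint id is an assumption, and details such as long-document chunking are omitted.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint id is an assumption; substitute the released Poro 34B checkpoint.
MODEL_ID = "LumiOpen/Poro-34B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def char_perplexity(text: str) -> float:
    """exp(total NLL in nats / number of characters) for a single short document."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].size(1) - 1    # labels are shifted internally
    total_nll = out.loss.item() * n_predicted     # loss is mean NLL per predicted token
    return math.exp(total_nll / len(text))

print(char_perplexity("Tämä on lyhyt suomenkielinen esimerkkiteksti."))
```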

In the study, the researchers addressed the challenge of training large generative models for smaller languages by developing Poro 34B, a 34B-parameter model trained on 1T tokens of Finnish, English, and code, including 8B tokens of Finnish-English translation pairs. A thorough evaluation showed significant advances over existing models for Finnish, competitive performance on English and code tasks, and remarkable translation results. The study also highlights the limitations of benchmarks derived from translated tasks. Future research aims to explore the effects of multilingual training systematically. Poro 34B's release is intended to serve as a template for creating larger models for other smaller languages, facilitating further research and development.

Check out the Paper and HF Page. All credit for this research goes to the researchers of this project. This article originally appeared on MarkTechPost.
