Natural Language Processing (NLP) has advanced significantly with deep learning, driven by innovations such as word embeddings and transformer architectures. Self-supervised learning, which turns vast amounts of unlabeled data into pretraining tasks, has become the dominant approach for training language models, especially for high-resource languages like English and Chinese. This has widened the gap between those high-resource languages and lower-resource ones such as Portuguese, to say nothing of the more than 7,000 languages spoken worldwide. The gap limits how robust and accessible NLP applications for low-resource languages can become. In addition, low-resource monolingual models tend to remain small-scale, poorly documented, and without standard benchmarks, which makes both development and evaluation difficult.
Current development methods rely on the vast data and computational resources readily available for high-resource languages such as English and Chinese. Portuguese NLP therefore leans on multilingual models like mBERT, mT5, and BLOOM, or on fine-tuning English-trained models, and these approaches often miss aspects unique to Portuguese. Existing evaluation benchmarks are either outdated or derived from English datasets, which limits their usefulness for Portuguese.
To address this gap, researchers from the University of Bonn have developed GigaVerbo, a large-scale Portuguese text corpus of 200 billion tokens, and used it to train Tucano, a series of natively pre-trained decoder-transformers. The models aim to raise the performance of Portuguese language models by leveraging a substantial, high-quality dataset.
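Because the corpus is distributed through the Hugging Face Hub, it can be streamed rather than downloaded in full. A minimal sketch, assuming the dataset is published under a repository ID such as `TucanoBR/GigaVerbo` and exposes a `text` field (check the project's Hugging Face page for the exact names):

```python
# Stream the GigaVerbo corpus from the Hugging Face Hub without
# materializing all 200B tokens on disk (repo ID and field name assumed).
from datasets import load_dataset

gigaverbo = load_dataset("TucanoBR/GigaVerbo", split="train", streaming=True)

# Inspect the first few documents from the stream.
for i, example in enumerate(gigaverbo):
    print(example["text"][:200])  # "text" field assumed; verify on the Hub
    if i == 2:
        break
```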
GigaVerbo is a concatenation of multiple high-quality Portuguese text corpora, refined with custom filters guided by GPT-4 evaluations; the filtering step improved text quality while retaining about 70% of the original data. The Tucano models are based on the Llama architecture and implemented with Hugging Face libraries for easy community access. They use rotary position embeddings (RoPE), root mean square normalization, and SiLU activations in place of SwiGLU, and were trained with a causal language modeling objective and cross-entropy loss. The models range from 160M to 2.4B parameters, with the largest trained on 515 billion tokens.
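Since the checkpoints follow a standard Llama-style causal-LM layout and are hosted on the Hugging Face Hub, they can be loaded with the Transformers library. A minimal sketch, assuming a repository ID such as `TucanoBR/Tucano-2b4` for the 2.4B-parameter model (the exact names are listed on the project's Hugging Face page); passing `labels=input_ids` reproduces the shifted cross-entropy objective used in causal language modeling:

```python
# Load a Tucano checkpoint, compute the causal-LM cross-entropy loss on a
# short Portuguese prompt, and generate a continuation (repo ID assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TucanoBR/Tucano-2b4"  # assumed ID for the largest model
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("A capital do Brasil é", return_tensors="pt")

# Transformers shifts the labels internally, so labels=input_ids yields
# the next-token cross-entropy loss used during pretraining.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])
print(f"cross-entropy loss: {outputs.loss.item():.3f}")

# Greedy continuation of the prompt.
generated = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```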
Evaluation shows that these models perform on par with or better than other Portuguese and multilingual language models of similar size on several Portuguese benchmarks. Training loss and validation perplexity curves for the four base models show that larger models generally reduce loss and perplexity more effectively, an effect amplified by larger batch sizes. Checkpoints were saved every 10.5 billion tokens, and performance was tracked across several benchmarks. Pearson correlation coefficients indicate mixed results: some benchmarks, such as CALAME-PT, LAMBADA, and HellaSwag, improve with scale, while others, such as the OAB Exams, show no correlation with token ingestion. Inverse scaling was observed in the sub-billion-parameter models, suggesting potential limitations. Overall, Tucano outperforms multilingual and earlier Portuguese models on native evaluations such as CALAME-PT and on machine-translated tests such as LAMBADA.
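The correlation analysis itself is straightforward to reproduce once per-checkpoint scores are available. A hedged sketch with illustrative placeholder numbers (not results from the paper), showing how a Pearson coefficient relates tokens ingested to a benchmark score:

```python
# Pearson correlation between training tokens ingested and a benchmark score.
# The arrays below are illustrative placeholders, not values from the paper.
import numpy as np

tokens_seen = np.array([10.5, 21.0, 31.5, 42.0, 52.5])     # billions of tokens per checkpoint
benchmark_score = np.array([0.41, 0.45, 0.48, 0.50, 0.52])  # hypothetical accuracies

r = np.corrcoef(tokens_seen, benchmark_score)[0, 1]
print(f"Pearson r between tokens ingested and score: {r:.2f}")
# Values near +1 correspond to benchmarks that improve with scale (e.g., CALAME-PT),
# while values near 0 match benchmarks like the OAB Exams in the paper's analysis.
```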
In conclusion, GigaVerbo and the Tucano series advance the performance of Portuguese language models. The work covers the full development pipeline, from dataset creation and filtering to hyperparameter tuning and evaluation, with a focus on openness and reproducibility, and it demonstrates how large-scale data collection and careful training can improve low-resource language models. The released resources should prove valuable in guiding future studies.
Check out the Paper and Hugging Face Page. All credit for this research goes to the researchers of this project.