Transformer-based neural networks have shown a remarkable ability to handle multiple tasks such as text generation, editing, and question answering. In many cases, models with more parameters achieve better perplexity and higher accuracy on downstream tasks, which is the main driver behind the industry push toward ever-larger models. However, bigger is not always better: the 2B-parameter MiniCPM, for example, exhibits capabilities comparable to much larger language models such as Llama2-7B, Mistral-7B, Gemma-7B, and Llama-13B. Moreover, the amount of high-quality training data available may not keep pace as the computational resources for training larger models increase.
Existing lines of work that address these questions include scaling laws, energy-based models, and Hopfield models. Scaling laws describe how model performance improves as model size and the volume of training data are scaled up. Energy-based models have become a fundamental modeling tool across many areas of machine learning over the past few decades; the core idea is to represent the distribution learned by a neural network through a parameterized probability density function defined in terms of a learnable energy function. Finally, classical Hopfield networks were developed as a canonical example of associative memory.
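As a general illustration of the ideas above (not code from the paper), here is a minimal NumPy sketch of a modern continuous Hopfield network: a learnable-energy view of associative memory whose retrieval update has the same functional form as an attention read. The function names and toy data are our own.

```python
import numpy as np

def hopfield_energy(query, memories, beta=1.0):
    # Energy of a modern (continuous) Hopfield network: lower energy means
    # the query sits closer to one of the stored patterns (columns of `memories`).
    scores = beta * memories.T @ query
    m = scores.max()
    logsumexp = m + np.log(np.sum(np.exp(scores - m)))
    return -logsumexp / beta + 0.5 * query @ query

def hopfield_retrieve(query, memories, beta=1.0, steps=3):
    # Each update is a softmax-weighted average of the stored patterns -- the
    # same functional form as a single attention read. Iterating the update
    # decreases the energy and converges toward the pattern nearest the query.
    q = query
    for _ in range(steps):
        scores = beta * memories.T @ q
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        q = memories @ weights
    return q

# Toy usage: store three random 16-dimensional patterns, then recover one of
# them from a noisy cue.
rng = np.random.default_rng(0)
memories = rng.normal(size=(16, 3))
cue = memories[:, 1] + 0.3 * rng.normal(size=16)
recovered = hopfield_retrieve(cue, memories, beta=4.0)
print("energy before:", hopfield_energy(cue, memories, beta=4.0))
print("energy after: ", hopfield_energy(recovered, memories, beta=4.0))
print("retrieval error:", np.linalg.norm(recovered - memories[:, 1]))
```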
Researchers from the Central Research Institute, 2012 Laboratories, Huawei Technologies Co., Ltd. introduced a theoretical framework focused on the memorization process and performance dynamics of transformer-based language models (LMs). They carried out a series of experiments with GPT-2 across different data sizes to study when signs of saturation appear, and additionally trained vanilla Transformer models on a dataset of about 2M tokens. The results of these experiments validated the theoretical analysis, offering insights on the optimal cross-entropy loss that can guide and improve decision-making in model training.
A 12-layer transformer LM is trained with the GPT-2 small tokenizer and architecture on the OpenWebText dataset, an open reproduction of the WebText corpus used to train the original GPT-2, containing roughly 9B tokens from 8,013,769 documents. Three models are trained on different amounts of data: the full dataset and subsets consisting of the first 1% (about 90M tokens) and the first 0.1% (about 9M tokens) of OpenWebText. In addition, vanilla Transformer models are trained on a small amount of high-quality data: pairs of English sentences in declarative form, generated in a context-free setting with a vocabulary of 68 words, where the task is to convert each declarative sentence into a question.
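For concreteness, here is a minimal sketch of how such a setup could be put together, assuming the Hugging Face transformers and datasets libraries; this is not the authors' released code, and the exact subsetting and preprocessing are assumptions.

```python
# Minimal sketch of a GPT-2-small-style setup on OpenWebText subsets
# (an assumption about tooling; the paper's own training code is not shown here).
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # GPT-2 small tokenizer

config = GPT2Config(
    vocab_size=tokenizer.vocab_size,  # 50257
    n_layer=12,                       # 12-layer transformer LM
    n_head=12,
    n_embd=768,
)
model = GPT2LMHeadModel(config)       # randomly initialized, trained from scratch

# OpenWebText, with the first 1% / 0.1% of documents standing in for the
# ~90M / ~9M token subsets described above.
full = load_dataset("openwebtext", split="train")
subset_1pct = full.select(range(len(full) // 100))
subset_01pct = full.select(range(len(full) // 1000))
```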
Training on 0.1% (9M tokens) of the OpenWebText data shows overfitting, with the training loss vanishing over iterations. This happens because the training samples are not well separated, due to which the model energy degenerates into a sum of delta functions. When the model size is on the order of O(D²) and the model is trained on 90M tokens, it achieves training and validation losses similar to the setting with 9B tokens. Two vanilla Transformers with 6 and 10 layers are trained with a batch size of 8, and their training losses stabilize at a value of around 1, as predicted by the paper's proposition.
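To make the O(D²) statement concrete, here is a quick back-of-the-envelope parameter count. This is our own arithmetic, not the paper's, and it assumes D denotes the model (embedding) dimension of GPT-2 small (D = 768).

```python
# Rough parameter count for a GPT-2-small-sized model, assuming D is the
# embedding dimension. Per layer: ~4*D^2 weights for the Q/K/V/output
# projections and ~8*D^2 for the two MLP matrices (D x 4D and 4D x D).
D, n_layers, vocab = 768, 12, 50257

per_layer = 4 * D * D + 8 * D * D        # ~= 12 * D^2 ~= 7.1M weights
body = n_layers * per_layer              # ~= 85M weights in the transformer blocks
embeddings = vocab * D                   # ~= 38.6M weights in the token embeddings

print(f"per layer: {per_layer/1e6:.1f}M  blocks: {body/1e6:.1f}M  embeddings: {embeddings/1e6:.1f}M")
```

Under this accounting, the block parameters are dominated by terms quadratic in D, which is the sense in which a GPT-2-small-sized model is of order O(D²).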
In conclusion, the researchers presented a theoretical framework focused on the memorization process and performance dynamics of transformer-based language models (LMs). The paper models transformer-based networks using associative memory and characterizes the cross-entropy loss with respect to model and data sizes. Experiments are carried out by (a) training GPT-2 on different data sizes and (b) training vanilla Transformer models on a dataset of 2M tokens. Finally, a global energy function is constructed for the layered structure of the transformer models using the majorization-minimization technique.
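The paper's layer-wise construction is not reproduced here, but the majorization-minimization (MM) idea it invokes takes the following generic form: at each step a surrogate that upper-bounds the objective and touches it at the current iterate is minimized, which guarantees monotone descent.

```latex
% Generic MM step: g majorizes the objective f at the current iterate x_t.
g(x \mid x_t) \ge f(x) \quad \forall x, \qquad g(x_t \mid x_t) = f(x_t),
\qquad x_{t+1} = \arg\min_{x} \, g(x \mid x_t),
% so that f(x_{t+1}) \le g(x_{t+1} \mid x_t) \le g(x_t \mid x_t) = f(x_t).
```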
Check out the Paper. All credit for this research goes to the researchers of this project.