Strong results across tasks such as document generation and summarization, machine translation, and speech recognition have propelled the Transformer architecture to the forefront of Natural Language Processing (NLP). Large language models (LLMs) have since become the dominant approach, solving increasingly difficult tasks by scaling up the Transformer. That scaling comes at a price, however: the attention mechanism computes pairwise interactions between every token, so compute grows quadratically with sequence length. The resulting processing demands, inference costs, and energy consumption make these models hard to deploy in resource-constrained settings such as mobile devices and robotics.
A large body of work has focused on making the Transformer more efficient. Model pruning, quantization, and the design of more efficient attention mechanisms are just a few of the many approaches proposed. Among the most promising is simplifying the attention mechanism itself, reducing its quadratic complexity to a more tractable linear scale, as sketched below. However, most existing attention optimizations require extensive retraining, which is prohibitive for models with a huge number of parameters: the time and computational resources involved are substantial.
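To see why linear attention helps, note that replacing the softmax with a feature map lets matrix associativity reorder the computation. Here is a rough NumPy illustration; the feature map `phi` is a generic illustrative choice, not any specific paper's construction:

```python
import numpy as np

n, d = 4096, 64                        # sequence length, head dimension
Q, K, V = (np.random.randn(n, d) for _ in range(3))

# Quadratic ordering: materializing the n-by-n score matrix costs
# O(n^2 * d) (softmax omitted here for clarity).
scores = Q @ K.T                       # (n, n)
out_quadratic = scores @ V             # (n, d)

# Kernelized ordering: with a feature map phi in place of softmax,
# associativity lets us aggregate keys and values first: O(n * d^2).
phi = lambda x: np.maximum(x, 0.0) + 1e-6   # illustrative positive feature map
out_linear = phi(Q) @ (phi(K).T @ V)        # (n, d), no n-by-n matrix
```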
Researchers from Peking University and Huawei Noah's Ark Lab conducted a comprehensive review of existing linear attention techniques to tackle the problem of fast attention approximation in large language models. They identified Monte Carlo sampling as the major source of approximation error in these approaches.
The team introduces DiJiang, a Frequency Domain Kernelization method and a novel approach in Natural Language Processing. The method, a form of weighted Quasi-Monte Carlo sampling, uses the Discrete Cosine Transform (DCT) to map the Transformer's queries and keys to the frequency domain efficiently and accurately. This mapping removes the softmax operation from the attention mechanism, simplifying the attention computation and keeping the cost of adapting a vanilla Transformer into a linear attention model modest.
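A minimal sketch of what such DCT-based kernelized attention could look like in NumPy/SciPy. The exponential nonlinearity and the optional per-frequency `weights` below are illustrative stand-ins for the paper's weighted Quasi-Monte Carlo construction, not the authors' exact formulation:

```python
import numpy as np
from scipy.fft import dct

def dct_feature_map(x, weights=None):
    # Type-II DCT maps each token vector to the frequency domain.
    f = dct(x, type=2, norm="ortho", axis=-1)
    if weights is not None:            # stand-in for weighted QMC coefficients
        f = f * weights
    # A positive nonlinearity keeps the implied kernel softmax-like;
    # subtracting the max is for numerical stability only.
    return np.exp(f - f.max(axis=-1, keepdims=True))

def frequency_domain_attention(Q, K, V):
    Qf, Kf = dct_feature_map(Q), dct_feature_map(K)
    KV = Kf.T @ V                      # (d, d): keys and values aggregated once
    Z = Qf @ Kf.sum(axis=0)            # (n,): per-query normalizer
    return (Qf @ KV) / Z[:, None]      # O(n * d^2), no softmax, no n-by-n matrix
```

Because the DCT is a fixed orthogonal transform rather than a learned module, this style of mapping is what keeps the adaptation cost from a vanilla Transformer low.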
The team's comprehensive experiments confirm that DiJiang achieves performance comparable to conventional Transformers while cutting training costs by roughly a factor of ten and delivering inference speeds up to ten times faster. Their theoretical analysis further shows that the frequency domain mapping is approximately equivalent to the original attention mechanism. Promising broader applicability across natural language processing and beyond, this technique marks a substantial advance in the creation of efficient and scalable Transformer models.
Check out the Paper and Github. All credit for this research goes to the researchers of this project.