The current design of causal language models, such as GPTs, is intrinsically limited in maintaining semantic coherence over longer stretches of text because of its one-token-ahead prediction objective. This design has enabled remarkable progress in generative AI, yet it often leads to “topic drift” when longer sequences are produced, since each predicted token depends only on the immediately preceding tokens rather than on a broader view of the sequence. This narrows the practical usefulness of these models in complex real-world applications that demand strict topic adherence, such as narrative generation, content creation, and coding tasks. Enabling multi-token prediction would therefore substantially improve the semantic continuity, accuracy, and coherence of the sequences that current generative language models produce.
Multi-token prediction has been approached in several ways, each with its own limitations. Models that predict multiple tokens by splitting embeddings or attaching multiple language-model heads are computationally expensive and often underperform. Seq2Seq models in encoder-decoder setups do allow multi-token prediction, but they struggle to capture the full past context in a single embedding, which introduces considerable inefficiency. BERT and other masked language models can predict multiple masked tokens within a sequence, but they are not suited to left-to-right generation, which restricts their use in sequential text prediction. ProphetNet, in turn, uses an n-gram prediction strategy, yet this approach does not generalize flexibly across a wide range of data types. The common drawbacks of these methods are scalability issues, wasted computation, and generally unimpressive quality when generating predictions over long contexts.
Researchers from EPFL introduce the Future Token Prediction (FTP) model, a new architecture that creates broader, context-aware token embeddings to enable seamless multi-token prediction. In contrast with standard models, the embedding from the encoder’s top layer is expanded into a “pseudo-sequence” that a small transformer decoder cross-attends over to predict the next tokens. Through this encoder-decoder arrangement, FTP retains contextual information from the preceding history, yielding smoother transitions and better topic coherence across multi-token predictions. Because its embeddings encode a wider view of the sequence, FTP provides stronger continuity in generated text, making it well suited to content generation and other applications that require long-form semantic coherence.
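As a rough illustration of this pseudo-sequence cross-attention idea, the PyTorch sketch below expands a single top-layer encoder embedding into a short pseudo-sequence that a small decoder cross-attends over while predicting future tokens. The module names, dimensions, and masking details here are assumptions for illustration, not the authors’ reference implementation.

```python
# Hypothetical sketch of the FTP head: one top-layer encoder embedding is expanded
# into a short "pseudo-sequence" that a small transformer decoder cross-attends over.
import torch
import torch.nn as nn

class PseudoSequenceHead(nn.Module):
    def __init__(self, d_model=768, pseudo_len=12, n_decoder_layers=3,
                 n_heads=12, vocab_size=50257):
        super().__init__()
        # Linear expansion of one embedding into `pseudo_len` memory vectors.
        self.expand = nn.Linear(d_model, pseudo_len * d_model)
        self.pseudo_len = pseudo_len
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_decoder_layers)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, enc_embedding, target_embeddings):
        # enc_embedding:      (batch, d_model) top-layer embedding at the current position
        # target_embeddings:  (batch, future_len, d_model) embeddings of the future
        #                     tokens, shifted right (teacher forcing during training)
        b, d = enc_embedding.shape
        memory = self.expand(enc_embedding).view(b, self.pseudo_len, d)  # pseudo-sequence
        causal_mask = nn.Transformer.generate_square_subsequent_mask(target_embeddings.size(1))
        hidden = self.decoder(target_embeddings, memory, tgt_mask=causal_mask)
        return self.lm_head(hidden)  # (batch, future_len, vocab_size) logits
```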
The FTP model employs a modified GPT-2 architecture with a 12-layer encoder and a 3-layer decoder. The encoder produces token embeddings that are linearly projected to a higher dimensionality to form a 12-dimensional pseudo-sequence, which the decoder cross-attends over to capture the sequence context. Embedding weights are shared between the encoder and decoder, and the model is trained on OpenWebText data using the GPT-2 tokenizer. Optimization uses AdamW with a batch size of 500 and a learning rate of 4e-4. A gamma parameter set to 0.8 progressively discounts the attention given to tokens farther into the future, so that immediate predictions remain highly accurate. In this way, the FTP model maintains semantic coherence without substantial computational overhead, striking a practical balance between efficiency and performance.
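The gamma discounting described above can be pictured as a weighted cross-entropy over future positions, as in the hedged sketch below; the exact weighting and normalization used in the paper may differ.

```python
# Sketch of a gamma-discounted future-token objective: the loss at the k-th
# future position is scaled by gamma**k (gamma = 0.8 per the article), so
# near-term predictions dominate. Normalization details are assumptions.
import torch
import torch.nn.functional as F

def discounted_future_loss(logits, targets, gamma=0.8):
    # logits:  (batch, future_len, vocab_size) decoder outputs for future positions
    # targets: (batch, future_len) ground-truth future token ids
    future_len = targets.size(1)
    weights = gamma ** torch.arange(future_len, dtype=logits.dtype, device=logits.device)
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"  # -> (batch, future_len)
    )
    # Divide by the total weight mass so the scale is comparable to standard CE.
    return (per_token * weights).sum(dim=1).mean() / weights.sum()

# Optimizer setup reported in the article (batch size 500):
# optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
```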
The results and evaluation show that the model delivers significant improvements over traditional GPT models on several key performance metrics: notable reductions in perplexity, better predictive accuracy, and greater stability on long-sequence tasks. It also achieves higher recall, precision, and F1 scores in BERT-based assessments of textual quality, implying closer semantic alignment with the reference text sequences. On text classification tasks such as IMDB and Amazon reviews, FTP outperforms GPT models, consistently achieving better validation loss and higher accuracy. Most importantly, FTP follows the topic of the generated text more coherently, as evidenced by higher cosine similarity scores in long-sequence evaluations, further establishing its strength in producing coherent, contextually relevant content across varied applications.
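For readers curious how such a cosine-similarity coherence check might look in practice, the snippet below compares mean-pooled BERT embeddings of a prompt and a generated continuation. The embedding model, pooling choice, and example strings are illustrative assumptions, not the authors’ exact evaluation protocol.

```python
# Illustrative topic-coherence check: cosine similarity between sentence-level
# embeddings of a prompt and its generated continuation.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pooled sentence vector

prompt = "A review discussing the pacing and cinematography of a film."
generation = "...model-generated continuation of the review..."
coherence = F.cosine_similarity(embed(prompt), embed(generation), dim=0)
print(f"topic coherence (cosine similarity): {coherence.item():.3f}")
```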
The FTP model represents a meaningful shift in causal language modeling, addressing the most critical inefficiencies of classic single-token methods with embeddings that encode a wider, context-sensitive view for multi-token prediction. It improves both predictive accuracy and semantic coherence, a difference underscored by better perplexity and BERT-based scores across a wide range of tasks. Its pseudo-sequence cross-attention mechanism advances generative AI by sustaining a consistent narrative flow, an essential requirement for topic-coherent language modeling in applications that demand semantic integrity.