
    Enhancing Lexicon-Based Text Embeddings with Large Language Models

    January 21, 2025

    Lexicon-based embeddings are a compelling alternative to dense embeddings, yet several challenges have restrained their wider adoption. One key problem is tokenization redundancy: subword tokenization splits semantically equivalent words into different tokens, causing inefficiencies and inconsistencies in the resulting embeddings. Another limitation comes from the unidirectional attention of causal LLMs, which prevents tokens from fully leveraging their surrounding context during pretraining. Together, these issues confine the adaptability and efficiency of lexicon-based embeddings, especially in tasks beyond information retrieval, motivating a stronger approach to make them more broadly useful.
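
    To make the tokenization-redundancy problem concrete, here is a minimal sketch using a Hugging Face tokenizer. The model name is only illustrative (any subword tokenizer exhibits the same effect, and the Mistral checkpoint requires Hub access): surface variants of the same word map to different token IDs, so a lexicon-based embedding spreads one concept across several dimensions.

```python
# Illustrative sketch of tokenization redundancy; the checkpoint name is an
# assumption — any subword tokenizer shows the same behavior.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Casing and leading-space variants of one word get different token IDs,
# fragmenting a single concept across multiple lexicon dimensions.
for text in ["hello", "Hello", " hello", "HELLO"]:
    print(repr(text), "->", tok.encode(text, add_special_tokens=False))
```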

    Several techniques have been proposed to make lexicon-based embeddings practical. SPLADE uses bidirectional attention and aligns the embeddings with a language-modeling objective, while PromptReps uses prompt engineering to elicit lexicon-based embeddings from causal LLMs. However, SPLADE is confined to smaller models and a specific retrieval task, restricting its applicability, and PromptReps, relying on unidirectional attention, lacks full contextual understanding and delivers suboptimal performance. Methods in this class often suffer from high computational cost, inefficiency caused by fragmented tokenization, and limited scope for broader applications such as clustering and classification.

    Researchers from the University of Amsterdam, the University of Technology Sydney, and Tencent IEG propose LENS (Lexicon-based EmbeddiNgS), a framework designed to address the limitations of current lexicon-based embedding techniques. LENS applies KMeans to cluster semantically similar tokens and merges their embeddings, reducing redundancy and dimensionality while preserving semantic depth. Bidirectional attention replaces the unidirectional attention of causal LLMs, letting each token draw on context from both sides. Experiments comparing pooling strategies showed that max-pooling works best for these embeddings. Finally, LENS can be combined with dense embeddings; the hybrid exploits the strengths of both approaches and performs better across a wide range of tasks. Together, these choices make LENS a versatile and efficient framework for producing interpretable, contextually aware embeddings for clustering, classification, and retrieval.
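
    The clustering idea can be sketched in a few lines. The sizes below are toy values for a runnable demo, not the paper's configuration (LENS clusters the full LM-head vocabulary into 4,000 or 8,000 centroids); the max-pooling step reflects the pooling strategy the paragraph above describes.

```python
# Minimal sketch of LENS-style token clustering with toy sizes; the real
# setup clusters the full vocabulary of the LM head into 4,000/8,000 groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vocab_size, hidden, n_clusters = 2_000, 64, 500        # toy stand-ins
token_emb = rng.standard_normal((vocab_size, hidden)).astype(np.float32)

kmeans = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit(token_emb)
centroids = kmeans.cluster_centers_                    # (n_clusters, hidden)

# Scoring hidden states against centroids instead of the full vocabulary
# gives one logit per cluster; max-pooling over the sequence then yields a
# lexicon embedding whose dimension is the cluster count.
seq_hidden = rng.standard_normal((12, hidden)).astype(np.float32)  # one sequence
logits = seq_hidden @ centroids.T                      # (seq_len, n_clusters)
embedding = logits.max(axis=0)                         # max-pool over tokens
```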

    Concretely, LENS replaces the original token embeddings in the language-modeling head with cluster centroids, removing redundancy and shrinking the embedding dimensionality. The resulting embeddings are 4,000- or 8,000-dimensional, making them as efficient and scalable as dense embeddings. During fine-tuning, the model adopts bidirectional attention, improving contextual understanding because each token can attend to its full context. The framework builds on the Mistral-7B model and is trained on public datasets covering retrieval, clustering, classification, and semantic textual similarity. Training is optimized with a streamlined single-stage pipeline and the InfoNCE loss. This methodology keeps the framework simple to use, scalable, and strong across tasks, making it suitable for a wide range of applications.
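
    For reference, InfoNCE is a standard contrastive objective; the sketch below shows the common in-batch-negatives form under assumed shapes and temperature, not the paper's exact training code.

```python
# Hedged sketch of an InfoNCE contrastive loss with in-batch negatives;
# the temperature and embedding dimension are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(q, p, temperature=0.05):
    """q, p: (batch, dim) L2-normalized query/passage embeddings."""
    logits = q @ p.T / temperature            # (batch, batch) similarities
    labels = torch.arange(q.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, labels)    # pull positives, push negatives

q = F.normalize(torch.randn(8, 8000), dim=-1)   # stand-in query embeddings
p = F.normalize(torch.randn(8, 8000), dim=-1)   # stand-in passage embeddings
print(info_nce(q, p))
```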

    LENS performs strongly across benchmarks such as the Massive Text Embedding Benchmark (MTEB) and AIR-Bench. On MTEB, LENS-8000 attains the highest mean score among publicly trained models, outperforming dense embeddings in six of seven task categories. The more compact LENS-4000 also performs competitively, underscoring the approach's scalability and efficiency. Combined with dense embeddings, LENS shows clear advantages, setting new baselines in retrieval tasks and delivering consistent improvements across datasets. Qualitative evaluations show that LENS captures semantic associations, reduces tokenization noise, and yields compact, informative embeddings. Strong generalization and competitive out-of-domain performance further establish its versatility across a wide variety of tasks.

    LENS is an exciting step forward for lexicon-based embedding models: it addresses tokenization redundancy while improving contextual representation through bidirectional attention. The model's compactness, efficiency, and interpretability hold across a wide spectrum of tasks, from retrieval to clustering and classification, where it outperforms traditional approaches. Its effectiveness alongside dense embeddings also points to its broader potential for text representation. Future studies may extend this work with multilingual datasets and larger models to broaden its significance and relevance within artificial intelligence.


    Check out the Paper. All credit for this research goes to the researchers of this project.
