Close Menu
    DevStackTipsDevStackTips
    • Home
    • News & Updates
      1. Tech & Work
      2. View All

      Sunshine And March Vibes (2025 Wallpapers Edition)

      June 1, 2025

      The Case For Minimal WordPress Setups: A Contrarian View On Theme Frameworks

      June 1, 2025

      How To Fix Largest Contentful Paint Issues With Subpart Analysis

      June 1, 2025

      How To Prevent WordPress SQL Injection Attacks

      June 1, 2025

      7 MagSafe accessories that I recommend every iPhone user should have

      June 1, 2025

      I replaced my Kindle with an iPad Mini as my ebook reader – 8 reasons why I don’t regret it

      June 1, 2025

      Windows 11 version 25H2: Everything you need to know about Microsoft’s next OS release

      May 31, 2025

      Elden Ring Nightreign already has a duos Seamless Co-op mod from the creator of the beloved original, and it’ll be “expanded on in the future”

      May 31, 2025
    • Development
      1. Algorithms & Data Structures
      2. Artificial Intelligence
      3. Back-End Development
      4. Databases
      5. Front-End Development
      6. Libraries & Frameworks
      7. Machine Learning
      8. Security
      9. Software Engineering
      10. Tools & IDEs
      11. Web Design
      12. Web Development
      13. Web Security
      14. Programming Languages
        • PHP
        • JavaScript
      Featured

      Student Record Android App using SQLite

      June 1, 2025
      Recent

      Student Record Android App using SQLite

      June 1, 2025

      When Array uses less memory than Uint8Array (in V8)

      June 1, 2025

      Laravel 12 Starter Kits: Definite Guide Which to Choose

      June 1, 2025
    • Operating Systems
      1. Windows
      2. Linux
      3. macOS
      Featured

      Photobooth is photobooth software for the Raspberry Pi and PC

      June 1, 2025
      Recent

      Photobooth is photobooth software for the Raspberry Pi and PC

      June 1, 2025

      Le notizie minori del mondo GNU/Linux e dintorni della settimana nr 22/2025

      June 1, 2025

      Rilasciata PorteuX 2.1: Novità e Approfondimenti sulla Distribuzione GNU/Linux Portatile Basata su Slackware

      June 1, 2025
    • Learning Resources
      • Books
      • Cheatsheets
      • Tutorials & Guides
    Home»Development»Machine Learning»A Step-by-Step Guide to Setting Up a Custom BPE Tokenizer with Tiktoken for Advanced NLP Applications in Python

    A Step-by-Step Guide to Setting Up a Custom BPE Tokenizer with Tiktoken for Advanced NLP Applications in Python

    February 17, 2025

    In this tutorial, we’ll learn how to create a custom tokenizer using the tiktoken library. The process involves loading a pre-trained tokenizer model, defining both base and special tokens, initializing the tokenizer with a specific regular expression for token splitting, and testing its functionality by encoding and decoding some sample text. This setup is essential for NLP tasks requiring precise control over text tokenization.

    Copy CodeCopiedUse a different Browser
    from pathlib import Path
    import tiktoken
    from tiktoken.load import load_tiktoken_bpe
    import json

    Here, we import several libraries essential for text processing and machine learning. It uses Path from pathlib for easy file path management, while tiktoken and load_tiktoken_bpe facilitate loading and working with a Byte Pair Encoding tokenizer.

    Copy CodeCopiedUse a different Browser
    tokenizer_path = "./content/tokenizer.model"
    num_reserved_special_tokens = 256
    
    
    mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
    
    
    num_base_tokens = len(mergeable_ranks)
    special_tokens = [
        "<|begin_of_text|>",
        "<|end_of_text|>",
        "<|reserved_special_token_0|>",
        "<|reserved_special_token_1|>",
        "<|finetune_right_pad_id|>",
        "<|step_id|>",
        "<|start_header_id|>",
        "<|end_header_id|>",
        "<|eom_id|>",
        "<|eot_id|>",
        "<|python_tag|>",
    ]
    

    Here, we set the path to the tokenizer model, specifying 256 reserved special tokens. It then loads the mergeable ranks, which form the base vocabulary, calculates the number of base tokens, and defines a list of special tokens for marking text boundaries and other reserved purposes.

    Copy CodeCopiedUse a different Browser
    reserved_tokens = [
        f"<|reserved_special_token_{2 + i}|>"
        for i in range(num_reserved_special_tokens - len(special_tokens))
    ]
    special_tokens = special_tokens + reserved_tokens
    
    
    tokenizer = tiktoken.Encoding(
        name=Path(tokenizer_path).name,
        pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^rnp{L}p{N}]?p{L}+|p{N}{1,3}| ?[^sp{L}p{N}]+[rn]*|s*[rn]+|s+(?!S)|s+",
        mergeable_ranks=mergeable_ranks,
        special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
    )
    

    Now, we dynamically create additional reserved tokens to reach 256, then append them to the predefined special tokens list. It initializes the tokenizer using tiktoken. Encoding with a specified regular expression for splitting text, the loaded mergeable ranks as the base vocabulary, and mapping special tokens to unique token IDs.

    Copy CodeCopiedUse a different Browser
    #-------------------------------------------------------------------------
    # Test the tokenizer with a sample text
    #-------------------------------------------------------------------------
    sample_text = "Hello, this is a test of the updated tokenizer!"
    encoded = tokenizer.encode(sample_text)
    decoded = tokenizer.decode(encoded)
    
    
    print("Sample Text:", sample_text)
    print("Encoded Tokens:", encoded)
    print("Decoded Text:", decoded)

    We test the tokenizer by encoding a sample text into token IDs and then decoding those IDs back into text. It prints the original text, the encoded tokens, and the decoded text to confirm that the tokenizer works correctly.

    Copy CodeCopiedUse a different Browser
    tokenizer.encode("Hey")

    Here, we encode the string “Hey” into its corresponding token IDs using the tokenizer’s encoding method.

    In conclusion, following this tutorial will teach you how to set up a custom BPE tokenizer using the TikToken library. You saw how to load a pre-trained tokenizer model, define both base and special tokens, and initialize the tokenizer with a specific regular expression for token splitting. Finally, you verified the tokenizer’s functionality by encoding and decoding sample text. This setup is a fundamental step for any NLP project that requires customized text processing and tokenization.


    Here is the Colab Notebook for the above project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. Don’t Forget to join our 75k+ ML SubReddit.

    🚨 Recommended Open-Source AI Platform: ‘IntellAgent is a An Open-Source Multi-Agent Framework to Evaluate Complex Conversational AI System’ (Promoted)

    The post A Step-by-Step Guide to Setting Up a Custom BPE Tokenizer with Tiktoken for Advanced NLP Applications in Python appeared first on MarkTechPost.

    Source: Read More 

    Facebook Twitter Reddit Email Copy Link
    Previous ArticleUsing Amazon Rekognition to improve bicycle safety
    Next Article Enhancing Reasoning Capabilities in Low-Resource Language Models through Efficient Model Merging

    Related Posts

    Machine Learning

    How to Evaluate Jailbreak Methods: A Case Study with the StrongREJECT Benchmark

    June 1, 2025
    Machine Learning

    BOND 2025 AI Trends Report Shows AI Ecosystem Growing Faster than Ever with Explosive User and Developer Adoption

    June 1, 2025
    Leave A Reply Cancel Reply

    Continue Reading

    iPhone 16 Mockup for Figma

    Development

    Healthcare Content Audit: 6 More Benefits for Your Organization

    Development

    Use Huge Icons As React Components – hugeicons-react

    Development

    HP shows off three next-gen laptops and revamps it desktops at CES 2025

    Development

    Highlights

    Development

    Arcee AI Introduces Arcee Agent: A Cutting-Edge 7B Parameter Language Model Specifically Designed for Function Calling and Tool Use

    July 6, 2024

    Arcee AI has recently released its latest innovation, the Arcee Agent, a state-of-the-art 7 billion…

    Japan Considers Measures To Mandate Private Sector Cybersecurity Incident Reporting

    August 6, 2024

    The ClipboardItem.supports() function is now Baseline Newly available

    May 20, 2025

    Exploring institutions for global AI governance

    May 27, 2025
    © DevStackTips 2025. All rights reserved.
    • Contact
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.