
    A Step-by-Step Guide to Setting Up a Custom BPE Tokenizer with Tiktoken for Advanced NLP Applications in Python

    February 17, 2025

    In this tutorial, we'll create a custom tokenizer using the tiktoken library. The process involves loading a pre-trained tokenizer model, defining special tokens on top of the base vocabulary, initializing the tokenizer with a specific regular expression for token splitting, and testing it by encoding and decoding sample text. This setup is useful for NLP tasks that need precise control over text tokenization.

    from pathlib import Path
    import tiktoken
    from tiktoken.load import load_tiktoken_bpe
    import json

    Here, we import the modules the tokenizer setup needs. Path from pathlib handles file paths, while tiktoken and load_tiktoken_bpe load and work with a Byte Pair Encoding (BPE) tokenizer. The json import is not used in the snippets that follow.

    tokenizer_path = "./content/tokenizer.model"
    num_reserved_special_tokens = 256
    
    
    mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
    
    
    num_base_tokens = len(mergeable_ranks)
    special_tokens = [
        "<|begin_of_text|>",
        "<|end_of_text|>",
        "<|reserved_special_token_0|>",
        "<|reserved_special_token_1|>",
        "<|finetune_right_pad_id|>",
        "<|step_id|>",
        "<|start_header_id|>",
        "<|end_header_id|>",
        "<|eom_id|>",
        "<|eot_id|>",
        "<|python_tag|>",
    ]
    

    Here, we set the path to the tokenizer model and reserve 256 slots for special tokens. We then load the mergeable ranks, which form the base vocabulary, count the base tokens, and define a list of named special tokens that mark text boundaries and serve other reserved purposes.
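For intuition about what the mergeable ranks look like, the sketch below parses tokenizer-file content by hand. It assumes the file stores one token per line as a base64-encoded byte string followed by its integer rank. The `parse_bpe_ranks` helper and the three-entry `sample` vocabulary are hypothetical and exist only for this illustration; in the tutorial, `load_tiktoken_bpe` reads the real file for you.

```python
import base64


def parse_bpe_ranks(contents: bytes) -> dict:
    """Parse lines of '<base64 token> <rank>' into a bytes -> rank mapping."""
    ranks = {}
    for line in contents.splitlines():
        if not line:
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Hypothetical three-entry vocabulary, built in memory only for this demo.
sample = b"\n".join(
    base64.b64encode(token) + b" " + str(rank).encode()
    for rank, token in enumerate([b"H", b"e", b"He"])
)

toy_ranks = parse_bpe_ranks(sample)
print(toy_ranks)       # {b'H': 0, b'e': 1, b'He': 2}
print(len(toy_ranks))  # 3
```

The keys are raw bytes rather than strings, which is why the base vocabulary can represent any input, including text that is not valid UTF-8 on its own.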

    reserved_tokens = [
        f"<|reserved_special_token_{2 + i}|>"
        for i in range(num_reserved_special_tokens - len(special_tokens))
    ]
    special_tokens = special_tokens + reserved_tokens
    
    
    tokenizer = tiktoken.Encoding(
        name=Path(tokenizer_path).name,
        pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
        mergeable_ranks=mergeable_ranks,
        special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
    )
    

    Now, we dynamically create additional reserved tokens to reach 256, then append them to the predefined special tokens list. We then initialize the tokenizer with tiktoken.Encoding, passing a regular expression for splitting text, the loaded mergeable ranks as the base vocabulary, and a mapping from each special token to a unique ID that starts right after the last base token.
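To make the ID layout concrete, here is a small standalone sketch of the same arithmetic. The base vocabulary size of 128,000 is an assumed value chosen for illustration; in the tutorial it is whatever `len(mergeable_ranks)` returns for your model file.

```python
# Assumed base vocabulary size; the tutorial uses len(mergeable_ranks).
num_base_tokens = 128_000  # hypothetical value for illustration
num_reserved_special_tokens = 256

# The eleven named special tokens from the tutorial.
named_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|finetune_right_pad_id|>",
    "<|step_id|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|eom_id|>",
    "<|eot_id|>",
    "<|python_tag|>",
]

# Fill the remaining slots; numbering starts at 2 because 0 and 1 are named above.
reserved_tokens = [
    f"<|reserved_special_token_{2 + i}|>"
    for i in range(num_reserved_special_tokens - len(named_tokens))
]
all_special = named_tokens + reserved_tokens

# Special token IDs follow directly after the base vocabulary.
special_ids = {tok: num_base_tokens + i for i, tok in enumerate(all_special)}

print(len(all_special))                  # 256
print(special_ids["<|begin_of_text|>"])  # 128000
print(reserved_tokens[-1])               # <|reserved_special_token_246|>
print(special_ids[reserved_tokens[-1]])  # 128255
```

Because the IDs form one contiguous block after the base tokens, the total vocabulary size is the base size plus 256, regardless of how many of the reserved slots are actually used.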

    #-------------------------------------------------------------------------
    # Test the tokenizer with a sample text
    #-------------------------------------------------------------------------
    sample_text = "Hello, this is a test of the updated tokenizer!"
    encoded = tokenizer.encode(sample_text)
    decoded = tokenizer.decode(encoded)
    
    
    print("Sample Text:", sample_text)
    print("Encoded Tokens:", encoded)
    print("Decoded Text:", decoded)

    We test the tokenizer by encoding a sample text into token IDs and then decoding those IDs back into text. The script prints the original text, the encoded tokens, and the decoded text; if the decoded text matches the original, the round trip works.
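A single printed sample is easy to misread, so a slightly stricter check can help. The sketch below is one possible approach, not part of the original tutorial: `find_round_trip_failures` is a hypothetical helper that accepts any pair of encode and decode callables. To keep it runnable without a tokenizer file, a stand-in codec that uses raw UTF-8 bytes as "token IDs" takes the place of the real tokenizer.

```python
def find_round_trip_failures(encode, decode, texts):
    """Return the texts that do not survive encode -> decode unchanged."""
    return [text for text in texts if decode(encode(text)) != text]


# Stand-in codec so the sketch runs on its own:
# raw UTF-8 byte values play the role of token IDs.
def toy_encode(text):
    return list(text.encode("utf-8"))


def toy_decode(ids):
    return bytes(ids).decode("utf-8")


samples = [
    "Hello, this is a test of the updated tokenizer!",
    "",
    "naïve café",
    "line one\nline two",
]

failures = find_round_trip_failures(toy_encode, toy_decode, samples)
print(failures)  # []
```

With the tokenizer built above, the call would be `find_round_trip_failures(tokenizer.encode, tokenizer.decode, samples)`. Keep special-token strings such as `<|begin_of_text|>` out of the samples unless you pass `allowed_special` to `encode`, since tiktoken raises a `ValueError` by default when the input text contains a special token.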

    tokenizer.encode("Hey")

    Here, we encode the string “Hey” into its corresponding token IDs using the tokenizer’s encoding method.

    In conclusion, this tutorial showed how to set up a custom BPE tokenizer using the tiktoken library. You loaded a pre-trained tokenizer model, defined special tokens on top of the base vocabulary, and initialized the tokenizer with a specific regular expression for token splitting. Finally, you verified the tokenizer's functionality by encoding and decoding sample text. This setup is a useful starting point for NLP projects that require customized text processing and tokenization.



    The post A Step-by-Step Guide to Setting Up a Custom BPE Tokenizer with Tiktoken for Advanced NLP Applications in Python appeared first on MarkTechPost.
