
    How Bag of Words Works – The Foundation of Language Models

    August 26, 2025

    When people talk about modern AI, they often point to large language models like ChatGPT.

    These models seem smart, as they’re able to write, answer, and explain in natural language.

    But the roots of this technology go back to something very simple: the Bag of Words model. This method, which first appeared decades ago, was one of the earliest ways to turn text into numbers. Without it, the progress we see in natural language processing today would not have been possible.

    In this article, you’ll learn what the Bag of Words algorithm is and write your own code to generate a bag-of-words representation.

    What is Bag of Words?

    Bag of Words, often called BoW, is a method for representing text. It takes a sentence, paragraph, or document and treats it as a “bag” of words.

    Word order, grammar, and sentence structure are ignored. Only the presence or frequency of each word matters.

    Take the sentence:

    The cat sat on the mat.
    

    In Bag of Words, it becomes a count of words.

    the: 2, cat: 1, sat: 1, on: 1, mat: 1
    

    Another sentence like this:

    The mat sat on the cat
    

    looks the same, even though the meaning is different.

    This is both the strength and weakness of BoW. It makes text easy for computers to process but throws away context.
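    To see how little machinery this takes, here is a minimal from-scratch sketch in Python. It uses collections.Counter and a deliberately crude tokenizer that just lowercases, strips periods, and splits on whitespace:

    from collections import Counter

    def bag_of_words(sentence):
        # Crude tokenization: lowercase, drop periods, split on whitespace
        tokens = sentence.lower().replace(".", "").split()
        return Counter(tokens)

    print(bag_of_words("The cat sat on the mat."))
    # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})

    # Word order is invisible to the model: both sentences produce equal bags
    print(bag_of_words("The cat sat on the mat.") == bag_of_words("The mat sat on the cat."))
    # True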

    Why BoW Was a Breakthrough

    Before Bag of Words, computers had no easy way to handle human language. Words are not numbers, and algorithms need numbers to work.

    BoW gave researchers a way to transform messy text into vectors of counts. Once in numeric form, words could be used in statistics, clustering, and machine learning.

    Early applications included spam filters, where certain words like “free” or “win” signaled unwanted emails. Search engines also used it to match queries with documents. For the first time, text could be processed at scale.

    A Simple Bag of Words Example in Python

    Here’s a short example to see Bag of Words in action. We’ll take a few sentences and convert them into word count vectors.

    <span class="hljs-keyword">from</span> sklearn.feature_extraction.text <span class="hljs-keyword">import</span> CountVectorizer
    
    docs = [
        <span class="hljs-string">"the cat sat on the mat"</span>,
        <span class="hljs-string">"the dog barked at the cat"</span>,
        <span class="hljs-string">"the mat was sat on by the cat"</span>
    ]
    
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)
    
    print(<span class="hljs-string">"Vocabulary:"</span>, vectorizer.get_feature_names_out())
    print(<span class="hljs-string">"Document-Term Matrix:n"</span>, X.toarray())
    

    It will give you the following result:

    Vocabulary: ['at' 'barked' 'by' 'cat' 'dog' 'mat' 'on' 'sat' 'the' 'was']
    Document-Term Matrix:
     [[0 0 0 1 0 1 1 1 2 0]
     [1 1 0 1 1 0 0 0 2 0]
     [0 0 1 1 0 1 1 1 2 1]]

    What you’re seeing in the output is the Bag of Words model turning your sentences into numbers. The first line shows the vocabulary, which is the collection of every unique word that appeared across the three input sentences.

    The words “at,” “barked,” “by,” “cat,” “dog,” “mat,” “on,” “sat,” “the,” and “was” all become part of this dictionary. The vocabulary has a fixed (alphabetical) order, and that order maps each word to a column in the matrix.

    The second part of the output is the document-term matrix. Each row in this matrix represents one document, and each number in the row tells you how many times the word from the vocabulary appeared in that document.

    For example, in the first row, which corresponds to the sentence “the cat sat on the mat,” the values line up with the vocabulary to show that “the” appeared twice, while “cat,” “sat,” “on,” and “mat” each appeared once. Every other word in the vocabulary for that row is a zero, meaning it never showed up in that document.
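    You can check this alignment yourself by pairing each vocabulary entry with its count in the first row (a small continuation of the snippet above):

    row = X.toarray()[0]  # counts for "the cat sat on the mat"
    print(dict(zip(vectorizer.get_feature_names_out(), row.tolist())))
    # {'at': 0, 'barked': 0, 'by': 0, 'cat': 1, 'dog': 0, 'mat': 1, 'on': 1, 'sat': 1, 'the': 2, 'was': 0}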

    This is the essence of Bag of Words. It reduces each sentence to a row of numbers, where meaning and grammar are ignored, and only the counts of words are kept. Instead of working with raw text, the machine now works with a structured table of numbers.

    That simple idea is what made it possible for computers to start analyzing and learning from language.

    Where Bag of Words Falls Short

    As useful as it was, Bag of Words has limits.

    The most obvious one is that it ignores meaning. Sentences with reversed roles (“dog chases cat” vs. “cat chases dog”) end up with the same vector.

    BoW also can’t handle synonyms well. Words like “happy” and “joyful” are treated as different, even though they mean the same thing.

    Another problem is size. If the dataset has thousands of unique words, the vectors become very large and sparse. Most values are zeros, which makes storage and computation less efficient.
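    You can already measure this on the toy matrix from earlier: scikit-learn keeps it in a sparse format and stores only the non-zero entries. A quick check (with a realistic vocabulary of tens of thousands of words, the share of zeros gets far higher):

    # X is stored as a SciPy sparse matrix: only non-zero counts are kept
    total_cells = X.shape[0] * X.shape[1]
    print(X.nnz, "non-zero entries out of", total_cells)  # 17 non-zero entries out of 30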

    From Bag of Words to Better Models

    Bag of Words inspired better methods. One improvement was TF-IDF, which gave higher weight to rare but important words, and lower weight to common ones like “the.”
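    In scikit-learn the switch is a one-line change: swap CountVectorizer for TfidfVectorizer. A sketch on the same three documents (exact weights depend on its default smoothing and normalization settings):

    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer()
    W = tfidf.fit_transform(docs)

    # "the" and "cat" appear in every document, so their idf scores are the lowest
    for word, idf in zip(tfidf.get_feature_names_out(), tfidf.idf_):
        print(f"{word}: {idf:.2f}")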

    Later came word embeddings such as Word2Vec and GloVe. Instead of counting words, embeddings map them into dense vectors where meanings and relationships are captured. Words like “king” and “queen” end up close together in this space.
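    Here is a minimal sketch assuming the gensim library is installed; a real model needs far more text than three sentences, so the numbers it prints are illustrative only:

    from gensim.models import Word2Vec

    sentences = [doc.split() for doc in docs]  # reusing the docs from earlier
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)

    print(model.wv["cat"][:5])                # a dense 50-dimensional vector (first 5 values)
    print(model.wv.similarity("cat", "dog"))  # cosine similarity between two word vectors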

    Modern transformers, like BERT and GPT, push this even further. They not only capture word meaning but also context. The word “bank” in “river bank” and “money bank” will have different embeddings depending on the sentence. This is something Bag of Words could never do.
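    A sketch of that effect, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (both are assumptions, not part of this article’s code): it pulls out the hidden state for “bank” in two sentences and compares them.

    from transformers import AutoTokenizer, AutoModel
    import torch

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embedding_of(word, sentence):
        # Encode the sentence and return the contextual vector at `word`'s position
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]
        position = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids(word))
        return hidden[position]

    river = embedding_of("bank", "he sat on the river bank")
    money = embedding_of("bank", "she went to the bank to deposit money")

    # Same word, two different vectors: the similarity is noticeably below 1.0
    print(torch.cosine_similarity(river, money, dim=0).item())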

    Why Bag of Words Still Matters

    Even today, Bag of Words is not useless. For small projects with limited data, it can still provide strong results.

    A quick text classifier using BoW often works faster and requires less computing power than training a deep neural network. In teaching, it is also valuable because it shows the first step of turning raw text into machine-readable form.
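    For example, a spam detector takes only a few lines (a sketch with made-up training sentences standing in for a real labeled dataset):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical toy data; a real classifier needs a proper labeled corpus
    texts = ["win a free prize now", "free money click here",
             "meeting at noon tomorrow", "project update attached"]
    labels = ["spam", "spam", "ham", "ham"]

    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(texts, labels)

    print(clf.predict(["claim your free prize"]))  # most likely ['spam']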

    More importantly, the core idea of Bag of Words lives on. Large language models still convert text into vectors. The difference is that they do it in a far more complex and meaningful way.

    Bag of Words was the spark that made researchers realize: to process language, we must first represent it as numbers.

    Conclusion

    Bag of Words looks simple, maybe even primitive, compared to the tools we use now. But it was a turning point. It gave computers a way to see text as data, and it laid the foundation for everything that came after. While it can’t capture deep meaning or context, it taught us how to bridge the gap between words and numbers.

    Large language models may seem like magic, but their roots go back to Bag of Words. The journey from counting words in a sentence to transformers with billions of parameters is proof that big revolutions in technology often start with small, simple ideas.

    Hope you enjoyed this article. Sign up for my free AI newsletter TuringTalks.ai for more hands-on tutorials on AI. You can also find me on LinkedIn.

    Source: freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More 
