
    Building a Legal AI Chatbot: A Step-by-Step Guide Using bigscience/T0pp LLM, Open-Source NLP Models, Streamlit, PyTorch, and Hugging Face Transformers

    February 24, 2025

    In this tutorial, we build a Legal AI chatbot using open-source tools: the bigscience/T0pp LLM, Hugging Face Transformers, and PyTorch. We walk through loading the model, preprocessing legal text and extracting entities with spaCy, retrieving relevant documents with FAISS, and generating answers to legal queries.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
    
    
    model_name = "bigscience/T0pp"  # Open-source and available
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    

    First, we load bigscience/T0pp, an open-source LLM, using Hugging Face Transformers. AutoTokenizer loads the tokenizer that converts text into token IDs, and AutoModelForSeq2SeqLM loads the encoder-decoder model that generates text, such as answers to legal queries.
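Before loading a model of this size, it can help to estimate how much memory the weights alone will need. The sketch below is only back-of-the-envelope arithmetic: it assumes roughly 11 billion parameters for T0pp (an approximate figure, not taken from the article), and `weight_memory_gb` is a hypothetical helper name. It ignores activations and any framework overhead.

```python
# Rough estimate of the memory needed just to hold model weights.
# ASSUMPTION: ~11 billion parameters for T0pp (approximate figure).

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Return approximate weight memory in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


T0PP_PARAMS = 11_000_000_000  # assumed, approximate

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_memory_gb(T0PP_PARAMS, dtype):.0f} GB")
```

If the default precision does not fit on your hardware, `from_pretrained` accepts a `torch_dtype` argument that may reduce the footprint; whether a lower precision is acceptable for this model is something to verify against its model card.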

    import spacy
    import re
    
    
    nlp = spacy.load("en_core_web_sm")
    
    
    def preprocess_legal_text(text):
        text = text.lower()
        text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into one space
        text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters, keep letters, digits, whitespace
        doc = nlp(text)
        tokens = [token.lemma_ for token in doc if not token.is_stop]  # Lemmatization
        return " ".join(tokens)
    
    
    sample_text = "The contract is valid for 5 years, terminating on December 31, 2025."
    print(preprocess_legal_text(sample_text))

    Then, we preprocess legal text with spaCy and regular expressions. The function lowercases the text, collapses extra whitespace, and strips special characters with regex, then tokenizes and lemmatizes the result with spaCy’s pipeline, dropping stop words so that only content-bearing terms remain. The cleaned text gives downstream machine learning components, including language models like bigscience/T0pp, a more compact and uniform input.
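The regex steps can be tried on their own, without spaCy. The following is a minimal sketch using only the standard library; `normalize_text` is a hypothetical helper name, and it skips lemmatization and stop-word removal entirely. Note that the whitespace class is written `\s`, with a backslash; without it the pattern would match the literal letter "s".

```python
import re


def normalize_text(text):
    """Lowercase, collapse whitespace, and drop non-alphanumeric characters."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)         # \s matches any whitespace character
    text = re.sub(r"[^a-z0-9\s]", "", text)  # keep only letters, digits, whitespace
    return text.strip()


sample = "The  contract is valid for 5 years,\n terminating on December 31, 2025."
print(normalize_text(sample))
# the contract is valid for 5 years terminating on december 31 2025
```

One side effect worth noticing: stripping punctuation also removes the commas and the period, so date strings such as "December 31, 2025" lose their original formatting.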

    def extract_legal_entities(text):
        doc = nlp(text)
        entities = [(ent.text, ent.label_) for ent in doc.ents]
        return entities
    
    
    sample_text = "Apple Inc. signed a contract with Microsoft on June 15, 2023."
    print(extract_legal_entities(sample_text))

    Here, we extract entities from legal text using spaCy’s Named Entity Recognition (NER). The function runs the input through the spaCy pipeline and returns a list of tuples, each pairing a recognized entity’s text with its label (for example, ORG for organizations, DATE for dates, or LAW for named legal documents).
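A list of (text, label) tuples is often easier to work with once it is grouped by label. The sketch below shows one way to do that; `group_entities_by_label` is a hypothetical helper, and the example tuples are hand-written for illustration rather than actual spaCy output.

```python
from collections import defaultdict


def group_entities_by_label(entities):
    """Group (text, label) tuples into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Hand-written illustrative tuples, NOT actual spaCy output.
example_entities = [
    ("Apple Inc.", "ORG"),
    ("Microsoft", "ORG"),
    ("June 15, 2023", "DATE"),
]
print(group_entities_by_label(example_entities))
# {'ORG': ['Apple Inc.', 'Microsoft'], 'DATE': ['June 15, 2023']}
```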

    import faiss
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer
    
    
    embedding_model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    embedding_tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    
    
    def embed_text(text):
        inputs = embedding_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            output = embedding_model(**inputs)
        embedding = output.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()  # Ensure 1D vector
        return embedding
    
    
    legal_docs = [
        "A contract is legally binding if signed by both parties.",
        "An NDA prevents disclosure of confidential information.",
        "A non-compete agreement prohibits working for a competitor."
    ]
    
    
    doc_embeddings = np.array([embed_text(doc) for doc in legal_docs])
    
    
    print("Embeddings Shape:", doc_embeddings.shape)  # Should be (num_samples, embedding_dim)
    
    
    index = faiss.IndexFlatL2(doc_embeddings.shape[1])  # Dimension should match embedding size
    index.add(doc_embeddings)
    
    
    query = "What happens if I break an NDA?"
    query_embedding = embed_text(query).reshape(1, -1)  # Reshape for FAISS
    _, retrieved_indices = index.search(query_embedding, 1)
    
    
    print(f"Best matching legal text: {legal_docs[retrieved_indices[0][0]]}")

    With the above code, we build a legal document retrieval system using FAISS for semantic search. We load the all-MiniLM-L6-v2 model from Hugging Face and, in embed_text, mean-pool its last hidden state to turn each document or query into a single fixed-size vector. The document vectors are stored in a FAISS IndexFlatL2 index, which performs exact nearest-neighbor search by L2 distance. The query is embedded the same way, and the closest document is returned.
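To make the search step concrete, here is a pure-Python sketch of what a flat L2 index does conceptually: compare the query against every stored vector and return the closest ones by squared L2 distance. The function names are hypothetical, and the 3-dimensional vectors are made up for illustration (real all-MiniLM-L6-v2 embeddings have 384 dimensions). It is a stand-in for understanding, not a replacement for FAISS.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search(vectors, query, k=1):
    """Return (distances, indices) of the k vectors nearest to the query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy vectors, made up for illustration.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [0.1, 0.9, 0.0]

distances, indices = search(docs, query, k=2)
print(indices)  # [1, 0]
```

The brute-force scan is linear in the number of documents, which is fine for a handful of passages; larger collections are where a dedicated index library earns its place.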

    def legal_chatbot(query):
        inputs = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
        output = model.generate(**inputs, max_length=100)
        return tokenizer.decode(output[0], skip_special_tokens=True)
    
    
    query = "What happens if I break an NDA?"
    print(legal_chatbot(query))

    Finally, we define the chatbot function, which generates responses to legal queries with the pre-trained language model. legal_chatbot tokenizes the user query, calls model.generate with a maximum output length of 100 tokens, and decodes the first generated sequence into readable text, skipping special tokens. Given a query like “What happens if I break an NDA?”, it returns the model’s generated answer.
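The function above passes only the raw query to the model. One possible extension, sketched here as an assumption rather than as part of the original tutorial, is to prepend a retrieved passage to the query before generation. `build_prompt` is a hypothetical helper, and the "Context / Question / Answer" template is an illustrative choice, not a format prescribed for T0pp.

```python
def build_prompt(question, context_passages):
    """Combine retrieved passages and a question into a single prompt string."""
    context = " ".join(p.strip() for p in context_passages if p.strip())
    if not context:
        return question  # no context available: fall back to the bare question
    return f"Context: {context}\nQuestion: {question}\nAnswer:"


prompt = build_prompt(
    "What happens if I break an NDA?",
    ["An NDA prevents disclosure of confidential information."],
)
print(prompt)
```

The resulting string could then be passed to a generation function in place of the bare query. Whether the added context actually improves answers depends on the model and the passages, so it is worth comparing outputs with and without it.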

    In conclusion, by combining bigscience/T0pp, Hugging Face Transformers, PyTorch, spaCy, and FAISS, we have shown how to assemble the core pieces of a Legal AI chatbot from open-source resources: text preprocessing, entity extraction, semantic retrieval, and response generation. This project is a foundation for building AI-powered legal tools that make legal assistance more accessible and automated.



    The post Building a Legal AI Chatbot: A Step-by-Step Guide Using bigscience/T0pp LLM, Open-Source NLP Models, Streamlit, PyTorch, and Hugging Face Transformers appeared first on MarkTechPost.
