
    How to Build Your Own Local AI: Create Free RAG and AI Agents with Qwen 3 and Ollama

    May 6, 2025

    The landscape of Artificial Intelligence is rapidly evolving, and one of the most exciting trends is the ability to run powerful Large Language Models (LLMs) directly on your local machine.

    This shift away from reliance on cloud-based APIs offers significant advantages in terms of privacy, cost-effectiveness, and offline accessibility. Developers and enthusiasts can now experiment with and deploy sophisticated AI capabilities without sending data externally or incurring API fees.

    This tutorial serves as a practical, hands-on guide to harnessing this local AI power. It focuses on leveraging the Qwen 3 family of LLMs, a state-of-the-art open-source offering from Alibaba, combined with Ollama, a tool that dramatically simplifies running LLMs locally.

    Prerequisites

    Before diving into this tutorial, you should have a foundational understanding of Python programming and be comfortable using the command line or terminal. Make sure you have Python 3 installed on your system.

    While prior experience with AI or Large Language Models (LLMs) is beneficial, it’s not essential, as I’ll introduce and explain core concepts like Retrieval-Augmented Generation (RAG) and AI agents throughout the guide.


    Table of Contents

    1. Local AI Power with Qwen 3 and Ollama

      • Ollama: Your Local LLM Gateway

      • Tutorial Roadmap

    2. How to Set Up Your Local AI Lab

      • Install Ollama

      • Choose Your Qwen 3 Model

      • Pull and Run Qwen 3 with Ollama

      • Set Up Your Python Environment

    3. How to Build a Local RAG System with Qwen 3

      • Step 1: Prepare Your Data

      • Step 2: Load Documents in Python

      • Step 3: Split Documents

      • Step 4: Choose and Configure Embedding Model

      • Step 5: Set Up Local Vector Store (ChromaDB)

      • Step 6: Index Documents (Embed and Store)

      • Step 7: Build the RAG Chain

      • Step 8: Query Your Documents

    4. How to Create Local AI Agents with Qwen 3

      • Step 1: Define Custom Tools

      • Step 2: Set Up the Agent LLM

      • Step 3: Create the Agent Prompt

      • Step 4: Build the Agent

      • Step 5: Create the Agent Executor

      • Step 6: Run the Agent

    5. Advanced Considerations and Troubleshooting

      • Controlling Qwen 3’s Thinking Mode with Ollama

      • Managing Context Length (num_ctx)

      • Hardware Limitations and VRAM

    6. Conclusion and Next Steps

    Local AI Power with Qwen 3 and Ollama

    Running LLMs locally addresses several key concerns associated with cloud-based AI services.

    • Privacy is paramount – data processed locally never leaves the user’s machine.

    • Cost is another major factor – utilizing open-source models and tools like Ollama eliminates API subscription fees and pay-per-token charges, making advanced AI accessible to everyone.

    • Local execution enables offline functionality – crucial for applications where internet connectivity is unreliable or undesirable.

    Ollama: Your Local LLM Gateway

    Ollama acts as a bridge, making the power of models like Qwen 3 accessible on local hardware. It’s a command-line tool that simplifies the download, setup, and execution of various open-source LLMs across macOS, Linux, and Windows.

    Ollama handles the complexities of model configuration and GPU utilization, providing a straightforward interface for developers and users. It also exposes an OpenAI-compatible API endpoint, allowing seamless integration with popular frameworks like LangChain.
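
    For example, because the endpoint mimics the OpenAI API, any OpenAI-compatible client can talk to your local model. The snippet below is a minimal sketch of that idea (it assumes the optional openai Python package is installed, and that Ollama is already running with a Qwen 3 model pulled, which the next section covers):

    from openai import OpenAI  # optional package, not required elsewhere in this tutorial

    # Point the client at the local Ollama server; the API key is ignored but must be non-empty.
    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

    completion = client.chat.completions.create(
        model="qwen3:8b",
        messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    )
    print(completion.choices[0].message.content)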

    Tutorial Roadmap

    This tutorial will guide you through the process of:

    1. Setting up a local AI environment: Installing Ollama and selecting/running appropriate Qwen 3 models.

    2. Building a local RAG system: Creating a system that allows chatting with personal documents using Qwen 3, Ollama, LangChain, and ChromaDB for vector storage.

    3. Creating a basic local AI agent: Developing a simple agent powered by Qwen 3 that can utilize custom-defined tools (functions).

    How to Set Up Your Local AI Lab

    The first step is to prepare your local machine with the necessary tools and models.

    Install Ollama

    Ollama provides the simplest path to running LLMs locally.

    • Linux / macOS: Open a terminal and run the official installation script:

        curl -fsSL https://ollama.com/install.sh | sh
      
    • Windows: Download the installer from the Ollama website (https://ollama.com/download) and follow the setup instructions.

    After installation, verify it by opening a new terminal window and running:

    ollama --version
    

    Ollama typically stores downloaded models in ~/.ollama/models on macOS and /usr/share/ollama/.ollama/models on Linux/WSL.

    Choose Your Qwen 3 Model

    Selecting the right Qwen 3 model is crucial and depends on your intended task and available hardware, primarily system RAM and GPU VRAM. Running larger models requires more resources but generally offers better performance and reasoning capabilities.

    Qwen 3 offers two main architectures available through Ollama:

    • Dense Models: (like qwen3:0.6b, qwen3:4b, qwen3:8b, qwen3:14b, qwen3:32b) These models activate all their parameters during inference. Their performance is predictable, but resource requirements scale directly with parameter count.

    • Mixture-of-Experts (MoE) Models: (like qwen3:30b-a3b) These models contain many “expert” sub-networks but only activate a small fraction for each input token. This allows them to achieve the performance characteristic of their large total parameter count (for example, 30 billion) while having inference costs closer to their smaller active parameter count (for example, 3 billion). They offer a compelling balance of capability and efficiency, especially for reasoning and coding tasks.

    Recommendation for this tutorial: For the examples that follow, qwen3:8b strikes a good balance between capability and resource requirements for many modern machines. If resources are more constrained, qwen3:4b is a viable alternative. The MoE model qwen3:30b-a3b offers excellent performance, especially for coding and reasoning, and runs surprisingly well on systems with 16GB+ VRAM due to its sparse activation.

    Pull and Run Qwen 3 with Ollama

    Once you’ve chosen a model, you’ll need to download it (pull it) via Ollama.

    Pull the model: Open the terminal and run (replace qwen3:8b with the desired tag):

    ollama pull qwen3:8b
    

    This command downloads the model weights and configuration.

    Run interactively (optional test): To chat directly with the model from the command line:

    ollama run qwen3:8b
    

    Type prompts directly into the terminal. Use /bye to exit the session. Other useful commands within the interactive session include /? for help and /set parameter <name> <value> (for example, /set parameter num_ctx 8192) to temporarily change model parameters for the current session. Use ollama list outside the session to see downloaded models.

    Run as a server: For integration with Python scripts (using LangChain), Ollama needs to run as a background server process, exposing an API. Open a separate terminal window and run:

    ollama serve
    

    Keep this terminal window open while running the Python scripts. This command starts the server, typically listening on http://localhost:11434, providing an OpenAI-compatible API endpoint.
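
    To confirm the server is reachable, you can hit its native REST API from another terminal. These are optional sanity checks, and the exact output depends on which models you have pulled:

    # List the models Ollama has available locally
    curl http://localhost:11434/api/tags

    # Send a single, non-streaming generation request
    curl http://localhost:11434/api/generate -d '{
      "model": "qwen3:8b",
      "prompt": "Reply with the single word: ready",
      "stream": false
    }'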

    Set Up Your Python Environment

    A dedicated Python environment is recommended for managing dependencies.

    Create a virtual environment:

    python -m venv venv
    

    Activate the environment:

    • macOS/Linux: source venv/bin/activate

    • Windows: venv\Scripts\activate

    Install necessary libraries:

    pip install langchain langchain-community langchain-core langchain-ollama chromadb sentence-transformers pypdf python-dotenv unstructured[pdf] tiktoken
    
    • langchain, langchain-community, langchain-core: The core LangChain framework for building LLM applications.

    • langchain-ollama: Specific integration for using Ollama models with LangChain.

    • chromadb: The local vector database for storing document embeddings.

    • sentence-transformers: Used for an alternative local embedding method (explained later).

    • pypdf: A library for loading PDF documents.

    • python-dotenv: For managing environment variables (optional but good practice).

    • unstructured[pdf]: An alternative, powerful document loader, especially for complex PDFs.

    • tiktoken: Used by LangChain for token counting.

    The local setup involves coordinating several independent components: Ollama itself, the specific Qwen 3 model weights, the Python environment, and various libraries like LangChain and ChromaDB. Ensuring compatibility between these pieces and correctly configuring parameters (like Ollama’s context window size or selecting a model appropriate for the available VRAM) is key to a smooth experience.

    While this modularity offers flexibility – allowing components like the LLM or vector store to be swapped – it also means the initial setup requires careful attention to detail. This tutorial aims to provide clear steps and sensible defaults to minimize potential friction points.

    How to Build a Local RAG System with Qwen 3

    Retrieval-Augmented Generation (RAG) is a powerful technique that enhances LLMs by providing them with external knowledge.

    Instead of relying solely on its training data, the LLM can retrieve relevant information from a specified document set (like local PDFs) and use that information to answer questions. This significantly reduces “hallucinations” (incorrect or fabricated information) and allows the LLM to answer questions about specific, private data without needing retraining.

    The core RAG process involves:

    1. Loading and splitting documents into manageable chunks.

    2. Converting these chunks into numerical representations (embeddings) using an embedding model.

    3. Storing these embeddings in a vector database for efficient searching.

    4. When a query comes in, embedding the query and searching the vector database for the most similar document chunks.

    5. Providing these relevant chunks (context) along with the original query to the LLM to generate an informed answer.

    Let’s build this locally using Qwen 3, Ollama, LangChain, and ChromaDB.

    Step 1: Prepare Your Data

    Create a directory named data in the project folder. Place the PDF document that you intend to query into this directory. For this tutorial, use a single, primarily text-based PDF (like a research paper or a report) for simplicity.

    mkdir data
    # Copy your PDF file into the 'data' directory
    # e.g., cp ~/Downloads/some_paper.pdf ./data/mydocument.pdf
    

    If you don’t have a PDF readily available that you’d like to use, you can download a sample PDF (the Llama 2 paper) for this tutorial using the following command in your terminal:

    
    wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"
    

    This command downloads the PDF and saves it as llama2.pdf inside the data directory you created with mkdir above. If you prefer to use your own document, place your PDF file into the data directory and update the filename in the subsequent Python code.

    Step 2: Load Documents in Python

    Use LangChain’s document loaders to read the PDF content. PyPDFLoader is straightforward for simple PDFs. UnstructuredPDFLoader (requires unstructured[pdf]) can handle more complex layouts but has more dependencies.

    # rag_local.py
    import os
    from dotenv import load_dotenv
    from langchain_community.document_loaders import PyPDFLoader # Or UnstructuredPDFLoader

    load_dotenv() # Optional: Loads environment variables from .env file

    DATA_PATH = "data/"
    PDF_FILENAME = "mydocument.pdf" # Replace with your PDF filename

    def load_documents():
        """Loads documents from the specified data path."""
        pdf_path = os.path.join(DATA_PATH, PDF_FILENAME)
        loader = PyPDFLoader(pdf_path)
        # loader = UnstructuredPDFLoader(pdf_path) # Alternative
        documents = loader.load()
        print(f"Loaded {len(documents)} page(s) from {pdf_path}")
        return documents

    # documents = load_documents() # Call this later
    

    Step 3: Split Documents

    Large documents need to be split into smaller chunks suitable for embedding and retrieval. The RecursiveCharacterTextSplitter attempts to split text semantically (at paragraphs, sentences, and so on) before resorting to fixed-size splits. chunk_size determines the maximum size of each chunk (in characters), and chunk_overlap specifies how many characters should overlap between consecutive chunks to maintain context.

    # rag_local.py (continued)
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    def split_documents(documents):
        """Splits documents into smaller chunks."""
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
            is_separator_regex=False,
        )
        all_splits = text_splitter.split_documents(documents)
        print(f"Split into {len(all_splits)} chunks")
        return all_splits

    # loaded_docs = load_documents()
    # chunks = split_documents(loaded_docs) # Call this later
    

    Step 4: Choose and Configure Embedding Model

    Embeddings transform text into vectors (lists of numbers) such that semantically similar text chunks have vectors that are close together in multi-dimensional space.

    Option A (Recommended for Simplicity): Ollama Embeddings

    This approach uses Ollama to serve a dedicated embedding model. nomic-embed-text is a capable open-source model available via Ollama.

    First, ensure the embedding model is pulled:

    ollama pull nomic-embed-text
    

    Then, use OllamaEmbeddings in Python:

    # rag_local.py (continued)
    from langchain_ollama import OllamaEmbeddings

    def get_embedding_function(model_name="nomic-embed-text"):
        """Initializes the Ollama embedding function."""
        # Ensure Ollama server is running (ollama serve)
        embeddings = OllamaEmbeddings(model=model_name)
        print(f"Initialized Ollama embeddings with model: {model_name}")
        return embeddings

    # embedding_function = get_embedding_function() # Call this later
    

    Option B (Alternative): Sentence Transformers

    This uses the sentence-transformers library directly within the Python script. It requires installing the library (pip install sentence-transformers) but doesn’t need a separate Ollama process for embeddings. Models like all-MiniLM-L6-v2 are fast and lightweight, while all-mpnet-base-v2 offers higher quality.

    # Alternative embedding function using Sentence Transformers
    from langchain_community.embeddings import HuggingFaceEmbeddings

    def get_embedding_function_hf(model_name="all-MiniLM-L6-v2"):
        """Initializes HuggingFace embeddings (runs locally)."""
        embeddings = HuggingFaceEmbeddings(model_name=model_name)
        print(f"Initialized HuggingFace embeddings with model: {model_name}")
        return embeddings

    # embedding_function = get_embedding_function_hf() # Use this if choosing Option B
    

    For this tutorial, we’ll use Option A (Ollama Embeddings with nomic-embed-text) to keep the toolchain consistent.
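
    If you want to see what these embeddings look like before wiring up the full pipeline, a small standalone check like the sketch below (our own helper, assuming nomic-embed-text is pulled and ollama serve is running) embeds two sentences and compares them with cosine similarity:

    # quick_embed_check.py -- illustrative only, not part of rag_local.py
    import math
    from langchain_ollama import OllamaEmbeddings

    emb = OllamaEmbeddings(model="nomic-embed-text")
    v1 = emb.embed_query("The cat sat on the mat.")
    v2 = emb.embed_query("A feline rested on the rug.")

    def cosine(a, b):
        """Cosine similarity between two equal-length vectors."""
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    print(f"Vector length: {len(v1)}")           # dimensionality of the embedding
    print(f"Similarity: {cosine(v1, v2):.3f}")   # semantically close sentences score higher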

    Step 5: Set Up Local Vector Store (ChromaDB)

    ChromaDB provides an efficient way to store and search vector embeddings locally. Using a persistent client ensures the indexed data is saved to disk and can be reloaded without re-processing the documents every time.

    # rag_local.py (continued)
    from langchain_community.vectorstores import Chroma

    CHROMA_PATH = "chroma_db" # Directory to store ChromaDB data

    def get_vector_store(embedding_function, persist_directory=CHROMA_PATH):
        """Initializes or loads the Chroma vector store."""
        vectorstore = Chroma(
            persist_directory=persist_directory,
            embedding_function=embedding_function
        )
        print(f"Vector store initialized/loaded from: {persist_directory}")
        return vectorstore

    # embedding_function = get_embedding_function()
    # vector_store = get_vector_store(embedding_function) # Call this later
    

    Step 6: Index Documents (Embed and Store)

    This is the core indexing step where document chunks are converted to embeddings and saved in ChromaDB. The Chroma.from_documents function is convenient for the initial creation and indexing. If the database already exists, subsequent additions can use vectorstore.add_documents.

    # rag_local.py (continued)

    def index_documents(chunks, embedding_function, persist_directory=CHROMA_PATH):
        """Indexes document chunks into the Chroma vector store."""
        print(f"Indexing {len(chunks)} chunks...")
        # Use from_documents for initial creation.
        # This will overwrite existing data if the directory exists but isn't a valid Chroma DB.
        # For incremental updates, initialize Chroma first and use vectorstore.add_documents().
        vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=embedding_function,
            persist_directory=persist_directory
        )
        vectorstore.persist() # Ensure data is saved
        print(f"Indexing complete. Data saved to: {persist_directory}")
        return vectorstore

    # ... (previous function calls)
    # vector_store = index_documents(chunks, embedding_function) # Call this for initial indexing
    

    To load an existing persistent database later:

    embedding_function = get_embedding_function()
    vector_store = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)
    

    Step 7: Build the RAG Chain

    Now, assemble the components into a LangChain Expression Language (LCEL) chain. This involves initializing the Qwen 3 LLM via Ollama, creating a retriever from the vector store, defining a suitable prompt, and chaining them together.

    A critical parameter when initializing ChatOllama for RAG is num_ctx. This defines the context window size (in tokens) that the LLM can handle. Ollama’s default (often 2048 or 4096 tokens) might be too small to accommodate both the retrieved document context and the user’s query/prompt.

    Qwen 3 models (8B and larger) support much larger context windows (for example, 128k tokens), but practical limits depend on your available RAM/VRAM. Setting num_ctx to a value like 8192 or higher is often necessary for effective RAG.
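
    As a rough back-of-the-envelope check (assuming roughly 4 characters per token, which is only an approximation for English text), you can estimate whether the retrieved context will fit in the window:

    # Rough context budget estimate for the settings used in this tutorial (approximate)
    chunk_size_chars = 1000   # from the text splitter
    retrieved_chunks = 3      # 'k' in the retriever settings below
    chars_per_token = 4       # crude approximation

    context_tokens = retrieved_chunks * chunk_size_chars / chars_per_token   # ~750 tokens
    prompt_and_question_tokens = 200                                         # generous guess
    print(context_tokens + prompt_and_question_tokens)   # ~950 tokens, comfortably under num_ctx=8192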

    # rag_local.py (continued)
    from langchain_ollama import ChatOllama
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnablePassthrough
    from langchain_core.output_parsers import StrOutputParser

    def create_rag_chain(vector_store, llm_model_name="qwen3:8b", context_window=8192):
        """Creates the RAG chain."""
        # Initialize the LLM
        llm = ChatOllama(
            model=llm_model_name,
            temperature=0, # Lower temperature for more factual RAG answers
            num_ctx=context_window # IMPORTANT: Set context window size
        )
        print(f"Initialized ChatOllama with model: {llm_model_name}, context window: {context_window}")

        # Create the retriever
        retriever = vector_store.as_retriever(
            search_type="similarity", # Or "mmr"
            search_kwargs={'k': 3} # Retrieve top 3 relevant chunks
        )
        print("Retriever initialized.")

        # Define the prompt template
        template = """Answer the question based ONLY on the following context:
    {context}

    Question: {question}
    """
        prompt = ChatPromptTemplate.from_template(template)
        print("Prompt template created.")

        # Define the RAG chain using LCEL
        rag_chain = (
            {"context": retriever, "question": RunnablePassthrough()}
            | prompt
            | llm
            | StrOutputParser()
        )
        print("RAG chain created.")
        return rag_chain

    # ... (previous function calls)
    # vector_store = get_vector_store(embedding_function) # Assuming DB is already indexed
    # rag_chain = create_rag_chain(vector_store) # Call this later
    

    The effectiveness of the RAG system hinges on the proper configuration of each component. The chunk_size and chunk_overlap in the splitter affect what the retriever finds. Your choice of embedding_function must be consistent between indexing and querying. The num_ctx parameter for the ChatOllama LLM must be large enough to hold the retrieved context and the prompt itself. A poorly designed prompt template can also lead the LLM astray. Make sure you carefully tune these elements for optimal performance.
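
    When answers look off, it often helps to inspect what the retriever is actually returning before blaming the LLM. Here is a small debugging sketch (the helper name is ours) that prints the chunks the retriever would hand to the LLM:

    # Illustrative debugging helper -- not part of the main pipeline
    def show_retrieved_chunks(vector_store, question, k=3):
        """Prints the chunks the retriever would pass to the LLM for a given question."""
        retriever = vector_store.as_retriever(search_kwargs={"k": k})
        docs = retriever.invoke(question)
        for i, doc in enumerate(docs, start=1):
            print(f"--- Chunk {i} (source: {doc.metadata.get('source', 'unknown')}) ---")
            print(doc.page_content[:300], "...")

    # show_retrieved_chunks(vector_store, "What is the main topic of the document?")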

    Step 8: Query Your Documents

    Finally, invoke the RAG chain with a question related to the content of the indexed PDF.

    # rag_local.py (continued)

    def query_rag(chain, question):
        """Queries the RAG chain and prints the response."""
        print("\nQuerying RAG chain...")
        print(f"Question: {question}")
        response = chain.invoke(question)
        print("\nResponse:")
        print(response)

    # --- Main Execution ---
    if __name__ == "__main__":
        # 1. Load Documents
        docs = load_documents()

        # 2. Split Documents
        chunks = split_documents(docs)

        # 3. Get Embedding Function
        embedding_function = get_embedding_function() # Using Ollama nomic-embed-text

        # 4. Index Documents (Only needs to be done once per document set)
        # Check if DB exists, if not, index. For simplicity, we might re-index here.
        # A more robust approach would check if indexing is needed.
        print("Attempting to index documents...")
        vector_store = index_documents(chunks, embedding_function)
        # To load existing DB instead:
        # vector_store = get_vector_store(embedding_function)

        # 5. Create RAG Chain
        rag_chain = create_rag_chain(vector_store, llm_model_name="qwen3:8b") # Use the chosen Qwen 3 model

        # 6. Query
        query_question = "What is the main topic of the document?" # Replace with a specific question
        query_rag(rag_chain, query_question)

        query_question_2 = "Summarize the introduction section." # Another example
        query_rag(rag_chain, query_question_2)
    

    Run the complete script (python rag_local.py). Make sure that the ollama serve command is running in another terminal. The script will load the PDF, split it, embed the chunks using nomic-embed-text via Ollama, store them in ChromaDB, build the RAG chain using qwen3:8b via Ollama, and finally execute the queries. It’ll print the LLM’s responses based on the document content.

    How to Create Local AI Agents with Qwen 3

    Beyond answering questions based on provided text, LLMs can act as the reasoning engine for AI agents. Agents can plan sequences of actions, interact with external tools (like functions or APIs), and work towards accomplishing more complex goals assigned by the user.

    Qwen 3 models were specifically designed with strong tool-calling and agentic capabilities. While Alibaba provides its own Qwen-Agent framework, this tutorial will continue using LangChain for consistency and because its Ollama integration for agent tasks is well documented.

    We will build a simple agent that can use a custom Python function as a tool.

    Step 1: Define Custom Tools

    Tools are standard Python functions that the agent can choose to execute. The function’s docstring is crucial, as the LLM uses it to understand what the tool does and what arguments it requires. LangChain’s @tool decorator simplifies wrapping functions for agent use.

    # agent_local.py
    import os
    from dotenv import load_dotenv
    from langchain.agents import tool
    import datetime

    load_dotenv() # Optional

    @tool
    def get_current_datetime(format: str = "%Y-%m-%d %H:%M:%S") -> str:
        """
        Returns the current date and time, formatted according to the provided Python strftime format string.
        Use this tool whenever the user asks for the current date, time, or both.
        Example format strings: '%Y-%m-%d' for date, '%H:%M:%S' for time.
        If no format is specified, defaults to '%Y-%m-%d %H:%M:%S'.
        """
        try:
            return datetime.datetime.now().strftime(format)
        except Exception as e:
            return f"Error formatting date/time: {e}"

    # List of tools the agent can use
    tools = [get_current_datetime]
    print("Custom tool defined.")
    

    Step 2: Set Up the Agent LLM

    Instantiate the ChatOllama model again, using a Qwen 3 variant suitable for tool calling. The qwen3:8b model should be capable of handling simple tool use cases.

    It’s important to note that tool calling reliability with local models served via Ollama can sometimes be less consistent than with large commercial APIs like GPT-4 or Claude. The LLM might fail to recognize when a tool is needed, hallucinate arguments, or misinterpret the tool’s output. Starting with clear prompts and simple tools is recommended.

    # agent_local.py (continued)
    from langchain_ollama import ChatOllama

    def get_agent_llm(model_name="qwen3:8b", temperature=0):
        """Initializes the ChatOllama model for the agent."""
        # Ensure Ollama server is running (ollama serve)
        llm = ChatOllama(
            model=model_name,
            temperature=temperature # Lower temperature for more predictable tool use
            # Consider increasing num_ctx if expecting long conversations or complex reasoning
            # num_ctx=8192
        )
        print(f"Initialized ChatOllama agent LLM with model: {model_name}")
        return llm

    # agent_llm = get_agent_llm() # Call this later
    

    Step 3: Create the Agent Prompt

    Agents require specific prompt structures that guide their reasoning and tool use. The prompt typically includes placeholders for user input (input), conversation history (chat_history), and the agent_scratchpad. The scratchpad is where the agent records its internal “thought” process, the tools it decides to call, and the results (observations) it gets back from those tools. LangChain Hub provides pre-built prompts suitable for tool-calling agents.

    # agent_local.py (continued)
    from langchain import hub

    def get_agent_prompt(prompt_hub_name="hwchase17/openai-tools-agent"):
        """Pulls the agent prompt template from LangChain Hub."""
        # This prompt is designed for OpenAI but often works well with other tool-calling models.
        # Alternatively, define a custom ChatPromptTemplate.
        prompt = hub.pull(prompt_hub_name)
        print(f"Pulled agent prompt from Hub: {prompt_hub_name}")
        # print("Prompt Structure:")
        # prompt.pretty_print() # Uncomment to see the prompt structure
        return prompt

    # agent_prompt = get_agent_prompt() # Call this later
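
    If you'd rather not depend on LangChain Hub, the comment above hints at defining the prompt yourself. A minimal custom template sketch (assuming the standard input, chat_history, and agent_scratchpad placeholders that create_tool_calling_agent expects) could look like this:

    # Alternative to hub.pull(): define the agent prompt locally (sketch)
    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

    custom_agent_prompt = ChatPromptTemplate.from_messages([
        ("system", "You are a helpful assistant. Use the provided tools when they are relevant."),
        MessagesPlaceholder("chat_history", optional=True),   # prior conversation turns, if any
        ("human", "{input}"),                                  # the user's request
        MessagesPlaceholder("agent_scratchpad"),               # tool calls and their results
    ])

    # agent_prompt = custom_agent_prompt  # use in place of get_agent_prompt()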
    

    Step 4: Build the Agent

    The create_tool_calling_agent function combines the LLM, the defined tools, and the prompt into a runnable unit that represents the agent’s core logic.

    # agent_local.py (continued)
    from langchain.agents import create_tool_calling_agent

    def build_agent(llm, tools, prompt):
        """Builds the tool-calling agent runnable."""
        agent = create_tool_calling_agent(llm, tools, prompt)
        print("Agent runnable created.")
        return agent

    # agent_runnable = build_agent(agent_llm, tools, agent_prompt) # Call this later
    

    Step 5: Create the Agent Executor

    The AgentExecutor is responsible for running the agent loop. It takes the agent runnable and the tools, invokes the agent with the input, parses the agent’s output (which could be a final answer or a tool call request), executes any requested tool calls, and feeds the results back to the agent until a final answer is reached. Setting verbose=True is highly recommended during development to observe the agent’s step-by-step execution flow.

    # agent_local.py (continued)
    from langchain.agents import AgentExecutor

    def create_agent_executor(agent, tools):
        """Creates the agent executor."""
        agent_executor = AgentExecutor(
            agent=agent,
            tools=tools,
            verbose=True # Set to True to see agent thoughts and tool calls
        )
        print("Agent executor created.")
        return agent_executor

    # agent_executor = create_agent_executor(agent_runnable, tools) # Call this later
    

    Step 6: Run the Agent

    Invoke the agent executor with a user query that should trigger the use of the defined tool.

    # agent_local.py (continued)

    def run_agent(executor, user_input):
        """Runs the agent executor with the given input."""
        print("\nInvoking agent...")
        print(f"Input: {user_input}")
        response = executor.invoke({"input": user_input})
        print("\nAgent Response:")
        print(response['output'])

    # --- Main Execution ---
    if __name__ == "__main__":
        # 1. Define Tools (already done above)

        # 2. Get Agent LLM
        agent_llm = get_agent_llm(model_name="qwen3:8b") # Use the chosen Qwen 3 model

        # 3. Get Agent Prompt
        agent_prompt = get_agent_prompt()

        # 4. Build Agent Runnable
        agent_runnable = build_agent(agent_llm, tools, agent_prompt)

        # 5. Create Agent Executor
        agent_executor = create_agent_executor(agent_runnable, tools)

        # 6. Run Agent
        run_agent(agent_executor, "What is the current date?")
        run_agent(agent_executor, "What time is it right now? Use HH:MM format.")
        run_agent(agent_executor, "Tell me a joke.") # Should not use the tool
    

    Running python agent_local.py (with ollama serve active) will execute the agent. The verbose=True setting will print output resembling the ReAct (Reasoning and Acting) framework, showing the agent’s internal “Thoughts” on how to proceed, the “Action” it decides to take (calling a specific tool with arguments), and the “Observation” (the result returned by the tool).

    Building reliable agents with local models presents unique challenges. The LLM’s ability to correctly interpret the prompt, understand when to use tools, select the right tool, generate valid arguments, and process the tool’s output is critical.

    Local models, especially smaller or heavily quantized ones, might struggle with these reasoning steps compared to larger, cloud-based counterparts. If the qwen3:8b model proves unreliable for more complex agentic tasks, consider trying qwen3:14b or the efficient qwen3:30b-a3b if hardware permits.

    For highly complex or stateful agent workflows, exploring frameworks like LangGraph, which offers more control over the agent’s execution flow, might be beneficial.

    Advanced Considerations and Troubleshooting

    Running LLMs locally offers great flexibility but also introduces specific configuration aspects and potential issues.

    Controlling Qwen 3’s Thinking Mode with Ollama

    Qwen 3’s unique hybrid inference allows switching between a deep “thinking” mode for complex reasoning and a faster “non-thinking” mode for general chat. While frameworks like Hugging Face Transformers or vLLM might offer explicit parameters (enable_thinking), the primary way to control this when using Ollama appears to be through “soft switches” embedded in the prompt.

    Append /think to the end of a user prompt to encourage step-by-step reasoning, or /no_think to request a faster, direct response. You can do this via the Ollama CLI or potentially within the prompts sent via the API/LangChain.

    # Example using LangChain's ChatOllama
    from langchain_ollama import ChatOllama

    llm_think = ChatOllama(model="qwen3:8b")
    llm_no_think = ChatOllama(model="qwen3:8b") # Could also set system prompt

    # Invoke with prompt modification
    response_think = llm_think.invoke("Solve the equation 2x + 5 = 15 /think")
    print("Thinking Response:", response_think)

    response_no_think = llm_no_think.invoke("What is the capital of France? /no_think")
    print("Non-Thinking Response:", response_no_think)

    # Alternatively, set via system message (might be less reliable turn-by-turn)
    llm_system_no_think = ChatOllama(model="qwen3:8b", system="/no_think")
    response_system = llm_system_no_think.invoke("What is 2+2?")
    print("System No-Think Response:", response_system)
    

    Note that the persistence of these tags across multiple turns in a conversation might require careful prompt management.

    Managing Context Length (num_ctx)

    The context window (num_ctx) determines how much information (prompt, history, retrieved documents) the LLM can consider at once. Qwen 3 models (8B+) support large native context lengths (for example, 128k tokens), but Ollama often defaults to a much smaller window (like 2048 or 4096). For RAG or conversations requiring memory of earlier turns, this default is often insufficient.

    Set num_ctx when initializing ChatOllama or OllamaLLM in LangChain:

    # Example setting context window to 8192 tokens
    llm = ChatOllama(model="qwen3:8b", num_ctx=8192)
    

    Be mindful that larger num_ctx values significantly increase RAM and VRAM consumption, while setting it too low can lead to the model “forgetting” context or even entering repetitive loops. Choose a value that balances the task requirements with hardware capabilities.
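
    If you'd rather not set num_ctx in every script, one option (a sketch; the model name qwen3-8b-8k is our own choice) is to bake the parameter into a custom model variant via an Ollama Modelfile:

    # Modelfile -- a variant of qwen3:8b with a larger default context window
    FROM qwen3:8b
    PARAMETER num_ctx 8192

    Then build and use the variant:

    ollama create qwen3-8b-8k -f Modelfile
    ollama run qwen3-8b-8k   # or reference model="qwen3-8b-8k" in ChatOllama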

    Hardware Limitations and VRAM

    Running LLMs locally is resource-intensive.

    • VRAM: A dedicated GPU (NVIDIA) or Apple Silicon with sufficient VRAM or unified memory is highly recommended for acceptable performance. The amount of memory dictates the largest model size that can run efficiently; the larger Qwen 3 variants discussed in the model-selection section need proportionally more of it.

    • RAM: System RAM is also crucial, especially if the model doesn’t fit entirely in VRAM. Ollama can utilize system RAM as a fallback, but this is significantly slower.

    • Quantization: Ollama typically serves quantized models (for example, 4-bit or 5-bit), which reduce model size and VRAM requirements significantly compared to full-precision models, often with minimal quality loss for many tasks. Note that tags like :4b and :8b indicate parameter counts; the default downloads are usually already quantized (commonly 4-bit).

    If performance is slow or errors occur due to resource constraints, consider:

    • Using a smaller Qwen 3 model (like 4B instead of 8B).

    • Ensuring Ollama is correctly detecting and utilizing the GPU (check Ollama logs or system monitoring tools; see the quick check below).

    • Closing other resource-intensive applications.
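
    A quick way to check whether a loaded model is actually running on the GPU (column names may vary slightly between Ollama versions) is to run ollama ps while a model is loaded:

    # Shows currently loaded models, their size, and how much of each runs on GPU vs. CPU
    ollama ps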

    Conclusion and Next Steps

    This tutorial gave you a practical walkthrough for setting up your local AI environment using the powerful and open Qwen 3 LLM family with the user-friendly Ollama tool.

    If you’ve followed these steps, you should have successfully:

    1. Installed Ollama and downloaded/run Qwen 3 models locally.

    2. Built a functional Retrieval-Augmented Generation (RAG) pipeline using LangChain and ChromaDB to query local documents.

    3. Created a basic AI agent capable of reasoning and utilizing custom Python tools.

    Running these systems locally unlocks significant advantages in privacy, cost, and customization, making advanced AI capabilities more accessible than ever. The combination of Qwen 3’s performance and open license with Ollama’s ease of use creates a potent platform for experimentation and development.

    Official Resources:

    • Qwen 3: GitHub, Documentation

    • Ollama: Website, Model Library, GitHub

    • LangChain: Python Documentation

    • ChromaDB: Documentation

    • Sentence Transformers: Documentation

    By leveraging these powerful, free, and open-source tools, you can continue to push the boundaries of what’s possible with AI running directly on your own hardware.

    Source: freeCodeCamp Programming Tutorials: Python, JavaScript, Git & More 
