    A practical guide to modern document parsing

    September 6, 2025

    Here in 2025, document processing systems are more sophisticated than ever, yet the old principle ‘Garbage In, Garbage Out’ (GIGO) remains critically relevant. Organizations investing heavily in Retrieval-Augmented Generation (RAG) systems and fine-tuned LLMs often overlook a fundamental bottleneck: data quality at the source.

    Before any AI system can deliver intelligent responses, the unstructured data from PDFs, invoices, and contracts must be accurately converted into structured formats that models can process. Document parsing—this often-overlooked first step—can make or break your entire AI pipeline. At Nanonets, we’ve observed how seemingly minor parsing errors cascade into major production failures.

    This guide focuses on getting that foundational step right. We’ll explore modern document parsing in depth, moving beyond the hype to practical insights: from legacy OCR to intelligent, layout-aware AI, the components of robust data pipelines, and how to choose the right tools for your specific needs.


    What document parsing is, really

    Document parsing transforms unstructured or semi-structured documents into structured data. It converts documents like PDF invoices or scanned contracts into machine-readable formats such as JSON or CSV files.

    Instead of just having a flat image or a wall of text, you get organized, usable data like this:

    • invoice_number: “INV-AJ355548”
    • invoice_date: “09/07/1992”
    • total_amount: 1500.00
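
    Serialized as JSON, that same record would look like this:

    {
      "invoice_number": "INV-AJ355548",
      "invoice_date": "09/07/1992",
      "total_amount": 1500.00
    }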

    Understanding how parsing fits with related technologies is crucial, as they work together in sequence:

    • Optical Character Recognition (OCR) forms the foundation by converting printed and handwritten text from images into machine-readable data.
    • Document parsing analyzes the document’s content and layout after OCR digitizes the text, identifying and extracting specific, relevant information and structuring it into usable formats like tables or key-value pairs.
    • Data extraction is the broader term for the overall process. Parsing is a specialized type of data extraction that focuses on understanding structure and context to extract specific fields.
    • Natural Language Processing (NLP) allows the system to understand the meaning and grammar of extracted text, such as identifying “Wayne Enterprises” as an organization or recognizing that “Due in 30 days” is a payment term.

    A modern document parsing tool intelligently combines all these technologies, not just to read, but to understand documents.


    The evolution of parsing

    Document parsing isn’t new, but it has certainly grown in sophistication. Let’s look at how the fundamental philosophies behind it have evolved over the past few decades.

    a. The modular pipeline approach

    The traditional approach to document processing relies on a modular, multi-stage pipeline where documents pass sequentially from one specialized tool to the next:

    1. Document Layout Analysis (DLA) uses computer vision models to detect the physical layout and draw bounding boxes around text blocks, tables, and images.
    2. OCR converts the pixels within each bounding box into character strings.
    3. Data structuring uses rules-based systems or scripts to stitch disparate information back together into coherent, structured output.

    The fundamental flaw of this pipeline is the lack of shared context. An error at any stage—a misidentified layout block or poorly read character—cascades down the line and corrupts the final output.
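
    To make the stages concrete, here is a minimal sketch of such a pipeline, assuming pytesseract and Pillow are installed; pytesseract’s built-in block numbering stands in for a dedicated layout model, invoice.png is a placeholder, and the regex is exactly the kind of brittle rule the structuring stage relies on:

    import re
    import pytesseract
    from PIL import Image

    # Stages 1 and 2: layout detection plus OCR. image_to_data returns
    # word-level boxes tagged with block and line numbers, a rough
    # stand-in for a dedicated layout-analysis model.
    image = Image.open("invoice.png")  # placeholder input
    ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    # Stitch words back into lines using the reported block/line ids.
    lines = {}
    for word, block, line in zip(ocr["text"], ocr["block_num"], ocr["line_num"]):
        if word.strip():
            lines.setdefault((block, line), []).append(word)
    full_text = "\n".join(" ".join(words) for words in lines.values())

    # Stage 3: rules-based structuring. Brittle by design: one misread
    # character upstream and the rule silently returns nothing.
    match = re.search(r"Invoice\s*(?:#|No\.?|Number)[:\s]*(\S+)", full_text, re.I)
    print(match.group(1) if match else "invoice number not found")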

    b. The machine learning and AI-driven approach

    The next leap forward introduced machine learning. Instead of relying on fixed coordinates, AI models trained on thousands of examples recognize data based on context, much like humans do. For example, a model learns that a date following “Invoice Date” is probably the invoice_date, regardless of where it appears on the page.

    This approach enabled pre-trained models that understand common documents like invoices, receipts, and purchase orders out of the box. For unique documents, you can create custom models by providing just 10-15 training examples. The AI learns patterns and accurately extracts data from new, unseen layouts.

    c. The VLM end-to-end approach

    Today’s cutting-edge approach uses Vision-Language Models (VLMs), which represent a fundamental shift by processing a document’s visual information (layout, images, tables) and textual content simultaneously within a single, unified model.

    Unlike previous methods that detect a box and then run OCR on the text inside, VLMs understand that the pixels forming a table’s shape are directly related to the text constituting its rows and columns. This integrated approach finally bridges the “semantic gap” between how humans see documents and how machines process them.

    Key capabilities enabled by VLMs include:

    • End-to-end processing: VLMs can perform an entire parsing job in one step. They can look at a document image and directly generate a structured output (like Markdown or JSON) without needing a separate pipeline of layout analysis, OCR, and relation extraction modules.
    • True layout and content understanding: Because they process vision and text together, they can accurately interpret complex layouts with multiple columns, handle tables that span pages, and correctly associate captions with their corresponding images. Traditional OCR, by contrast, often treats documents as flat text, losing crucial structural information.
    • Semantic tagging: A VLM can go beyond just extracting text. Our open-source Nanonets-OCR-s model, for example, can identify and specifically tag different types of content, such as <equations>, <signatures>, <table>, and <watermarks>, because it understands the unique visual characteristics of these elements.
    • Zero-shot performance: Because VLMs have a generalized understanding of what documents look like, they can often extract information from a document format they have never been specifically trained on. With Nanonets’ zero-shot models, you can provide a clear description of a field, and the AI uses its intelligence to find it without any initial training data.
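
    As a sketch of what the zero-shot pattern looks like in practice, here it is expressed with the OpenAI Python client standing in for any hosted VLM; the model name and contract.png are placeholders, and the plain-language field description in the prompt is doing the work:

    import base64
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    with open("contract.png", "rb") as f:  # placeholder document image
        image_b64 = base64.b64encode(f.read()).decode()

    # Zero-shot extraction: describe the field in plain language instead
    # of training on examples. The model name is a placeholder for any
    # vision-capable chat model.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Find the final date by which payment must be made. "
                    'Return JSON: {"due_date": "YYYY-MM-DD"}'
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)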

    Choosing your document parsing tools

    The question we see constantly on developer forums is: “I have 50K pages with tables, text, images… what’s the best document parser available right now?” The answer depends on what you need, but let’s look at the leading options across different categories.

    a. Open-source libraries

    1. PyMuPDF/PyPDF are praised for speed and efficiency in extracting raw text and metadata from digitally-native PDFs. They excel at simple text retrieval but offer little structural understanding.
    2. Unstructured.io is a modern library handling various document types, employing multiple techniques to extract and structure information from text, tables, and layouts.
    3. Marker is highlighted for high-quality PDF-to-Markdown conversion, making it excellent for RAG pipelines, though its license may concern commercial users.
    4. Docling provides a powerful, comprehensive solution by IBM for parsing and converting documents into multiple formats, though it’s compute-intensive and often requires GPU acceleration.
    5. Surya focuses specifically on text detection and layout analysis, representing a key component in modular pipeline approaches.
    6. DocStrange is a versatile Python library designed for developers needing both convenience and control. It extracts and converts data from any document type (PDFs, Word docs, images) into clean Markdown or JSON. It uniquely offers both free cloud processing for instant results and 100% local processing for privacy-sensitive use cases.
    7. Nanonets-OCR-s is an open-source Vision-Language Model that goes far beyond traditional text extraction by understanding document structure and content context. It intelligently recognizes and tags complex elements like tables, LaTeX equations, images, signatures, and watermarks, making it ideal for building sophisticated, context-aware parsing pipelines.

    These libraries offer maximum control and flexibility for developers building completely custom solutions. However, they require significant development and maintenance effort, and you’re responsible for the entire workflow—from hosting and OCR to data validation and integration.

    b. Commercial platforms

    For businesses needing reliable, scalable, secure solutions without dedicating development teams to the task, commercial platforms provide end-to-end solutions with minimal setup, user-friendly interfaces, and managed infrastructure.

    Platforms such as Nanonets, Docparser, and Azure Document Intelligence offer complete, managed services. While accuracy, functionality, and automation levels vary between services, they generally bundle core parsing technology with complete workflow suites, including automated importing, AI-powered validation rules, human-in-the-loop interfaces for approvals, and pre-built integrations for exporting data to business software.

    Pros of commercial platforms:

    • Ready to use out of the box with intuitive, no-code interfaces
    • Managed infrastructure, enterprise-grade security, and dedicated support
    • Full workflow automation, saving significant development time

    Cons of commercial platforms:

    • Subscription costs
    • Less customization flexibility

    Best for: Businesses wanting to focus on core operations rather than building and maintaining data extraction pipelines.

    Understanding these options helps inform the decision between building custom solutions and using managed platforms. Let’s now explore how to implement a custom solution with a practical tutorial.


    Getting started with document parsing using DocStrange

    Modern libraries like DocStrange and others provide the building blocks you need. Most follow a similar pattern: initialize an extractor, point it at your documents, and get clean, structured output that works seamlessly with AI frameworks.

    Let’s look at a few examples:

    Prerequisites

    Before starting, ensure you have:

    • Python 3.8 or higher installed on your system
    • A sample document (e.g., report.pdf) in your working directory
    • Required libraries, installed with the commands below

    For local processing, you’ll also need to install and run Ollama:

    pip install docstrange langchain sentence-transformers faiss-cpu
    # For local processing with enhanced JSON extraction:
    pip install 'docstrange[local-llm]'
    # Install Ollama from https://ollama.com
    ollama serve
    ollama pull llama3.2

    Note: Local processing requires significant computational resources and Ollama for enhanced extraction. Cloud processing works immediately without additional setup.

    a. Parse the document into clean markdown

    from docstrange import DocumentExtractor
    
    # Initialize extractor (cloud mode by default)
    extractor = DocumentExtractor()
    
    # Convert any document to clean markdown
    result = extractor.extract("document.pdf")
    markdown = result.extract_markdown()
    print(markdown)

    b. Convert multiple file types

    from docstrange import DocumentExtractor
    
    extractor = DocumentExtractor()
    
    # PDF document
    pdf_result = extractor.extract("report.pdf")
    print(pdf_result.extract_markdown())
    
    # Word document  
    docx_result = extractor.extract("document.docx")
    print(docx_result.extract_data())
    
    # Excel spreadsheet
    excel_result = extractor.extract("data.xlsx")
    print(excel_result.extract_csv())
    
    # PowerPoint presentation
    pptx_result = extractor.extract("slides.pptx")
    print(pptx_result.extract_html())
    
    # Image with text
    image_result = extractor.extract("screenshot.png")
    print(image_result.extract_text())
    
    # Web page
    url_result = extractor.extract("https://example.com")
    print(url_result.extract_markdown())

    c. Extract specific fields and structured data

    # Extract specific fields from any document
    result = extractor.extract("invoice.pdf")
    
    # Method 1: Extract specific fields
    extracted = result.extract_data(specified_fields=[
        "invoice_number", 
        "total_amount", 
        "vendor_name",
        "due_date"
    ])
    
    # Method 2: Extract using JSON schema
    schema = {
        "invoice_number": "string",
        "total_amount": "number", 
        "vendor_name": "string",
        "line_items": [{
            "description": "string",
            "amount": "number"
        }]
    }
    
    structured = result.extract_data(json_schema=schema)
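
    d. Feed parsed output into a retrieval index

    The prerequisites above also installed langchain, sentence-transformers, and faiss-cpu. As a minimal sketch of where parsing fits in a RAG pipeline, you can chunk the markdown DocStrange produces and index it for semantic search (on newer LangChain releases these imports live in langchain_community and langchain_text_splitters):

    from docstrange import DocumentExtractor
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import HuggingFaceEmbeddings
    from langchain.vectorstores import FAISS

    # Parse the document into clean markdown, then chunk it for retrieval.
    markdown = DocumentExtractor().extract("report.pdf").extract_markdown()
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=150
    ).split_text(markdown)

    # Embed the chunks locally and index them in FAISS.
    embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
    index = FAISS.from_texts(chunks, embeddings)

    # Retrieve the passages most relevant to a question.
    for doc in index.similarity_search("What was the total amount invoiced?", k=3):
        print(doc.page_content[:200])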

    You can find more examples like these in the DocStrange documentation.


    A modern document parsing workflow in action

    Discussing tools and technologies in the abstract is one thing, but seeing how they solve a real-world problem is another. To make this more concrete, let’s walk through what a modern, end-to-end workflow actually looks like when you use a managed platform.

    Step 1: Import documents from anywhere

    The workflow begins the moment a document is created. The goal is to ingest it automatically, without human intervention. A robust platform should allow you to import documents from the sources you already use:

    • Email: You can set up an auto-forwarding rule to send all attachments from an address like invoices@yourcompany.com directly to a dedicated Nanonets email address for that workflow.
    • Cloud Storage: Connect folders in Google Drive, Dropbox, OneDrive, or SharePoint so that any new file added is automatically picked up for processing.
    • API: For full integration, you can push documents directly from your existing software portals into the workflow programmatically.
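
    As a sketch of the API route, ingesting a document is typically a single multipart POST; the endpoint and auth scheme below are illustrative placeholders, not the actual Nanonets API:

    import requests

    # Hypothetical endpoint and key for illustration only; consult your
    # platform's API docs for the real URL, auth scheme, and payload.
    API_URL = "https://api.example.com/v1/workflows/invoices/documents"
    API_KEY = "your-api-key"

    with open("invoice.pdf", "rb") as f:
        response = requests.post(
            API_URL,
            auth=(API_KEY, ""),
            files={"file": ("invoice.pdf", f, "application/pdf")},
        )
    print(response.status_code, response.json())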

    Step 2: Intelligent data capture and enrichment

    Once a document arrives, the AI model gets to work. This isn’t just basic OCR; the AI analyzes the document’s layout and content to extract the fields you’ve defined. For an invoice, a pre-trained model like the Nanonets Invoice Model can instantly capture dozens of standard fields, from the seller_name and buyer_address to complex line items in a table.

    But modern systems go beyond simple extraction. They also enrich the data. For instance, the system can add a confidence score to each extracted field, letting you know how certain the AI is about its accuracy. This is crucial for building trust in the automation process.
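
    Those confidence scores are what let you decide which documents can flow straight through. A sketch of how they get used downstream, with an illustrative result shape rather than any specific vendor schema:

    # Route each parsed field based on the model's confidence score.
    # The result shape here is illustrative, not a vendor schema.
    REVIEW_THRESHOLD = 0.90

    parsed = {
        "seller_name": ("Wayne Enterprises", 0.99),
        "invoice_date": ("2025-08-19", 0.97),
        "total_amount": ("1500.00", 0.72),  # low confidence: blurry scan
    }

    needs_review = {
        field: value
        for field, (value, confidence) in parsed.items()
        if confidence < REVIEW_THRESHOLD
    }

    if needs_review:
        print("Flag for human review:", needs_review)
    else:
        print("Straight-through processing OK")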

    Step 3: Validate and approve with a human in the loop

    No AI is perfect, which is why a “human-in-the-loop” is essential for trust and accuracy, especially in high-stakes environments like finance and legal. This is where Approval Workflows come in. You can set up custom rules to flag documents for manual review, creating a safety net for your automation. For example:

    • Flag if invoice_amount is greater than $5,000.
    • Flag if vendor_name does not match an entry in your pre-approved vendor database.
    • Flag if the document is a suspected duplicate.

    If a rule is triggered, the document is automatically assigned to the right team member for a quick review. They can make corrections with a simple point-and-click interface. With Nanonets’ Instant Learning models, the AI learns from these corrections immediately, improving its accuracy for the very next document without needing a complete retraining cycle.
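
    Expressed in code, the three example rules above amount to something like this sketch (the vendor set, seen-invoice set, and invoice dict are illustrative stand-ins for your workflow’s data):

    # Illustrative approval rules mirroring the examples above.
    APPROVED_VENDORS = {"Wayne Enterprises", "Acme Corp"}
    seen_invoice_numbers = {"INV-AJ355548"}

    def review_reasons(invoice):
        reasons = []
        if invoice["invoice_amount"] > 5000:
            reasons.append("amount exceeds $5,000")
        if invoice["vendor_name"] not in APPROVED_VENDORS:
            reasons.append("vendor not in pre-approved database")
        if invoice["invoice_number"] in seen_invoice_numbers:
            reasons.append("suspected duplicate")
        return reasons

    invoice = {"invoice_number": "INV-AJ355549",
               "vendor_name": "Wayne Enterprises",
               "invoice_amount": 7200.00}

    reasons = review_reasons(invoice)
    if reasons:
        print("Route to reviewer:", "; ".join(reasons))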

    Step 4: Export to your systems of record

    After the data is captured and verified, it needs to go where the work gets done. The final step is to export the structured data. This can be a direct integration with your accounting software, such as QuickBooks or Xero, your ERP, or another system via API. You can also export the data as a CSV, XML, or JSON file and send it to a destination of your choice. With webhooks, you can be notified in real-time as soon as a document is processed, triggering actions in thousands of other applications.
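
    On the receiving end, a webhook consumer can be as small as a single Flask route; a sketch, with the payload shape assumed for illustration rather than taken from any particular platform:

    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/webhook/document-processed", methods=["POST"])
    def document_processed():
        # Payload shape is an assumption for illustration; check your
        # platform's webhook docs for the actual schema.
        event = request.get_json()
        print("Document processed:", event.get("document_id"))
        # e.g., push the extracted fields into your ERP here
        return {"status": "received"}, 200

    if __name__ == "__main__":
        app.run(port=8000)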


    Overcoming the toughest parsing challenges

    While workflows sound straightforward for clean documents, reality is often messier—the most significant modern challenges in document parsing stem from inherent AI model limitations rather than documents themselves.

    Challenge 1: The context window bottleneck

    Vision-Language Models have finite “attention” spans. Processing high-resolution, text-dense A4 pages is akin to reading a newspaper through a straw—models can only “see” small patches at a time, losing the global context. This issue worsens with long documents, such as 50-page legal contracts, where models struggle to hold entire documents in memory and understand cross-page references.

    Solution: Sophisticated chunking and context management. Modern systems use preliminary layout analysis to identify semantically related sections and employ models designed explicitly for multi-page understanding. Advanced platforms handle this complexity behind the scenes, managing how long documents are chunked and contextualized to preserve cross-page relationships.
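
    As a toy illustration of the overlap idea, you can split a long document into overlapping page windows so that content spanning a page break is never cut cleanly in two:

    # Split a long document into overlapping page windows so content that
    # spans a page break always appears intact in at least one chunk.
    def chunk_pages(pages, window=4, overlap=1):
        step = window - overlap
        return [pages[i:i + window] for i in range(0, len(pages), step)]

    pages = [f"page {n} text" for n in range(1, 11)]  # a 10-page document
    for chunk in chunk_pages(pages):
        print(f"{len(chunk)} pages: {chunk[0]} ... {chunk[-1]}")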

    Real-world success: StarTex, the company behind the EHS Insight compliance system, needed to digitize millions of chemical Safety Data Sheets (SDSs). These documents are often 10-20 pages long and information-heavy, making them classic multi-page parsing challenges. By using advanced parsing systems to process entire documents while maintaining context across all pages, they reduced processing time from 10 minutes to just 10 seconds.

    “We had to create a database with millions of documents from vendors across the world; it would be impossible for us to capture the required fields manually.” — Eric Stevens, Co-founder & CTO.

    Challenge 2: The semantic vs. literal extraction dilemma

    Accurately extracting text like “August 19, 2025” isn’t enough. The critical task is understanding its semantic role. Is it an invoice_date, due_date, or shipping_date? This lack of true semantic understanding causes major errors in automated bookkeeping.

    Solution: Integration of LLM reasoning capabilities into VLM architecture. Modern parsers use surrounding text and layout as evidence to infer correct semantic labels. Zero-shot models exemplify this approach — you provide semantic targets like “The final date by which payment must be made,” and models use deep language understanding and document conventions to find and correctly label corresponding dates.

    Real-world success: Global paper leader Suzano International handled purchase orders from over 70 customers across hundreds of different templates and formats, including PDFs, emails, and scanned Excel sheet images. Template-based approaches were impossible. Using template-agnostic, AI-driven solutions, they automated entire processes within single workflows, reducing purchase order processing time by 90%—from 8 minutes to 48 seconds.

    “The unique aspect of Nanonets… was its ability to handle different templates as well as different formats of the document, which is quite unique from its competitors that create OCR models based specific to a single format in one automation.” — Cristinel Tudorel Chiriac, Project Manager

    Challenge 3: Trust, verification, and hallucinations

    Even powerful AI models can be “black boxes,” making it difficult to understand their extraction reasoning. More critically, VLMs can hallucinate — inventing plausible-looking data that isn’t actually in documents. This introduces unacceptable risk in business-critical workflows.

    Solution: Building trust through transparency and human oversight rather than just better models. Modern parsing platforms address this by:

    • Providing confidence scores: Every extracted field includes certainty scores, enabling automatic flagging of anything below defined thresholds for review
    • Visual grounding: Linking extracted data back to precise original document locations for instant verification
    • Human-in-the-loop workflows: Creating seamless processes where low-confidence or flagged documents automatically route to humans for verification

    Real-world success: UK-based Ascend Properties experienced explosive 50% year-over-year growth, but manual invoice processing couldn’t scale. They needed a trustworthy system to handle the volume without a massive data-entry team expansion. By implementing an AI platform with reliable human-in-the-loop workflows, they automated their processes and avoided hiring four additional full-time employees, saving over 80% in processing costs.

    “Our business grew 5x in the last 4 years; to process invoices manually would mean a 5x increase in staff. This was neither cost-effective nor a scalable way to grow. Nanonets helped us avoid such an increase in staff.” — David Giovanni, CEO

    These real-world examples demonstrate that while challenges are significant, practical solutions exist and deliver measurable business value when properly implemented.


    Final thoughts

    The field is evolving rapidly toward document reasoning rather than simple parsing. We’re entering an era of agentic AI systems that will not only extract data but also reason about it, answer complex questions, summarize content across multiple documents, and perform actions based on what they read.

    Imagine an agent that reads new vendor contracts, compares terms against company legal policies, flags non-compliant clauses, and drafts summary emails to legal teams — all automatically. This future is closer than you might think.

    The foundation you build today with robust document parsing will enable these advanced capabilities tomorrow. Whether you choose open-source libraries for maximum control or commercial platforms for immediate productivity, the key is starting with clean, accurate data extraction that can evolve with emerging technologies.


    FAQs

    What is the difference between document parsing and OCR?

    Optical Character Recognition (OCR) is the foundational technology that converts the text in an image into machine-readable characters. Think of it as transcription. Document parsing is the next layer of intelligence; it takes that raw text and analyzes the document’s layout and context to understand its structure, identifying and extracting specific data fields like an invoice_number or a due_date into an organized format. OCR reads the words; parsing understands what they mean.

    Should I use an open-source library or a commercial platform for document parsing?

    The choice depends on your team’s resources and goals. Open-source libraries (like docstrange) are ideal for development teams who need maximum control and flexibility to build a custom solution, but they require significant engineering effort to maintain. Commercial platforms (like Nanonets) are better for businesses that need a reliable, secure, and ready-to-use solution with a full automated workflow, including a user interface, integrations, and support, without the heavy engineering lift.

    How do modern tools handle complex tables that span multiple pages?

    This is a classic failure point for older tools, but modern parsers solve this using visual layout understanding. Vision-Language Models (VLMs) don’t just read text page by page; they see the document visually. They recognize a table as a single object and can track its structure across a page break, correctly associating the rows on the second page with the headers from the first.

    Can document parsing automate invoice processing for an accounts payable team?

    Yes, this is one of the most common and high-value use cases. A modern document parsing workflow can completely automate the AP process by:

    • Automatically ingesting invoices from an email inbox.
    • Using a pre-trained AI model to accurately extract all necessary data, including line items.
    • Validating the data with custom rules (e.g., flagging invoices over a certain amount).
    • Exporting the verified data directly into accounting software like QuickBooks or an ERP system.

    This process, as demonstrated by companies like Hometown Holdings, can save thousands of employee hours annually and significantly increase operational income.

    What is a “zero-shot” document parsing model?

    A “zero-shot” model is an AI model that can extract information from a document format it has never been specifically trained on. Instead of needing 10-15 examples to learn a new document type, you can simply provide it with a clear, text-based description (a “prompt”) for the field you want to find. For example, you can tell it, “Find the final date by which the payment must be made,” and the model will use its broad understanding of documents to locate and extract the due_date.
