    Google AI Releases LangExtract: An Open Source Python Library that Extracts Structured Data from Unstructured Text Documents

    August 6, 2025

    Valuable information is often buried in unstructured text such as clinical notes, lengthy legal contracts, or customer feedback threads. Extracting it in a form that can be traced back to the source is both a technical and a practical challenge. LangExtract, a new open-source Python library from Google AI, addresses this gap by using LLMs such as Gemini to automate extraction while keeping every result traceable to the original text.

    Key Innovations of LangExtract

    1. Declarative and Traceable Extraction

    LangExtract lets users define custom extraction tasks using natural language instructions and high-quality “few-shot” examples. This empowers developers and analysts to specify exactly which entities, relationships, or facts to extract, and in what structure. Crucially, every extracted piece of information is tied directly back to its source text—enabling validation, auditing, and end-to-end traceability.
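
To make the idea of source grounding concrete, here is a minimal sketch in plain Python. It is not LangExtract's implementation: the `GroundedSpan` class and the `ground` helper are hypothetical names introduced for illustration, and the sketch assumes extractions are copied verbatim from the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundedSpan:
    """Hypothetical record: an extracted string plus its location in the source."""
    label: str
    text: str
    start: Optional[int]  # inclusive character offset, None if not found
    end: Optional[int]    # exclusive character offset, None if not found


def ground(label: str, extracted: str, source: str, search_from: int = 0) -> GroundedSpan:
    """Locate `extracted` verbatim in `source` and record its character offsets."""
    start = source.find(extracted, search_from)
    if start == -1:
        # Not present verbatim: flag it instead of guessing a location.
        return GroundedSpan(label, extracted, None, None)
    return GroundedSpan(label, extracted, start, start + len(extracted))


source = "ROMEO. But soft! What light through yonder window breaks?"
span = ground("character", "ROMEO", source)
emotion = ground("emotion", "But soft!", source)
ungrounded = ground("character", "Mercutio", source)

# Slicing the source with the stored offsets reproduces the extraction exactly,
# which is what makes each result auditable.
print(source[span.start:span.end] == span.text)
print(ungrounded.start is None)
```

An extraction that cannot be located, such as "Mercutio" here, is kept but flagged, so a reviewer can tell grounded results from ungrounded ones.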

    2. Domain Versatility

    The library is intended for real-world domains as well as demos, including health (clinical notes, medical reports), finance (summaries, risk documents), law (contracts), research literature, and even the arts (analyzing Shakespeare). Example use cases include automatic extraction of medications, dosages, and administration details from clinical documents, as well as relationships and emotions from plays or literature.

    3. Schema Enforcement with LLMs

    Powered by Gemini and compatible with other LLMs, LangExtract enforces custom output schemas (such as JSON), so results can be loaded directly into downstream databases, analytics, or AI pipelines. It mitigates common LLM weaknesses such as hallucination and schema drift by grounding outputs in both the user's instructions and the actual source text.
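
As a rough illustration of what checking output against a schema and against the source text can look like, the sketch below validates a JSON list of extractions in plain Python. It is a downstream check written for this article, not the mechanism LangExtract uses internally; the `SCHEMA` mapping and the `validate` function are hypothetical, and the field names simply mirror those in the example workflow below.

```python
import json
from typing import List

# Hypothetical schema: allowed classes and the attribute keys each one requires.
SCHEMA = {
    "character": {"emotional_state"},
    "emotion": {"feeling"},
    "relationship": {"type"},
}


def validate(raw_json: str, source: str) -> List[str]:
    """Return a list of problems found in a model's JSON output (empty if none)."""
    try:
        items = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(items, list):
        return ["top-level value must be a list"]

    problems = []
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"item {i}: not an object")
            continue
        cls = item.get("extraction_class")
        text = item.get("extraction_text")
        attrs = item.get("attributes") or {}
        if cls not in SCHEMA:
            problems.append(f"item {i}: unknown class {cls!r}")
            continue
        missing = SCHEMA[cls] - set(attrs)
        if missing:
            problems.append(f"item {i}: missing attributes {sorted(missing)}")
        # Grounding check: the extracted text must occur verbatim in the source.
        if not text or text not in source:
            problems.append(f"item {i}: text not found in source")
    return problems


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
good = json.dumps([
    {"extraction_class": "character", "extraction_text": "Lady Juliet",
     "attributes": {"emotional_state": "longing"}},
])
bad = json.dumps([
    {"extraction_class": "character", "extraction_text": "Tybalt", "attributes": {}},
    {"extraction_class": "place", "extraction_text": "stars", "attributes": {}},
])

print(validate(good, source))
for problem in validate(bad, source):
    print(problem)
```

The second input fails on three counts: a missing attribute, a string that does not occur in the source, and a class outside the schema.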

    4. Scalability and Visualization

    • Handles Large Volumes: LangExtract efficiently processes long documents by chunking, parallelizing, and aggregating results.
    • Interactive Visualization: Developers can generate interactive HTML reports, viewing each extracted entity with context by highlighting its location in the original document—making auditing and error analysis seamless.
    • Smooth Integration: Works in Google Colab, Jupyter, or as standalone HTML files, supporting a rapid feedback loop for developers and researchers.
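
The chunk, parallelize, and aggregate pattern mentioned in the first bullet can be sketched as follows. This is an assumption-laden illustration rather than LangExtract's code: `chunk_text`, `extract_from_chunk`, and `extract_long` are hypothetical helpers, the chunk sizes are arbitrary, and a regular expression stands in for the LLM call so the example runs offline.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text, max_chars, overlap=0):
    """Split text into (offset, chunk) pairs of at most max_chars characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap
    return [(start, text[start:start + max_chars]) for start in range(0, len(text), step)]


def extract_from_chunk(chunk):
    """Stand-in for an LLM call: returns (start, end, text) with chunk-local offsets."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"Romeo|Juliet", chunk)]


def extract_long(text, max_chars=40, overlap=10, workers=4):
    chunks = chunk_text(text, max_chars, overlap)
    # Process chunks concurrently; pool.map preserves the input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda pair: extract_from_chunk(pair[1]), chunks))
    # Aggregate: shift local offsets to document offsets and drop duplicates
    # that appear because neighbouring chunks overlap.
    merged = set()
    for (offset, _), found in zip(chunks, per_chunk):
        for start, end, value in found:
            merged.add((offset + start, offset + end, value))
    return sorted(merged)


text = ("Juliet waits on the balcony while Romeo hides in the orchard below. "
        "Juliet calls out.")
result = extract_long(text)
for start, end, value in result:
    print(start, end, value)
```

The overlap should be at least as long as the longest span you expect; otherwise an entity that straddles a chunk boundary can be missed.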

    5. Installation and Usage

    Install easily with pip:

    pip install langextract
    

    Example Workflow (Extracting Character Info from Shakespeare):

    import langextract as lx
    import textwrap
    
    # 1. Define your prompt
    prompt = textwrap.dedent("""
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.
    """)
    
    # 2. Give a high-quality example
    examples = [
        lx.data.ExampleData(
            text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
            extractions=[
                lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
                lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
                lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
            ],
        )
    ]
    
    # 3. Extract from new text
    input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
    
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-pro"
    )
    
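    # 4. Save and visualize results
    # output_dir="." keeps the JSONL file in the working directory, so that
    # lx.visualize() below can find it at the same relative path.
    lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
    html_content = lx.visualize("extraction_results.jsonl")
    with open("visualization.html", "w", encoding="utf-8") as f:
        # In notebooks the return value may be an HTML display object; write its .data if present.
        f.write(getattr(html_content, "data", html_content))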
    

    This results in structured, source-anchored JSON outputs, plus an interactive HTML visualization for easy review and demonstration.
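
To show the principle behind a source-anchored visualization, the following sketch wraps each extracted span in a `<mark>` tag. It is a simplified stand-in, not the renderer behind `lx.visualize`: the `highlight` function and the span tuples are hypothetical, and overlapping spans are simply skipped.

```python
import html


def highlight(source, spans):
    """Wrap each (start, end, label) span of `source` in a <mark> tag."""
    parts, cursor = [], 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip overlapping spans to keep the markup well-formed
        parts.append(html.escape(source[cursor:start]))
        parts.append(
            f'<mark title="{html.escape(label)}">{html.escape(source[start:end])}</mark>'
        )
        cursor = end
    parts.append(html.escape(source[cursor:]))
    return "".join(parts)


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
# Offsets are computed from the text itself so they cannot drift out of sync.
spans = [
    (source.index(name), source.index(name) + len(name), "character")
    for name in ("Lady Juliet", "Romeo")
]

page = highlight(source, spans)
print(page)
```

Writing `page` into an HTML file gives a static view in which hovering over a highlight shows its class.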

    Specialized & Real-World Applications

    • Medicine: Extracts medications, dosages, and timing, and links each back to its source sentence. The approach draws on earlier research into accelerating medical information extraction and applies directly to structuring clinical and radiology reports, which improves clarity and supports interoperability.
    • Finance & Law: Automatically pulls relevant clauses, terms, or risks from dense legal or financial text, ensuring every output can be traced back to its context.
    • Research & Data Mining: Streamlines high-throughput extraction from thousands of scientific papers.

    The team even provides a demonstration called RadExtract for structuring radiology reports—highlighting not just what was extracted, but exactly where the information appeared in the original input.
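
Once medications, dosages, and timing have been extracted, they usually need to be assembled into one record per medication. The sketch below shows one way to do that in plain Python. Everything in it is illustrative: the sample extractions are made up, and the `group` attribute is an assumed convention for tying related entities together, not a documented LangExtract field.

```python
from collections import defaultdict

# Made-up extraction output: (class, text, attributes) triples.
extractions = [
    ("medication", "Lisinopril", {"group": "Lisinopril"}),
    ("dosage", "10mg", {"group": "Lisinopril"}),
    ("frequency", "daily", {"group": "Lisinopril"}),
    ("medication", "Metformin", {"group": "Metformin"}),
    ("dosage", "500mg", {"group": "Metformin"}),
]


def group_records(items):
    """Collect entities that share a `group` attribute into one record each."""
    records = defaultdict(dict)
    for cls, text, attrs in items:
        records[attrs["group"]][cls] = text
    return dict(records)


records = group_records(extractions)
for name, fields in records.items():
    print(name, fields)
```

A record with a missing field, such as the frequency for Metformin here, is easy to spot and route to manual review.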

    How LangExtract Compares

    | Feature | Traditional Approaches | LangExtract Approach |
    |---|---|---|
    | Schema Consistency | Often manual/error-prone | Enforced via instructions & few-shot examples |
    | Result Traceability | Minimal | All output linked to input text |
    | Scaling to Long Texts | Windowed, lossy | Chunked + parallel extraction, then aggregation |
    | Visualization | Custom, usually absent | Built-in, interactive HTML reports |
    | Deployment | Rigid, model-specific | Gemini-first, open to other LLMs & on-premises |

    In Summary

    LangExtract offers a practical way to extract structured, actionable data from text, delivering:

    • Declarative, explainable extraction
    • Traceable results backed by source context
    • Instant visualization for rapid iteration
    • Easy integration into any Python workflow

    Check out the GitHub Page and Technical Blog.

    The post Google AI Releases LangExtract: An Open Source Python Library that Extracts Structured Data from Unstructured Text Documents appeared first on MarkTechPost.
