
    A Code Implementation to Efficiently Leverage LangChain to Automate PubMed Literature Searches, Parsing, and Trend Visualization

    July 24, 2025

    In this tutorial, we introduce the Advanced PubMed Research Assistant and walk through building a streamlined pipeline for querying and analyzing biomedical literature. We focus on leveraging the PubmedQueryRun tool to perform targeted searches, such as “CRISPR gene editing,” and then parse, cache, and explore those results. You’ll learn how to extract publication dates, titles, and summaries; store queries for instant reuse; and prepare your data for visualization or further analysis.

    !pip install -q langchain-community xmltodict pandas matplotlib seaborn wordcloud google-generativeai langchain-google-genai
    
    
    import os
    import re
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from datetime import datetime, timedelta
    from collections import Counter
    from wordcloud import WordCloud
    import warnings
    warnings.filterwarnings('ignore')
    
    
    from langchain_community.tools.pubmed.tool import PubmedQueryRun
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langchain.agents import initialize_agent, Tool
    from langchain.agents import AgentType

    We install and configure all the essential Python packages, including langchain-community, xmltodict, pandas, matplotlib, seaborn, and wordcloud, as well as Google Generative AI and LangChain Google integrations. We import core data‑processing and visualization libraries, silence warnings, and bring in the PubmedQueryRun tool and ChatGoogleGenerativeAI client. Finally, we prepare to initialize our LangChain agent with the PubMed search capability.
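    Before wiring the tool into a class, it helps to see the shape of the text that PubmedQueryRun returns. The format below is an assumption inferred from the parser later in this tutorial ("Published:", "Title:", and "Summary::" markers); the actual output may vary between langchain-community versions, so treat this as a minimal offline sketch of the parsing idea rather than a guaranteed contract:

    ```python
    # Sample text in the format the parser below expects (assumed, not guaranteed).
    sample = (
        "Published: 2024-05-01\n"
        "Title: CRISPR advances\n"
        "Summary::\n"
        "A short abstract about gene editing."
    )

    # Pull out the date and title lines by their prefixes.
    record = {}
    for line in sample.split("\n"):
        if line.startswith("Published: "):
            record["date"] = line.replace("Published: ", "")
        elif line.startswith("Title: "):
            record["title"] = line.replace("Title: ", "")

    print(record)
    ```

    Running this against real PubmedQueryRun output instead of `sample` is the core of what the `_parse_pubmed_results` method does below.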

    class AdvancedPubMedResearcher:
        """Advanced PubMed research assistant with analysis capabilities"""
       
        def __init__(self, gemini_api_key=None):
            """Initialize the researcher with optional Gemini integration"""
            self.pubmed_tool = PubmedQueryRun()
            self.research_cache = {}
           
            if gemini_api_key:
                os.environ["GOOGLE_API_KEY"] = gemini_api_key
                self.llm = ChatGoogleGenerativeAI(
                    model="gemini-1.5-flash",
                    temperature=0,
                    convert_system_message_to_human=True
                )
                self.agent = self._create_agent()
            else:
                self.llm = None
                self.agent = None
       
        def _create_agent(self):
            """Create LangChain agent with PubMed tool"""
            tools = [
                Tool(
                    name="PubMed Search",
                    func=self.pubmed_tool.invoke,
                    description="Search PubMed for biomedical literature. Use specific terms."
                )
            ]
           
            return initialize_agent(
                tools,
                self.llm,
                agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                verbose=True
            )
       
        def search_papers(self, query, max_results=5):
            """Search PubMed and parse results"""
            print(f"🔍 Searching PubMed for: '{query}'")
           
            try:
                results = self.pubmed_tool.invoke(query)
                papers = self._parse_pubmed_results(results)
               
                self.research_cache[query] = {
                    'papers': papers,
                    'timestamp': datetime.now(),
                    'query': query
                }
               
                print(f"✅ Found {len(papers)} papers")
                return papers
               
            except Exception as e:
                print(f"❌ Error searching PubMed: {str(e)}")
                return []
       
        def _parse_pubmed_results(self, results):
            """Parse PubMed search results into structured data"""
            papers = []
           
            publications = results.split('\n\nPublished: ')[1:]
           
            for pub in publications:
                try:
                    lines = pub.strip().split('\n')
                   
                    pub_date = lines[0] if lines else "Unknown"
                   
                    title_line = next((line for line in lines if line.startswith('Title: ')), '')
                    title = title_line.replace('Title: ', '') if title_line else "Unknown Title"
                   
                    summary_start = None
                    for i, line in enumerate(lines):
                        if 'Summary::' in line:
                            summary_start = i + 1
                            break
                   
                    summary = ""
                    if summary_start:
                        summary = ' '.join(lines[summary_start:])
                   
                    papers.append({
                        'date': pub_date,
                        'title': title,
                        'summary': summary,
                        'word_count': len(summary.split()) if summary else 0
                    })
                   
                except Exception as e:
                    print(f"⚠️ Error parsing paper: {str(e)}")
                    continue
           
            return papers
       
        def analyze_research_trends(self, queries):
            """Analyze trends across multiple research topics"""
            print("📊 Analyzing research trends...")
           
            all_papers = []
            topic_counts = {}
           
            for query in queries:
                papers = self.search_papers(query, max_results=3)
                topic_counts[query] = len(papers)
               
                for paper in papers:
                    paper['topic'] = query
                    all_papers.append(paper)
           
            df = pd.DataFrame(all_papers)
           
            if df.empty:
                print("❌ No papers found for analysis")
                return None
           
            self._create_visualizations(df, topic_counts)
           
            return df
       
        def _create_visualizations(self, df, topic_counts):
            """Create research trend visualizations"""
            plt.style.use('seaborn-v0_8')
            fig, axes = plt.subplots(2, 2, figsize=(15, 12))
            fig.suptitle('PubMed Research Analysis Dashboard', fontsize=16, fontweight='bold')
           
            topics = list(topic_counts.keys())
            counts = list(topic_counts.values())
           
            axes[0,0].bar(range(len(topics)), counts, color='skyblue', alpha=0.7)
            axes[0,0].set_xlabel('Research Topics')
            axes[0,0].set_ylabel('Number of Papers')
            axes[0,0].set_title('Papers Found by Topic')
            axes[0,0].set_xticks(range(len(topics)))
            axes[0,0].set_xticklabels([t[:20]+'...' if len(t)>20 else t for t in topics], rotation=45, ha='right')
           
            if 'word_count' in df.columns and not df['word_count'].empty:
                axes[0,1].hist(df['word_count'], bins=10, color='lightcoral', alpha=0.7)
                axes[0,1].set_xlabel('Abstract Word Count')
                axes[0,1].set_ylabel('Frequency')
                axes[0,1].set_title('Distribution of Abstract Lengths')
           
            try:
                dates = pd.to_datetime(df['date'], errors='coerce')
                valid_dates = dates.dropna()
                if not valid_dates.empty:
                    axes[1,0].hist(valid_dates, bins=10, color='lightgreen', alpha=0.7)
                    axes[1,0].set_xlabel('Publication Date')
                    axes[1,0].set_ylabel('Number of Papers')
                    axes[1,0].set_title('Publication Timeline')
                    plt.setp(axes[1,0].xaxis.get_majorticklabels(), rotation=45)
            except Exception:
                axes[1,0].text(0.5, 0.5, 'Date parsing unavailable', ha='center', va='center', transform=axes[1,0].transAxes)
           
            all_titles = ' '.join(df['title'].fillna('').astype(str))
            if all_titles.strip():
                clean_titles = re.sub(r'[^a-zA-Z\s]', '', all_titles.lower())
               
                try:
                    wordcloud = WordCloud(width=400, height=300, background_color='white',
                                        max_words=50, colormap='viridis').generate(clean_titles)
                    axes[1,1].imshow(wordcloud, interpolation='bilinear')
                    axes[1,1].axis('off')
                    axes[1,1].set_title('Common Words in Titles')
                except Exception:
                    axes[1,1].text(0.5, 0.5, 'Word cloud unavailable', ha='center', va='center', transform=axes[1,1].transAxes)
           
            plt.tight_layout()
            plt.show()
       
        def comparative_analysis(self, topic1, topic2):
            """Compare two research topics"""
            print(f"🔬 Comparing '{topic1}' vs '{topic2}'")
           
            papers1 = self.search_papers(topic1)
            papers2 = self.search_papers(topic2)
           
            avg_length1 = sum(p['word_count'] for p in papers1) / len(papers1) if papers1 else 0
            avg_length2 = sum(p['word_count'] for p in papers2) / len(papers2) if papers2 else 0
           
            print("\n📈 Comparison Results:")
            print(f"Topic 1 ({topic1}):")
            print(f"  - Papers found: {len(papers1)}")
            print(f"  - Avg abstract length: {avg_length1:.1f} words")
           
            print(f"\nTopic 2 ({topic2}):")
            print(f"  - Papers found: {len(papers2)}")
            print(f"  - Avg abstract length: {avg_length2:.1f} words")
           
            return papers1, papers2
       
        def intelligent_query(self, question):
            """Use AI agent to answer research questions (requires Gemini API)"""
            if not self.agent:
                print("❌ AI agent not available. Please provide Gemini API key.")
                print("💡 Get free API key at: https://makersuite.google.com/app/apikey")
                return None
           
            print(f"🤖 Processing intelligent query with Gemini: '{question}'")
            try:
                response = self.agent.run(question)
                return response
            except Exception as e:
                print(f"❌ Error with AI query: {str(e)}")
                return None
    

    We encapsulate the PubMed querying workflow in our AdvancedPubMedResearcher class, initializing the PubmedQueryRun tool and an optional Gemini-powered LLM agent for advanced analysis. We provide methods to search for papers, parse and cache results, analyze research trends with rich visualizations, and compare topics side by side. This class streamlines programmatic exploration of biomedical literature and intelligent querying in just a few method calls.
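    The `research_cache` dictionary stores each query's parsed papers together with a timestamp, which makes instant reuse possible. A minimal standalone sketch of that reuse pattern (the helper name `get_cached` and the one-hour freshness window are our own additions, not part of the class above):

    ```python
    from datetime import datetime, timedelta

    # Hypothetical cache-reuse helper mirroring research_cache's entry layout:
    # each entry holds 'papers', 'timestamp', and 'query'.
    def get_cached(cache, query, max_age=timedelta(hours=1)):
        entry = cache.get(query)
        if entry and datetime.now() - entry["timestamp"] < max_age:
            return entry["papers"]  # fresh hit: skip the network call
        return None  # miss or stale: caller should re-run the search

    cache = {
        "CRISPR gene editing": {
            "papers": [{"title": "Example paper"}],
            "timestamp": datetime.now(),
            "query": "CRISPR gene editing",
        }
    }
    print(get_cached(cache, "CRISPR gene editing"))
    ```

    Adding a staleness check like this is optional; the class as written keeps cached results indefinitely for the lifetime of the object.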

    def main():
        """Main tutorial demonstration"""
        print("🚀 Advanced PubMed Research Assistant Tutorial")
        print("=" * 50)
       
        # Initialize researcher
        # Uncomment next line and add your free Gemini API key for AI features
        # Get your free API key at: https://makersuite.google.com/app/apikey
        # researcher = AdvancedPubMedResearcher(gemini_api_key="your-gemini-api-key")
        researcher = AdvancedPubMedResearcher()
       
        print("\n1️⃣ Basic PubMed Search")
        papers = researcher.search_papers("CRISPR gene editing", max_results=3)
       
        if papers:
            print("\nFirst paper preview:")
            print(f"Title: {papers[0]['title']}")
            print(f"Date: {papers[0]['date']}")
            print(f"Summary preview: {papers[0]['summary'][:200]}...")
    
    
        print("\n\n2️⃣ Research Trends Analysis")
        research_topics = [
            "machine learning healthcare",
            "CRISPR gene editing",
            "COVID-19 vaccine"
        ]
       
        df = researcher.analyze_research_trends(research_topics)
       
        if df is not None:
            print(f"\nDataFrame shape: {df.shape}")
            print("\nSample data:")
            print(df[['topic', 'title', 'word_count']].head())
    
    
        print("\n\n3️⃣ Comparative Analysis")
        papers1, papers2 = researcher.comparative_analysis(
            "artificial intelligence diagnosis",
            "traditional diagnostic methods"
        )
       
        print("\n\n4️⃣ Advanced Features")
        print("Cache contents:", list(researcher.research_cache.keys()))
       
        if researcher.research_cache:
            latest_query = list(researcher.research_cache.keys())[-1]
            cached_data = researcher.research_cache[latest_query]
            print(f"Latest cached query: '{latest_query}'")
            print(f"Cached papers count: {len(cached_data['papers'])}")
       
        print("\n✅ Tutorial complete!")
        print("\nNext steps:")
        print("- Add your FREE Gemini API key for AI-powered analysis")
        print("  Get it at: https://makersuite.google.com/app/apikey")
        print("- Customize queries for your research domain")
        print("- Export results to CSV with: df.to_csv('research_results.csv')")
       
        print("\n🎁 Bonus: To test AI features, run:")
        print("researcher = AdvancedPubMedResearcher(gemini_api_key='your-key')")
        print("response = researcher.intelligent_query('What are the latest breakthroughs in cancer treatment?')")
        print("print(response)")
    
    
    if __name__ == "__main__":
        main()
    

    We implement the main function to orchestrate the full tutorial demo, guiding users through basic PubMed searches, multi‑topic trend analyses, comparative studies, and cache inspection in a clear, numbered sequence. We wrap up by highlighting the next steps, including adding your Gemini API key for AI features, customizing queries to your domain, and exporting results to CSV, along with a bonus snippet for running intelligent, Gemini-powered research queries.
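    The suggested `df.to_csv('research_results.csv')` export works directly when you have the pandas DataFrame from `analyze_research_trends`. If you instead want to export papers straight from the cache without pandas, a standard-library sketch looks like this (the `papers` sample data here is illustrative, not real search output):

    ```python
    import csv
    import io

    # Illustrative paper records, in the shape produced by _parse_pubmed_results.
    papers = [
        {"date": "2024-05-01", "title": "CRISPR advances", "word_count": 42},
        {"date": "2024-06-12", "title": "mRNA vaccines", "word_count": 37},
    ]

    # Write to an in-memory buffer; swap io.StringIO for open("out.csv", "w", newline="")
    # to write a real file.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["date", "title", "word_count"])
    writer.writeheader()
    writer.writerows(papers)
    print(buf.getvalue())
    ```

    Note that this drops the (often long) `summary` field; include it in `fieldnames` if you want full abstracts in the CSV.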

    In conclusion, we have now demonstrated how to harness the power of PubMed programmatically, from crafting precise search queries to parsing and caching results for quick retrieval. By following these steps, you can automate your literature review process, track research trends over time, and integrate advanced analyses into your workflows. We encourage you to experiment with different search terms, dive into the cached results, and extend this framework to support your ongoing biomedical research.




    The post A Code Implementation to Efficiently Leverage LangChain to Automate PubMed Literature Searches, Parsing, and Trend Visualization appeared first on MarkTechPost.

