
    How to Build an Advanced BrightData Web Scraper with Google Gemini for AI-Powered Data Extraction

    June 18, 2025

    In this tutorial, we walk you through building an enhanced web scraping tool that leverages BrightData’s proxy network alongside Google’s Gemini API for intelligent data extraction. You’ll see how to structure your Python project, install and import the necessary libraries, and encapsulate the scraping logic in a clean, reusable BrightDataScraper class. Whether you’re targeting Amazon product pages, bestseller listings, or LinkedIn profiles, the scraper’s modular methods demonstrate how to configure scraping parameters, handle errors gracefully, and return structured JSON results. An optional ReAct-style AI agent integration also shows you how to combine LLM-driven reasoning with real-time scraping, letting you pose natural-language queries for on-the-fly data analysis.

    !pip install langchain-brightdata langchain-google-genai langgraph langchain-core google-generativeai

    We install all of the key libraries needed for the tutorial in one step: langchain-brightdata for BrightData web scraping, langchain-google-genai and google-generativeai for Google Gemini integration, langgraph for agent orchestration, and langchain-core for the core LangChain framework.

    import os
    import json
    from typing import Dict, Any, Optional
    from langchain_brightdata import BrightDataWebScraperAPI
    from langchain_google_genai import ChatGoogleGenerativeAI
    from langgraph.prebuilt import create_react_agent

    These imports prepare your environment and core functionality: os and json handle system operations and data serialization, while typing provides structured type hints. You then bring in BrightDataWebScraperAPI for BrightData scraping, ChatGoogleGenerativeAI to interface with Google’s Gemini LLM, and create_react_agent to orchestrate these components in a ReAct-style agent.

    class BrightDataScraper:
        """Enhanced web scraper using BrightData API"""
       
        def __init__(self, api_key: str, google_api_key: Optional[str] = None):
            """Initialize scraper with API keys"""
            self.api_key = api_key
            self.scraper = BrightDataWebScraperAPI(bright_data_api_key=api_key)
           
            if google_api_key:
                self.llm = ChatGoogleGenerativeAI(
                    model="gemini-2.0-flash",
                    google_api_key=google_api_key
                )
                self.agent = create_react_agent(self.llm, [self.scraper])
       
        def scrape_amazon_product(self, url: str, zipcode: str = "10001") -> Dict[str, Any]:
            """Scrape Amazon product data"""
            try:
                results = self.scraper.invoke({
                    "url": url,
                    "dataset_type": "amazon_product",
                    "zipcode": zipcode
                })
                return {"success": True, "data": results}
            except Exception as e:
                return {"success": False, "error": str(e)}
       
        def scrape_amazon_bestsellers(self, region: str = "in") -> Dict[str, Any]:
            """Scrape Amazon bestsellers"""
            try:
                url = f"https://www.amazon.{region}/gp/bestsellers/"
                results = self.scraper.invoke({
                    "url": url,
                    "dataset_type": "amazon_product"
                })
                return {"success": True, "data": results}
            except Exception as e:
                return {"success": False, "error": str(e)}
       
        def scrape_linkedin_profile(self, url: str) -> Dict[str, Any]:
            """Scrape LinkedIn profile data"""
            try:
                results = self.scraper.invoke({
                    "url": url,
                    "dataset_type": "linkedin_person_profile"
                })
                return {"success": True, "data": results}
            except Exception as e:
                return {"success": False, "error": str(e)}
       
        def run_agent_query(self, query: str) -> None:
            """Run AI agent with natural language query"""
            if not hasattr(self, 'agent'):
                print("Error: Google API key required for agent functionality")
                return
           
            try:
                for step in self.agent.stream(
                    {"messages": [("user", query)]},
                    stream_mode="values"
                ):
                    step["messages"][-1].pretty_print()
            except Exception as e:
                print(f"Agent error: {e}")
       
        def print_results(self, results: Dict[str, Any], title: str = "Results") -> None:
            """Pretty print results"""
            print(f"\n{'='*50}")
            print(f"{title}")
            print(f"{'='*50}")
           
            if results["success"]:
                print(json.dumps(results["data"], indent=2, ensure_ascii=False))
            else:
                print(f"Error: {results['error']}")
            print()

    The BrightDataScraper class encapsulates all of the BrightData web-scraping logic, plus optional Gemini-powered intelligence, behind a single reusable interface. Its methods let you fetch Amazon product details, bestseller lists, and LinkedIn profiles, taking care of API calls, error handling, and JSON formatting, and can even stream natural-language “agent” queries when a Google API key is provided. A convenient print_results helper keeps the output cleanly formatted for inspection.
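
    To see the class in action on its own, here is a minimal usage sketch; the key below is a placeholder, so substitute your own BrightData API key (agent features additionally need a Google API key):

    # Minimal usage sketch -- "YOUR_BRIGHT_DATA_API_KEY" is a placeholder, not a real key.
    scraper = BrightDataScraper(api_key="YOUR_BRIGHT_DATA_API_KEY")
    product = scraper.scrape_amazon_product("https://www.amazon.com/dp/B08L5TNJHG")
    scraper.print_results(product, "Product Lookup")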

    def main():
        """Main execution function"""
        BRIGHT_DATA_API_KEY = "Use Your Own API Key"
        GOOGLE_API_KEY = "Use Your Own API Key"
       
        scraper = BrightDataScraper(BRIGHT_DATA_API_KEY, GOOGLE_API_KEY)
       
        print("🛍 Scraping Amazon India Bestsellers...")
        bestsellers = scraper.scrape_amazon_bestsellers("in")
        scraper.print_results(bestsellers, "Amazon India Bestsellers")
       
        print("📦 Scraping Amazon Product...")
        product_url = "https://www.amazon.com/dp/B08L5TNJHG"
        product_data = scraper.scrape_amazon_product(product_url, "10001")
        scraper.print_results(product_data, "Amazon Product Data")
       
        print("👤 Scraping LinkedIn Profile...")
        linkedin_url = "https://www.linkedin.com/in/satyanadella/"
        linkedin_data = scraper.scrape_linkedin_profile(linkedin_url)
        scraper.print_results(linkedin_data, "LinkedIn Profile Data")
       
        print("🤖 Running AI Agent Query...")
        agent_query = """
        Scrape Amazon product data for https://www.amazon.com/dp/B0D2Q9397Y?th=1
        in New York (zipcode 10001) and summarize the key product details.
        """
        scraper.run_agent_query(agent_query)
    

    The main() function ties everything together by setting your BrightData and Google API keys, instantiating the BrightDataScraper, and then demonstrating each feature: it scrapes Amazon India’s bestsellers, fetches details for a specific product, retrieves a LinkedIn profile, and finally runs a natural-language agent query, printing neatly formatted results after each step.

    if __name__ == "__main__":
        print("Installing required packages...")
        os.system("pip install -q langchain-brightdata langchain-google-genai langgraph")
       
        os.environ["BRIGHT_DATA_API_KEY"] = "Use Your Own API Key"
       
        main()
    

    Finally, this entry-point block ensures that, when the file is run as a standalone script, the required scraping libraries are quietly installed and the BrightData API key is set in the environment before main() kicks off all of the scraping and agent workflows.
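
    As a hardening note, rather than hardcoding keys in source, you could read them from environment variables and fail fast when they’re missing. This is a minimal sketch, not part of the original script:

    import os

    # Read keys from the environment instead of embedding them in the source file.
    BRIGHT_DATA_API_KEY = os.environ.get("BRIGHT_DATA_API_KEY")
    GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY")  # optional; only needed for agent queries

    if not BRIGHT_DATA_API_KEY:
        raise SystemExit("Set BRIGHT_DATA_API_KEY before running this script.")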

    By the end of this tutorial, you’ll have a ready-to-use Python script that automates tedious data-collection tasks, abstracts away low-level API details, and optionally taps into generative AI for advanced query handling. You can extend this foundation by adding support for other dataset types, integrating additional LLMs, or deploying the scraper as part of a larger data pipeline or web service. With these building blocks in place, you’re equipped to gather, analyze, and present web data more efficiently, whether for market research, competitive intelligence, or custom AI-driven applications.
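
    For instance, supporting another dataset type could be as simple as adding a generic method to the class that mirrors the existing ones. This is a hypothetical sketch; the valid dataset_type values depend on BrightData’s API:

    # Hypothetical extension: a generic method mirroring scrape_amazon_product and friends.
    def scrape(self, url: str, dataset_type: str, **params) -> Dict[str, Any]:
        """Scrape any supported BrightData dataset type (hypothetical helper)."""
        try:
            results = self.scraper.invoke({"url": url, "dataset_type": dataset_type, **params})
            return {"success": True, "data": results}
        except Exception as e:
            return {"success": False, "error": str(e)}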


    Check out the Notebook. All credit for this research goes to the researchers of this project.
