
    A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows

    April 24, 2025

    In this tutorial, we demonstrate how to use Crawl4AI, a modern, Python-based web crawling toolkit, to extract structured data from web pages directly within Google Colab. Using asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI’s built-in AsyncHTTPCrawlerStrategy, we bypass the overhead of headless browsers while still parsing complex HTML via JsonCssExtractionStrategy. In a few lines of code, you install the dependencies (crawl4ai, httpx), configure HTTPCrawlerConfig to request only gzip/deflate (avoiding Brotli issues), define your CSS-to-JSON schema, and orchestrate the crawl through AsyncWebCrawler and CrawlerRunConfig. Finally, the extracted JSON data is loaded into pandas for immediate analysis or export.

    What sets Crawl4AI apart is its unified API, which seamlessly switches between browser-based (Playwright) and HTTP-only strategies, its robust error-handling hooks, and its declarative extraction schemas. Unlike traditional headless-browser workflows, Crawl4AI allows you to choose the most lightweight and performant backend, making it ideal for scalable data pipelines, on-the-fly ETL in notebooks, or feeding LLMs and analytics tools with clean JSON/CSV outputs.

    !pip install -U crawl4ai httpx

    First, we install (or upgrade) Crawl4AI, the core asynchronous crawling framework, alongside HTTPX, a high-performance HTTP client. Together they provide the building blocks we need for lightweight, asynchronous web scraping directly in Colab.

    import asyncio, json, pandas as pd
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
    from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    We import the modules the tutorial relies on: asyncio for concurrency, json for parsing, and pandas for tabular storage, alongside Crawl4AI’s essentials. AsyncWebCrawler drives the crawl, CrawlerRunConfig and HTTPCrawlerConfig configure extraction and HTTP settings, AsyncHTTPCrawlerStrategy provides a browser-free HTTP backend, and JsonCssExtractionStrategy maps CSS selectors into structured JSON.

    http_cfg = HTTPCrawlerConfig(
        method="GET",
        headers={
            "User-Agent":      "crawl4ai-bot/1.0",
            "Accept-Encoding": "gzip, deflate"
        },
        follow_redirects=True,
        verify_ssl=True
    )
    crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)
    

    Here, we instantiate an HTTPCrawlerConfig to define our HTTP crawler’s behavior, using a GET request with a custom User-Agent, gzip/deflate encoding only, automatic redirects, and SSL verification. We then plug that into AsyncHTTPCrawlerStrategy, allowing Crawl4AI to drive the crawl via pure HTTP calls rather than a full browser.
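
    The reason for advertising only gzip and deflate can be sketched with the standard library alone: a server normally compresses with one of the encodings the client lists in Accept-Encoding, and Python can decode gzip and deflate without extra packages, while Brotli ("br") needs a third-party decoder. The decode_body helper below is hypothetical and written only for illustration; it does not show how Crawl4AI decodes responses internally.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Decode an HTTP response body using only the standard library.

    Hypothetical helper for illustration; it is not part of Crawl4AI.
    """
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return body
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # HTTP "deflate" is normally zlib-wrapped data (RFC 1950).
        return zlib.decompress(body)
    # "br" (Brotli) lands here: the standard library ships no Brotli decoder.
    raise ValueError(f"no standard-library decoder for {content_encoding!r}")


payload = b"<div class='quote'>...</div>"
assert decode_body(gzip.compress(payload), "gzip") == payload
assert decode_body(zlib.compress(payload), "deflate") == payload
```

    With "Accept-Encoding": "gzip, deflate" in the request headers, a well-behaved server should not reply with a Brotli body, so the unsupported branch is never reached.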

    schema = {
        "name": "Quotes",
        "baseSelector": "div.quote",
        "fields": [
            {"name": "quote",  "selector": "span.text",      "type": "text"},
            {"name": "author", "selector": "small.author",   "type": "text"},
            {"name": "tags",   "selector": "div.tags a.tag", "type": "text"}
        ]
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
    run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)
    

    We define a JSON-CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), then initialize a JsonCssExtractionStrategy with that schema and wrap it in a CrawlerRunConfig so Crawl4AI knows exactly what structured data to pull on each request.
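
    A quote on this site usually carries several tags, and a field of type "text" may return only the first element its selector matches. If that turns out to be the case in your run, one possible variant is to declare tags as a "list" field with a single sub-field. The sketch below assumes the "list" field type described in Crawl4AI's schema documentation; the variable name schema_with_tag_list and the sub-field name "tag" are illustrative choices, and the behavior has not been verified against the version installed here.

```python
# Assumption: "list" is the field type Crawl4AI's JSON/CSS schemas use for
# repeated elements, with one sub-field extracted per matched element.
schema_with_tag_list = {
    "name": "Quotes",
    "baseSelector": "div.quote",
    "fields": [
        {"name": "quote", "selector": "span.text", "type": "text"},
        {"name": "author", "selector": "small.author", "type": "text"},
        {
            "name": "tags",
            "selector": "div.tags a.tag",
            "type": "list",
            # No selector here: each sub-field reads the matched <a> itself.
            "fields": [{"name": "tag", "type": "text"}],
        },
    ],
}
```

    Under that assumption, each record's tags value would be a list of small objects such as {"tag": "..."} rather than a single string, so any downstream flattening would need to account for it.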

    async def crawl_quotes_http(max_pages=5):
        all_items = []
        async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
            for p in range(1, max_pages + 1):
                url = f"https://quotes.toscrape.com/page/{p}/"

                # Request or crawler errors: report and move on to the next page.
                try:
                    res = await crawler.arun(url=url, config=run_cfg)
                except Exception as e:
                    print(f"❌ Page {p} failed outright: {e}")
                    continue

                # Nothing was extracted for this page.
                if not res.extracted_content:
                    print(f"❌ Page {p} returned no content, skipping")
                    continue

                # extracted_content is a JSON string; parse it into Python objects.
                try:
                    items = json.loads(res.extracted_content)
                except Exception as e:
                    print(f"❌ Page {p} JSON-parse error: {e}")
                    continue

                print(f"✅ Page {p}: {len(items)} quotes")
                all_items.extend(items)

        return pd.DataFrame(all_items)

    This asynchronous function orchestrates the HTTP-only crawl: it spins up an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, awaits crawler.arun(), handles any request or JSON parsing errors, and collects the extracted quote records into a single pandas DataFrame for downstream analysis.
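
    The function awaits one page at a time. When pages are independent, the same skip-and-continue error handling can be applied to requests that run concurrently. The sketch below shows that pattern with asyncio.gather and a semaphore; fetch_page is a hypothetical stand-in for crawler.arun() plus JSON parsing, the simulated failure on page 2 is there only to exercise the error path, and the concurrency limit of 2 is an arbitrary choice.

```python
import asyncio


async def fetch_page(page: int) -> list[dict]:
    """Hypothetical stand-in for `crawler.arun()` plus JSON parsing."""
    await asyncio.sleep(0.01)  # simulate network latency
    if page == 2:
        raise RuntimeError("simulated failure")
    return [{"page": page, "quote": f"quote {i}"} for i in range(3)]


async def crawl_concurrently(max_pages: int = 3, limit: int = 2) -> list[dict]:
    semaphore = asyncio.Semaphore(limit)  # cap the number of in-flight requests

    async def guarded(page: int) -> list[dict]:
        async with semaphore:
            return await fetch_page(page)

    # return_exceptions=True hands failures back as values instead of
    # raising on the first one, so every page gets a result slot.
    results = await asyncio.gather(
        *(guarded(p) for p in range(1, max_pages + 1)),
        return_exceptions=True,
    )

    items: list[dict] = []
    for page, result in enumerate(results, start=1):
        if isinstance(result, Exception):
            print(f"Page {page} failed: {result}")
            continue
        items.extend(result)
    return items
```

    asyncio.gather returns results in the order the awaitables were passed, so the page number can be recovered from the position. Keeping the limit small is a courtesy to the target site as much as a performance setting.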

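    # If this raises "This event loop is already running", the notebook also accepts
    # top-level await:  df = await crawl_quotes_http(max_pages=3)
    df = asyncio.get_event_loop().run_until_complete(crawl_quotes_http(max_pages=3))
    df.head()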
    

    Finally, we kick off the crawl_quotes_http coroutine on Colab’s existing asyncio loop, fetching three pages of quotes, and then display the first few rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected.
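
    Outside a notebook there is no event loop running yet, so a plain script would typically start the coroutine with asyncio.run() instead. The sketch below shows that entry point; crawl_stub is a hypothetical placeholder standing in for crawl_quotes_http so the example runs without network access.

```python
import asyncio


async def crawl_stub(max_pages: int = 3) -> list[int]:
    """Hypothetical placeholder for `crawl_quotes_http`."""
    await asyncio.sleep(0)  # yield to the event loop once
    return list(range(1, max_pages + 1))


def main() -> list[int]:
    # asyncio.run() creates a fresh event loop, runs the coroutine to
    # completion, and closes the loop. It raises RuntimeError when another
    # loop is already running in the same thread, which is why notebooks
    # need a different entry point.
    return asyncio.run(crawl_stub(max_pages=3))


pages = main()
```

    In a script version of this tutorial, the body of main() would call crawl_quotes_http and then work with the returned DataFrame.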

    In conclusion, by combining Google Colab’s zero-config environment with Python’s asynchronous ecosystem and Crawl4AI’s flexible crawling strategies, we have now developed a fully automated pipeline for scraping and structuring web data in minutes. Whether you need to spin up a quick dataset of quotes, build a refreshable news‑article archive, or power a RAG workflow, Crawl4AI’s blend of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy delivers both simplicity and scalability. Beyond pure HTTP crawls, you can instantly pivot to Playwright‑driven browser automation without rewriting your extraction logic, underscoring why Crawl4AI stands out as the go‑to framework for modern, production‑ready web data extraction.



    The post A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows appeared first on MarkTechPost.
