
    A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows

    April 24, 2025

    In this tutorial, we show how to use Crawl4AI, a Python-based web crawling toolkit, to extract structured data from web pages directly within Google Colab. Using asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI's built-in AsyncHTTPCrawlerStrategy, we avoid the overhead of a headless browser while still parsing HTML via JsonCssExtractionStrategy. In a few lines of code, we install the dependencies (crawl4ai, httpx), configure HTTPCrawlerConfig to request only gzip/deflate (avoiding Brotli issues), define a CSS-to-JSON schema, and orchestrate the crawl through AsyncWebCrawler and CrawlerRunConfig. Finally, we load the extracted JSON into pandas for analysis or export.

    What sets Crawl4AI apart is its unified API, which seamlessly switches between browser-based (Playwright) and HTTP-only strategies, its robust error-handling hooks, and its declarative extraction schemas. Unlike traditional headless-browser workflows, Crawl4AI allows you to choose the most lightweight and performant backend, making it ideal for scalable data pipelines, on-the-fly ETL in notebooks, or feeding LLMs and analytics tools with clean JSON/CSV outputs.

    !pip install -U crawl4ai httpx

    First, we install (or upgrade) Crawl4AI, the core asynchronous crawling framework, alongside HTTPX, a high-performance HTTP client. Together they provide the building blocks we need for lightweight, asynchronous web scraping directly in Colab.

    import asyncio, json, pandas as pd
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
    from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    We import asyncio for concurrency, json for parsing, and pandas for tabular storage, alongside Crawl4AI's essentials: AsyncWebCrawler to drive the crawl, CrawlerRunConfig and HTTPCrawlerConfig to configure extraction and HTTP settings, AsyncHTTPCrawlerStrategy for a browser-free HTTP backend, and JsonCssExtractionStrategy to map CSS selectors into structured JSON.

    http_cfg = HTTPCrawlerConfig(
        method="GET",
        headers={
            "User-Agent":      "crawl4ai-bot/1.0",
            "Accept-Encoding": "gzip, deflate"
        },
        follow_redirects=True,
        verify_ssl=True
    )
    crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)
    

    Here, we instantiate an HTTPCrawlerConfig to define our HTTP crawler’s behavior, using a GET request with a custom User-Agent, gzip/deflate encoding only, automatic redirects, and SSL verification. We then plug that into AsyncHTTPCrawlerStrategy, allowing Crawl4AI to drive the crawl via pure HTTP calls rather than a full browser.
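
    To illustrate why restricting Accept-Encoding to gzip and deflate can help, here is a minimal sketch (not Crawl4AI's own decoding logic) showing that both encodings can be decoded with Python's standard library alone, whereas Brotli ("br") requires a third-party package. The decode_body helper is a hypothetical name introduced only for this example.

```python
import gzip
import zlib


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Decode a response body using only standard-library codecs.

    Hypothetical helper for illustration; not part of Crawl4AI.
    """
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        # HTTP "deflate" is normally the zlib container format
        return zlib.decompress(body)
    if encoding in ("", "identity"):
        return body
    # Anything else (for example "br" for Brotli) has no stdlib decoder
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")


payload = b"<div class='quote'>placeholder</div>"
print(decode_body(gzip.compress(payload), "gzip") == payload)
print(decode_body(zlib.compress(payload), "deflate") == payload)
```

    Asking the server for only these two encodings means the response should never arrive in a format the client cannot decode.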

    schema = {
        "name": "Quotes",
        "baseSelector": "div.quote",
        "fields": [
            {"name": "quote",  "selector": "span.text",      "type": "text"},
            {"name": "author", "selector": "small.author",   "type": "text"},
            {"name": "tags",   "selector": "div.tags a.tag", "type": "text"}
        ]
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
    run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)
    

    We define a JSON-CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), then initialize a JsonCssExtractionStrategy with that schema and wrap it in a CrawlerRunConfig so Crawl4AI knows what structured data to pull on each request. Note that tags is declared with type "text", so it may capture only the first matching a.tag element per quote rather than the full list of tags.
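
    Since the schema's field names become the keys of each extracted record, one simple sanity check is to compare decoded records against the schema. The sketch below assumes the extracted content decodes to a list of dictionaries, as the crawl function later in this tutorial expects. The missing_fields helper and the sample payload are hypothetical placeholders, not real crawl output.

```python
import json

schema = {
    "name": "Quotes",
    "baseSelector": "div.quote",
    "fields": [
        {"name": "quote", "selector": "span.text", "type": "text"},
        {"name": "author", "selector": "small.author", "type": "text"},
        {"name": "tags", "selector": "div.tags a.tag", "type": "text"},
    ],
}


def missing_fields(schema, records):
    """Return (record index, sorted missing field names) for incomplete records."""
    expected = {field["name"] for field in schema["fields"]}
    problems = []
    for index, record in enumerate(records):
        missing = sorted(expected - set(record))
        if missing:
            problems.append((index, missing))
    return problems


# Hypothetical payload standing in for res.extracted_content (placeholder values)
sample_json = json.dumps([
    {"quote": "placeholder quote", "author": "placeholder author", "tags": "placeholder"},
    {"quote": "another placeholder", "author": "placeholder author"},
])
records = json.loads(sample_json)
print(missing_fields(schema, records))
```

    A check like this can flag pages where a selector stopped matching before the gaps show up as missing values in the DataFrame.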

    async def crawl_quotes_http(max_pages=5):
        """Crawl up to max_pages pages of quotes and return them as a DataFrame."""
        all_items = []
        async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
            for p in range(1, max_pages + 1):
                url = f"https://quotes.toscrape.com/page/{p}/"

                # Fetch the page and run the extraction; skip the page on any error
                try:
                    res = await crawler.arun(url=url, config=run_cfg)
                except Exception as e:
                    print(f"❌ Page {p} failed outright: {e}")
                    continue

                # Nothing was extracted, so there is nothing to parse
                if not res.extracted_content:
                    print(f"❌ Page {p} returned no content, skipping")
                    continue

                # extracted_content is a JSON string; decode it into a list of records
                try:
                    items = json.loads(res.extracted_content)
                except Exception as e:
                    print(f"❌ Page {p} JSON-parse error: {e}")
                    continue

                print(f"✅ Page {p}: {len(items)} quotes")
                all_items.extend(items)

        return pd.DataFrame(all_items)

    This asynchronous function orchestrates the HTTP-only crawl. It spins up an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, and awaits crawler.arun() inside a try block. Pages that fail to load, return no content, or produce unparseable JSON are skipped, and the extracted quote records from the remaining pages are collected into a single pandas DataFrame for downstream analysis.
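
    The skip-on-failure pattern can be tried without network access. The sketch below replaces the real crawler with a hypothetical stub, fake_arun, that simulates one network failure and one malformed payload. It also records which pages failed instead of only printing them, a small variation on the function above that makes retries easier. All names and data here are illustrative assumptions.

```python
import asyncio
import json


async def fake_arun(page: int) -> str:
    """Hypothetical stand-in for crawler.arun(); returns a JSON string."""
    await asyncio.sleep(0)  # yield control, as a real request would
    if page == 2:
        raise ConnectionError("simulated network failure")
    if page == 3:
        return "not json"  # simulated malformed payload
    return json.dumps([{"quote": f"placeholder quote from page {page}"}])


async def collect(max_pages: int):
    """Collect records page by page, remembering which pages failed and why."""
    items, failed = [], []
    for page in range(1, max_pages + 1):
        try:
            raw = await fake_arun(page)
            items.extend(json.loads(raw))
        except (ConnectionError, json.JSONDecodeError) as exc:
            failed.append((page, type(exc).__name__))
    return items, failed


items, failed = asyncio.run(collect(4))
print(len(items), failed)
```

    Returning the failed pages alongside the data lets the caller decide whether to retry them or accept a partial result.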

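    # Note: if the notebook's event loop is already running, run_until_complete may raise
    # RuntimeError; in that case `df = await crawl_quotes_http(max_pages=3)` is the alternative.
    df = asyncio.get_event_loop().run_until_complete(crawl_quotes_http(max_pages=3))
    df.head()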
    

    Finally, we kick off the crawl_quotes_http coroutine on Colab’s existing asyncio loop, fetching three pages of quotes, and then display the first few rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected.
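
    How a coroutine is started depends on whether an event loop is already running. As a minimal sketch, assuming a plain Python script rather than a notebook, the hypothetical run_sync helper below uses asyncio.run when no loop is active and refuses to nest loops otherwise. The demo_crawl coroutine is a placeholder standing in for crawl_quotes_http.

```python
import asyncio


async def demo_crawl() -> list:
    """Hypothetical stand-in for crawl_quotes_http()."""
    await asyncio.sleep(0)
    return [{"quote": "placeholder"}]


def run_sync(coro):
    """Run a coroutine to completion when no event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No running loop (plain script): asyncio.run creates and closes one
        return asyncio.run(coro)
    # A loop is already running (common in notebooks): do not nest loops
    coro.close()
    raise RuntimeError("event loop already running; use `await` directly")


rows = run_sync(demo_crawl())
print(rows)
```

    In a notebook cell, awaiting the coroutine directly at the top level is usually the simpler option.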

    In conclusion, by combining Google Colab's zero-config environment with Python's asynchronous ecosystem and Crawl4AI's flexible crawling strategies, we have built an automated pipeline for scraping and structuring web data. Whether you need a quick dataset of quotes, a refreshable news-article archive, or input for a RAG workflow, the combination of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy keeps the code short while leaving room to scale. Beyond pure HTTP crawls, you can switch to Playwright-driven browser automation without rewriting your extraction logic, which makes Crawl4AI a strong option for web data extraction.



    The post A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows appeared first on MarkTechPost.
