
    A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows

    April 24, 2025

    In this tutorial, we demonstrate how to harness Crawl4AI, a modern, Python‑based web crawling toolkit, to extract structured data from web pages directly within Google Colab. Leveraging the power of asyncio for asynchronous I/O, httpx for HTTP requests, and Crawl4AI’s built‑in AsyncHTTPCrawlerStrategy, we bypass the overhead of headless browsers while still parsing complex HTML via JsonCssExtractionStrategy. With just a few lines of code, you install dependencies (crawl4ai, httpx), configure HTTPCrawlerConfig to request only gzip/deflate (avoiding Brotli issues), define your CSS‑to‑JSON schema, and orchestrate the crawl through AsyncWebCrawler and CrawlerRunConfig. Finally, the extracted JSON data is loaded into pandas for immediate analysis or export. 

    What sets Crawl4AI apart is its unified API, which seamlessly switches between browser-based (Playwright) and HTTP-only strategies, its robust error-handling hooks, and its declarative extraction schemas. Unlike traditional headless-browser workflows, Crawl4AI allows you to choose the most lightweight and performant backend, making it ideal for scalable data pipelines, on-the-fly ETL in notebooks, or feeding LLMs and analytics tools with clean JSON/CSV outputs.

    !pip install -U crawl4ai httpx

    First, we install (or upgrade) Crawl4AI, the core asynchronous crawling framework, alongside HTTPX, a high-performance HTTP client. Together they provide the building blocks we need for lightweight, asynchronous web scraping directly in Colab.

    import asyncio, json, pandas as pd
    from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, HTTPCrawlerConfig
    from crawl4ai.async_crawler_strategy import AsyncHTTPCrawlerStrategy
    from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

    We bring in Python’s core async and data‑handling modules: asyncio for concurrency, json for parsing, and pandas for tabular storage. Alongside them come Crawl4AI’s essentials: AsyncWebCrawler to drive the crawl, CrawlerRunConfig and HTTPCrawlerConfig to configure extraction and HTTP settings, AsyncHTTPCrawlerStrategy for a browser‑free HTTP backend, and JsonCssExtractionStrategy to map CSS selectors into structured JSON.

    http_cfg = HTTPCrawlerConfig(
        method="GET",
        headers={
            "User-Agent":      "crawl4ai-bot/1.0",
            "Accept-Encoding": "gzip, deflate"
        },
        follow_redirects=True,
        verify_ssl=True
    )
    crawler_strategy = AsyncHTTPCrawlerStrategy(browser_config=http_cfg)
    

    Here, we instantiate an HTTPCrawlerConfig to define our HTTP crawler’s behavior, using a GET request with a custom User-Agent, gzip/deflate encoding only, automatic redirects, and SSL verification. We then plug that into AsyncHTTPCrawlerStrategy, allowing Crawl4AI to drive the crawl via pure HTTP calls rather than a full browser.
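    The Accept-Encoding restriction is the key detail here: servers choose the response compression from that header, so advertising only gzip and deflate guarantees the body can be decoded with Python’s standard library alone, with no Brotli dependency to trip over. A minimal stdlib sketch of that round trip, using a hypothetical response body:

```python
import gzip
import zlib

# A hypothetical response body, as a server might compress it.
body = b'<div class="quote"><span class="text">Hello</span></div>'

# gzip-encoded responses decode with the stdlib gzip module...
assert gzip.decompress(gzip.compress(body)) == body

# ...and deflate-encoded ones with zlib; a Brotli ("br") body would
# need a third-party package, which is exactly what we avoid.
assert zlib.decompress(zlib.compress(body)) == body

print("gzip/deflate bodies decode with the stdlib alone")
```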

    schema = {
        "name": "Quotes",
        "baseSelector": "div.quote",
        "fields": [
            {"name": "quote",  "selector": "span.text",      "type": "text"},
            {"name": "author", "selector": "small.author",   "type": "text"},
            {"name": "tags",   "selector": "div.tags a.tag", "type": "text"}
        ]
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=False)
    run_cfg = CrawlerRunConfig(extraction_strategy=extraction_strategy)
    

    We define a JSON‑CSS extraction schema targeting each quote block (div.quote) and its child elements (span.text, small.author, div.tags a.tag), initialize a JsonCssExtractionStrategy with that schema, and wrap it in a CrawlerRunConfig so Crawl4AI knows exactly what structured data to pull on each request.
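    To see what shape this schema produces, the toy parser below emulates the idea on a hypothetical quote snippet: walk each baseSelector match, fill the named fields, and emit a JSON list of flat records. It uses only the stdlib html.parser and is not Crawl4AI’s actual extractor — just a sketch of the field-to-record mapping.

```python
import json
from html.parser import HTMLParser

# Hypothetical snippet mirroring quotes.toscrape.com's markup.
SNIPPET = """
<div class="quote">
  <span class="text">“Simplicity is the ultimate sophistication.”</span>
  <small class="author">Leonardo da Vinci</small>
  <div class="tags"><a class="tag">simplicity</a></div>
</div>
"""

class QuoteParser(HTMLParser):
    """Toy stand-in for JsonCssExtractionStrategy: classes map to field names."""
    FIELD_BY_CLASS = {"text": "quote", "author": "author", "tag": "tags"}

    def __init__(self):
        super().__init__()
        self.records, self._current, self._field = [], None, None

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        if tag == "div" and "quote" in classes:  # baseSelector match
            self._current = {}
        for cls, field in self.FIELD_BY_CLASS.items():
            if cls in classes and self._current is not None:
                self._field = field

    def handle_data(self, data):
        if self._field and data.strip():
            self._current[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None and len(self._current) == 3:
            self.records.append(self._current)
            self._current = None

parser = QuoteParser()
parser.feed(SNIPPET)
print(json.dumps(parser.records, ensure_ascii=False))
```

    Each record is a flat dict keyed by the schema’s field names, which is exactly the JSON list that res.extracted_content carries back from a real crawl.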

    async def crawl_quotes_http(max_pages=5):
        all_items = []
        async with AsyncWebCrawler(crawler_strategy=crawler_strategy) as crawler:
            for p in range(1, max_pages + 1):
                url = f"https://quotes.toscrape.com/page/{p}/"
                try:
                    res = await crawler.arun(url=url, config=run_cfg)
                except Exception as e:
                    print(f"❌ Page {p} failed outright: {e}")
                    continue

                if not res.extracted_content:
                    print(f"❌ Page {p} returned no content, skipping")
                    continue

                try:
                    items = json.loads(res.extracted_content)
                except Exception as e:
                    print(f"❌ Page {p} JSON‑parse error: {e}")
                    continue

                print(f"✅ Page {p}: {len(items)} quotes")
                all_items.extend(items)

        return pd.DataFrame(all_items)

    This asynchronous function orchestrates the HTTP‑only crawl: it spins up an AsyncWebCrawler with our AsyncHTTPCrawlerStrategy, iterates through each page URL, safely awaits crawler.arun(), handles any request or JSON‑parsing errors, and collects the extracted quote records into a single pandas DataFrame for downstream analysis.
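    The skip-and-continue error handling is worth isolating: each page either contributes its records or is dropped with a diagnostic, so one bad response never aborts the whole crawl. A minimal synchronous sketch of the same pattern over hypothetical page payloads:

```python
import json

def collect_records(pages):
    """Parse each page's JSON payload, skipping failures, as the crawl loop does."""
    all_items = []
    for i, payload in enumerate(pages, start=1):
        if not payload:
            print(f"Page {i} returned no content, skipping")
            continue
        try:
            items = json.loads(payload)
        except json.JSONDecodeError as e:
            print(f"Page {i} JSON-parse error: {e}")
            continue
        all_items.extend(items)
    return all_items

# Hypothetical payloads: one good page, one empty, one corrupt, one good.
pages = ['[{"quote": "a"}]', "", "not json", '[{"quote": "b"}]']
assert collect_records(pages) == [{"quote": "a"}, {"quote": "b"}]
```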

    # Colab's kernel already runs an event loop, so use a top-level await
    # rather than run_until_complete (which would raise "event loop is already running").
    df = await crawl_quotes_http(max_pages=3)
    df.head()

    Finally, we kick off the crawl_quotes_http coroutine on Colab’s existing asyncio loop, fetching three pages of quotes, and then display the first few rows of the resulting pandas DataFrame to verify that our crawler returned structured data as expected.
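    From here the flat records export to CSV in one step. The sketch below uses the stdlib csv module on hypothetical records so it stays dependency-free; with the DataFrame above, df.to_csv("quotes.csv", index=False) achieves the same thing.

```python
import csv
import io

# Hypothetical records in the shape the extraction schema produces.
records = [
    {"quote": "a", "author": "x", "tags": "t1"},
    {"quote": "b", "author": "y", "tags": "t2"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["quote", "author", "tags"])
writer.writeheader()
writer.writerows(records)

csv_text = buf.getvalue()
print(csv_text)
```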

    In conclusion, by combining Google Colab’s zero-config environment with Python’s asynchronous ecosystem and Crawl4AI’s flexible crawling strategies, we have now developed a fully automated pipeline for scraping and structuring web data in minutes. Whether you need to spin up a quick dataset of quotes, build a refreshable news‑article archive, or power a RAG workflow, Crawl4AI’s blend of httpx, asyncio, JsonCssExtractionStrategy, and AsyncHTTPCrawlerStrategy delivers both simplicity and scalability. Beyond pure HTTP crawls, you can instantly pivot to Playwright‑driven browser automation without rewriting your extraction logic, underscoring why Crawl4AI stands out as the go‑to framework for modern, production‑ready web data extraction.


    Here is the Colab Notebook.

    The post A Coding Guide to Asynchronous Web Data Extraction Using Crawl4AI: An Open-Source Web Crawling and Scraping Toolkit Designed for LLM Workflows appeared first on MarkTechPost.
