HtmlRAG: Enhancing RAG Systems with Richer Semantic and Structural Information through HTML

Retrieval-augmented generation (RAG) has been shown to improve the knowledge capabilities of LLMs and reduce hallucination. The Web is a major source of external knowledge for RAG, including in commercial systems such as ChatGPT. However, current RAG implementations face a fundamental challenge in how they process that knowledge. The conventional method of converting HTML documents into plain text before feeding them to LLMs loses a substantial amount of structural and semantic information. This limitation is most evident with complex web content like tables, where the conversion disrupts the original layout and discards HTML tags that carry important contextual information.
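The information loss is easy to see on a small table. The following is a minimal sketch, not taken from the paper: it uses Python's standard-library `html.parser` as a stand-in for whatever HTML-to-text converter a pipeline might use, and the `TextExtractor` class, `html_to_text` function, and sample table are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Naive HTML-to-text conversion: keep text nodes, discard every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample table: header cells (<th>) and row boundaries (<tr>)
# are the structure that the conversion throws away.
table_html = (
    "<table>"
    "<tr><th>Model</th><th>Context</th></tr>"
    "<tr><td>A</td><td>8k</td></tr>"
    "<tr><td>B</td><td>128k</td></tr>"
    "</table>"
)

plain = html_to_text(table_html)
print(plain)  # Model Context A 8k B 128k
```

In the flattened string nothing marks which tokens were headers or where one row ends and the next begins, which is the kind of structure the HTML tags had encoded.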

Existing work on enhancing RAG systems has focused on various components and frameworks. Traditional RAG pipelines combine query rewriters, retrievers, re-rankers, refiners, and readers, as implemented in frameworks like LangChain and LlamaIndex. Post-retrieval processing has been explored through chunking-based and abstractive refiners that optimize the content sent to LLMs. Moreover, research on structured data understanding has shown that HTML and Excel tables carry richer information than plain text. However, these solutions fall short on HTML content: traditional chunking methods cannot effectively handle HTML structure, and abstractive refiners struggle with long HTML content and incur high computational costs.

Researchers from the Gaoling School of Artificial Intelligence, Renmin University of China, and Baichuan Intelligent Technology, China have proposed HtmlRAG, a method that uses HTML instead of plain text as the format of retrieved knowledge in RAG systems, preserving semantic and structural information that plain text loses. The method builds on recent advances in LLMs' context window capabilities and on the versatility of HTML as a format that can accommodate document types such as LaTeX, PDF, and Word with minimal information loss. The researchers also identified significant challenges in implementing this approach, particularly the extensive token length of raw HTML documents and the noise from CSS styles, JavaScript, and comments, which make up over 90% of the tokens.
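A rough sketch of what removing that noise can look like is shown below. It is not the authors' cleaning code: it uses the standard-library `html.parser`, the names `HtmlCleaner` and `clean_html` are hypothetical, and dropping every tag attribute (which also removes inline `style` values) is an assumption of this sketch rather than something the article specifies.

```python
from html.parser import HTMLParser

NOISE_TAGS = {"script", "style"}  # elements whose whole content is dropped


class HtmlCleaner(HTMLParser):
    """Re-emits the HTML without scripts, stylesheets, comments, or attributes."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # > 0 while inside a <script> or <style> element

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            # Assumption of this sketch: all attributes are discarded.
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-only text is dropped; other text is kept unchanged.
        if self.skip_depth == 0 and data.strip():
            self.out.append(data)

    # handle_comment is not overridden, so HTML comments are silently dropped.


def clean_html(html: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)


raw = (
    "<html><head><style>p{color:red}</style>"
    '<script>console.log("x")</script></head>'
    '<body><!-- nav --><div class="c" id="main">'
    '<p style="x">Hello</p></div></body></html>'
)
cleaned = clean_html(raw)
print(cleaned)  # <html><head></head><body><div><p>Hello</p></div></body></html>
print(len(cleaned) < len(raw))  # True
```

The tag skeleton and the visible text survive, while the script, stylesheet, comment, and attributes are gone. Void elements such as `<br>` are not special-cased here.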

HtmlRAG implements a two-step pruning mechanism to process retrieved HTML documents efficiently. First, the system concatenates all retrieved HTML documents and parses them into a single DOM tree using Beautiful Soup. Because a traditional DOM tree is too fine-grained to process cheaply, the researchers developed an optimized "block tree" structure whose granularity is controlled by a maxWords parameter. Block tree construction recursively merges fragmented child nodes into their parent nodes, creating larger blocks as long as each block stays within the word limit. Pruning then runs in two phases: an embedding model first processes the cleaned HTML, and a generative model then refines the result further.
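The merging idea can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it uses the standard-library `html.parser` instead of Beautiful Soup to stay dependency-free, it does not track the tag path of each block, and `Node`, `TreeBuilder`, `build_blocks`, `split_into_blocks`, and `max_words` are hypothetical names. A subtree becomes a single block when its total word count fits within `max_words`; otherwise the sketch descends into its children.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # items are Node or str


class TreeBuilder(HTMLParser):
    """Builds a bare element tree; void tags such as <br> are not handled."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data.strip())


def word_count(node) -> int:
    if isinstance(node, str):
        return len(node.split())
    return sum(word_count(child) for child in node.children)


def to_html(node) -> str:
    if isinstance(node, str):
        return node
    inner = "".join(to_html(child) for child in node.children)
    return f"<{node.tag}>{inner}</{node.tag}>"


def build_blocks(node, max_words: int) -> list:
    """Merge a subtree into one block if it fits, else recurse into children."""
    has_element_children = isinstance(node, Node) and any(
        isinstance(child, Node) for child in node.children
    )
    if word_count(node) <= max_words or not has_element_children:
        return [to_html(node)]  # text-only nodes over the limit stay whole
    blocks = []
    for child in node.children:
        blocks.extend(build_blocks(child, max_words))
    return blocks


def split_into_blocks(html: str, max_words: int) -> list:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    blocks = []
    for child in builder.root.children:
        blocks.extend(build_blocks(child, max_words))
    return blocks


doc = (
    "<div><h1>Title here</h1><p>One two three</p>"
    "<ul><li>alpha beta</li><li>gamma delta</li></ul></div>"
)
print(len(split_into_blocks(doc, 10)))  # 1 (the whole <div> fits)
print(len(split_into_blocks(doc, 4)))   # 3 (<h1>, <p>, <ul>)
print(len(split_into_blocks(doc, 2)))   # 4 (<h1>, <p>, and each <li>)
```

Raising `max_words` yields fewer, coarser blocks and lowering it yields more, finer ones, which is the adjustable granularity described above. In a full pipeline each block would then be scored against the query and the low-scoring ones pruned.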

The results show that HtmlRAG outperforms the baseline methods on all evaluation metrics across six datasets. Chunking-based refiners that follow LangChain's approach make limited use of structural information compared with HtmlRAG. Among re-rankers, the dense retrievers outperformed the sparse retriever BM25, and the encoder-based BGE did better than the decoder-based e5-mistral. The abstractive refiners show notable limitations: LongLLMLingua struggles to optimize HTML documents and loses structural information when they are converted to plain text, while JinaAI-reader, although it generates refined Markdown from HTML input, is held back by token-by-token decoding and high computational demands on long sequences.

In conclusion, researchers have introduced an approach called HtmlRAG that uses HTML as the format of retrieved knowledge in RAG systems to preserve rich semantic and structured information not present in plain text. The implemented HTML cleaning and pruning techniques effectively manage token length while preserving essential structural and semantic information. HtmlRAG's superior performance compared to traditional plain-text-based post-retrieval processes validates the effectiveness of utilizing HTML format for knowledge retrieval. The researchers provide an immediate practical solution and establish a promising new direction for future developments in RAG systems, encouraging further innovations in HTML-based knowledge retrieval and processing methods.

Check out the Paper. All credit for this research goes to the researchers of this project.

The post HtmlRAG: Enhancing RAG Systems with Richer Semantic and Structural Information through HTML appeared first on MarkTechPost.
