HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

Abstract

Retrieval-Augmented Generation (RAG) has been shown to improve the knowledge capabilities of LLMs and to alleviate their hallucination problem. The Web is a major source of external knowledge for RAG systems, and many commercial systems, such as ChatGPT and Perplexity, use Web search engines as their main retrieval systems. Typically, such RAG systems retrieve search results, download the HTML sources of the results, and then extract plain text from those sources. The plain-text documents or chunks are fed into the LLM to augment generation. However, much of the structural and semantic information inherent in HTML, such as headings and table structures, is lost during this plain-text-based RAG process. To alleviate this problem, we propose HtmlRAG, which uses HTML instead of plain text as the format of retrieved knowledge in RAG. We believe HTML is better than plain text for modeling knowledge in external documents, and that most LLMs possess robust capacities to understand HTML. However, utilizing HTML presents new challenges. HTML contains additional content such as tags, JavaScript, and CSS specifications, which bring extra input tokens and noise to the RAG system. To address this issue, we propose HTML cleaning, compression, and pruning strategies that shorten the HTML while minimizing the loss of information. Specifically, we design a two-step block-tree-based pruning method that prunes useless HTML blocks and keeps only the relevant part of the HTML. Experiments on six QA datasets confirm the superiority of using HTML in RAG systems.
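To make the cleaning step concrete, the following is a minimal sketch of what removing JavaScript, CSS, comments, and tag attributes from an HTML source might look like. It is an illustration only, not the implementation used in HtmlRAG: the names `HtmlCleaner` and `clean_html`, the choice of dropped tags, and the whitespace handling are all assumptions made for this example, and the block-tree-based pruning step is not covered.

```python
import re
from html import escape
from html.parser import HTMLParser

# Elements whose entire contents are discarded (an assumption for this sketch).
DROP_TAGS = {"script", "style"}
# Void elements carry no text once attributes are removed, so they are skipped.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class HtmlCleaner(HTMLParser):
    """Keeps tag names and text; drops attributes, comments, scripts, styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are intentionally omitted

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag not in VOID_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Collapse whitespace runs; whitespace-only text is dropped entirely,
        # which is a simplification that can remove spaces between inline tags.
        text = re.sub(r"\s+", " ", data)
        if self.skip_depth == 0 and text.strip():
            self.parts.append(escape(text, quote=False))

    # Comments are dropped because the default handle_comment does nothing.


def clean_html(source: str) -> str:
    """Return a shortened HTML string that preserves structure and text."""
    cleaner = HtmlCleaner()
    cleaner.feed(source)
    cleaner.close()  # flush any buffered text
    return "".join(cleaner.parts)


raw = (
    '<div class="a"><script>var x = 1;</script><style>p{color:red}</style>'
    '<!-- note --><h1 id="t">Title</h1><p>Hello &amp; bye</p></div>'
)
cleaned = clean_html(raw)
print(cleaned)
```

In this sketch the headings and paragraph structure survive, while the script, style block, comment, and attributes, which account for most of the extra characters in `raw`, are removed.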
