Revisiting Text Ranking in Deep Research

Abstract

Deep research has emerged as an important task that aims to address hard queries through extensive open-web exploration. To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it. Despite the essential role of search in deep research, black-box web search APIs hinder systematic analysis of search components, leaving the behaviour of established text ranking methods in deep research largely unclear. To fill this gap, we reproduce a selection of key findings and best practices for IR text ranking methods in the deep research setting. In particular, we examine their effectiveness from three perspectives: (i) retrieval units (documents vs. passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers). We perform experiments on BrowseComp-Plus, a deep research dataset with a fixed corpus, evaluating 2 open-source agents, 5 retrievers, and 3 re-rankers across diverse setups. We find that (1) agent-issued queries typically follow web-search-style syntax (e.g., quoted exact matches), which favours lexical, learned sparse, and multi-vector retrievers; (2) passage-level units are more efficient under limited context windows and avoid the difficulties of document length normalisation in lexical retrieval; (3) re-ranking is highly effective; and (4) translating agent-issued queries into natural-language questions significantly reduces the query mismatch.
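
As a rough illustration of the pipeline configurations mentioned above (a first-stage retriever, a re-ranker, and a re-ranking depth), the following is a minimal sketch and not the authors' implementation. The function `retrieve_then_rerank`, the toy scorers `term_overlap` and `exact_phrase`, and the example passages are all hypothetical stand-ins chosen for illustration; the retrievers and re-rankers evaluated in the paper are not named in this abstract.

```python
from typing import Callable, List

# A scorer maps (query, passage) to a relevance score; higher is better.
Scorer = Callable[[str, str], float]


def term_overlap(query: str, text: str) -> float:
    """Toy first-stage scorer (assumption): count distinct query terms present in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return float(len(q_terms & t_terms))


def exact_phrase(query: str, text: str) -> float:
    """Toy re-ranker (assumption): 1.0 if the query occurs verbatim in the text, else 0.0."""
    return 1.0 if query.lower() in text.lower() else 0.0


def retrieve_then_rerank(
    query: str,
    passages: List[str],
    first_stage: Scorer,
    reranker: Scorer,
    depth: int,
    k: int,
) -> List[str]:
    """Keep the top `depth` passages from the first stage, re-score them, return the top `k`."""
    # First stage: cheap scoring over the whole corpus, truncated to the re-ranking depth.
    candidates = sorted(passages, key=lambda p: first_stage(query, p), reverse=True)[:depth]
    # Second stage: the re-ranker only ever sees those `depth` candidates.
    reranked = sorted(candidates, key=lambda p: reranker(query, p), reverse=True)
    return reranked[:k]


passages = [
    "ranking text with deep models",
    "a survey of text ranking methods",
    "deep research agents browse the web",
]

# With depth=2 the re-ranker can promote the passage containing the exact phrase.
top_deep = retrieve_then_rerank("text ranking", passages, term_overlap, exact_phrase, depth=2, k=1)
# With depth=1 the re-ranker never sees that passage, so it cannot recover it.
top_shallow = retrieve_then_rerank("text ranking", passages, term_overlap, exact_phrase, depth=1, k=1)
```

In this toy setup the two calls return different top passages, which is the sense in which re-ranking depth is a pipeline parameter: a re-ranker can only reorder what the first stage passes to it.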
