SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Abstract

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools, such as o3 and o4-mini, achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
