SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

June 1, 2025
Authors: Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu
cs.AI

Abstract

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
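Since the benchmark is released on the Hugging Face Hub, a minimal sketch of how one might load it with the `datasets` library is shown below. The configuration name (`seal_0`), split name, and field names used here are assumptions for illustration; consult the dataset card at huggingface.co/datasets/vtllms/sealqa for the actual schema.

```python
# Minimal sketch: loading SealQA from the Hugging Face Hub.
# Assumptions: a "seal_0" configuration, a "test" split, and
# "question"/"answer" fields -- verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("vtllms/sealqa", "seal_0", split="test")

# Inspect a few examples to see the question/answer format.
for example in ds.select(range(3)):
    print(example["question"])
    print(example["answer"])
```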
