Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems still work when the relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the language of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language; in the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and performance under oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower than with the original English evidence even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.
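
To make the multilingual setting concrete, the equal and random distribution of the corpus across languages can be sketched as a shuffle followed by a round-robin assignment. This is a minimal illustration under stated assumptions, not the benchmark's actual construction code: the function name `assign_languages`, the seed, the document IDs, and the placeholder language labels are all hypothetical, since the abstract does not name the 12 languages or describe the procedure in detail.

```python
import random
from collections import Counter


def assign_languages(doc_ids, languages, seed=0):
    """Assign one language per document so that language counts differ by at most one.

    Hypothetical helper: shuffles the documents with a seeded RNG, then
    deals them out to the languages in round-robin order.
    """
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    return {
        doc_id: languages[i % len(languages)]
        for i, doc_id in enumerate(shuffled)
    }


# Placeholder labels standing in for the 12 languages (assumption).
languages = [f"lang_{i:02d}" for i in range(12)]
# Toy corpus of 100 document IDs (assumption).
doc_ids = [f"doc_{i}" for i in range(100)]

assignment = assign_languages(doc_ids, languages, seed=13)
counts = Counter(assignment.values())
```

Shuffling before the round-robin pass keeps the split random with respect to document order while guaranteeing that no language receives more than one document more than any other, which is one plausible reading of "equally and randomly".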