Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-lingual BrowseComp-Plus

Abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems still work when the relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the language of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language; in the multilingual setting, the full evidence corpus is distributed evenly and at random across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and performance under oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.