OpenBioRQ: Unsolved Biomedical Research Questions for Agents

Abstract

A working citation looks like proof, but a link that resolves does not show that the cited paper supports the claim. I find that current agentic models rarely fabricate citations (over 99\% resolve), yet roughly 15.9\% of citations point to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. I introduce \openbiorq{}, a retrieval-grounded agentic benchmark of 12{,}553 unsolved biomedical research questions across 12 domains that treats open questions as a probe of faithfulness and abstention. To my knowledge, this is the first biomedical benchmark to combine an agentic setting, in which the model must issue multiple tool calls, with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is empirical: I anchor it on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only about 17\%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) solve between 29\% and 60\%. The benchmark is thus hard, non-saturating (the best agent still leaves about 33--40\% unsolved), and discriminating across capability tiers. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access entirely barely changes its score, so tools stop paying off exactly where they are needed most. Freezing a per-question checklist raises inter-judge agreement from a Spearman correlation of 0.35 to 0.82.
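
To make the distinction between a citation that resolves and a citation that supports its claim concrete, the following is a minimal sketch of how the two rates could be tallied. The `Citation` record, its field names, and the `citation_rates` helper are hypothetical and are not taken from the benchmark's code. The sketch also assumes that the wrong-paper rate is measured over all citations, which the abstract does not specify.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical record for one cited source in a model answer."""
    resolves: bool  # the link or identifier leads to a real paper
    supports: bool  # the resolved paper backs the claim (judged separately)


def citation_rates(citations):
    """Return (resolution_rate, wrong_paper_rate) over all citations.

    Assumption: a "wrong paper" is a citation that resolves to a real
    paper that does not support the claim it is attached to.
    """
    total = len(citations)
    if total == 0:
        return 0.0, 0.0
    resolved = [c for c in citations if c.resolves]
    wrong_paper = [c for c in resolved if not c.supports]
    return len(resolved) / total, len(wrong_paper) / total


# Toy data, not benchmark results: every link resolves, one is off-target.
sample = [
    Citation(resolves=True, supports=True),
    Citation(resolves=True, supports=True),
    Citation(resolves=True, supports=True),
    Citation(resolves=True, supports=False),
]
print(citation_rates(sample))  # (1.0, 0.25)
```

A resolution check alone would score this toy sample as perfect. The second number only appears once support is checked independently of resolution.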