OpenBioRQ: Unsolved Biomedical Research Questions for Agents

June 20, 2026
Author: Minbyul Jeong
cs.AI

Abstract

A working citation looks like proof, but a link that resolves does not show that the cited paper supports the claim. I find that current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% of citations point to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. I introduce OpenBioRQ, a retrieval-grounded agentic benchmark of 12,553 unsolved biomedical research questions across 12 domains that treats open questions as a faithfulness-and-abstention probe. To my knowledge, this is the first biomedical benchmark to combine an agentic setting, in which the model must issue multiple tool calls, with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is empirical: I anchor it on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only ~17% of questions, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span a wide 29-60% range. The benchmark is thus hard, non-saturating (the best agent still leaves ~33-40% unsolved), and discriminating across capability tiers. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access entirely barely changes its score, so tools stop paying off exactly where they are needed most. A frozen per-question checklist raises inter-judge agreement from a Spearman correlation of 0.35 to 0.82.
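The gap between a citation that resolves and a citation that supports its claim can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the benchmark's actual scoring code: the `Citation` record, its two boolean fields, and the choice to report the wrong-paper rate over all citations are hypothetical, since the abstract does not state the denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical record for one citation emitted by an agent."""
    resolves: bool  # the link or identifier points at a real paper
    supports: bool  # the cited paper actually backs the attached claim


def citation_rates(citations):
    """Return the resolve rate and the wrong-paper rate for a list of citations.

    Assumption: both rates use the total number of citations as the
    denominator. A wrong-paper citation is one that resolves but does
    not support the claim it is attached to.
    """
    total = len(citations)
    if total == 0:
        raise ValueError("no citations to score")
    resolved = [c for c in citations if c.resolves]
    wrong_paper = [c for c in resolved if not c.supports]
    return {
        "resolve_rate": len(resolved) / total,
        "wrong_paper_rate": len(wrong_paper) / total,
    }


# Toy data: every citation resolves, but one of four cites the wrong paper.
sample = [
    Citation(resolves=True, supports=True),
    Citation(resolves=True, supports=True),
    Citation(resolves=True, supports=True),
    Citation(resolves=True, supports=False),
]
rates = citation_rates(sample)
```

On this toy input the resolve rate is perfect while a quarter of the citations are still unfaithful, which is the pattern the abstract describes: a resolution check alone cannot detect the wrong-paper failure.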