OpenBioRQ: Unsolved Biomedical Research Questions for Agents

Abstract

A working citation looks like proof, but the fact that a link resolves does not mean the cited paper supports the claim. I find that current agentic models rarely fabricate citations (over 99% resolve), yet roughly 15.9% link to the wrong paper. Existing benchmarks miss this failure mode: when a question has a fixed answer key, a model can reproduce the expected source from that key rather than independently verifying that the source supports the claim. I introduce OpenBioRQ, a retrieval-grounded agentic benchmark of 12,553 unsolved biomedical research questions across 12 domains that treats open questions as a probe of faithfulness and abstention. To my knowledge, this is the first biomedical benchmark to combine an agentic setting, in which the model must issue multiple tool calls, with unsolved questions that have no answer key. Openness is verified against real follow-up evidence rather than a model's parametric knowledge. Difficulty is empirical: I anchor it on questions that three open-weight reference models fail to answer, rather than on subjective hardness labels. On this hardest subset, held-out models from the same lineage as the difficulty anchors solve only ~17%, while three independent frontier agents (Gemini-3-Pro, Opus-4.7, GPT-5.5) span a wide 29-60% range. The benchmark is thus hard, non-saturating (the best agent still leaves ~33-40% unsolved), and discriminating across capability tiers. Beyond difficulty, I observe agentic collapse on the hardest questions, where agents stop using their tools. For the most collapse-prone model, blocking tool access entirely barely changes its score, so tools stop paying off exactly where they are needed most. A frozen per-question checklist raises inter-judge agreement from a Spearman correlation of 0.35 to 0.82.
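
The gap between a citation that resolves and a citation that supports its claim can be made concrete with a small scoring sketch. This is an illustration only, not the benchmark's actual pipeline: the `CitationCheck` record, the `citation_rates` function, and the choice to divide the wrong-paper count by all citations are assumptions made here, and the toy data is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationCheck:
    """One audited citation (hypothetical record, not from the paper)."""
    resolves: bool  # the link or identifier leads to a real paper
    supports: bool  # a judge found that the paper backs the attached claim


def citation_rates(checks):
    """Report resolution and wrong-paper rates as separate numbers.

    Assumption: both rates use the total citation count as denominator.
    """
    total = len(checks)
    if total == 0:
        raise ValueError("no citations to score")
    resolved = [c for c in checks if c.resolves]
    # A "wrong paper" citation resolves but does not support the claim.
    wrong_paper = [c for c in resolved if not c.supports]
    return {
        "resolve_rate": len(resolved) / total,
        "wrong_paper_rate": len(wrong_paper) / total,
    }


# Toy data, invented for illustration: 7 good citations,
# 2 that resolve to an unrelated paper, and 1 dead link.
toy_checks = (
    [CitationCheck(resolves=True, supports=True)] * 7
    + [CitationCheck(resolves=True, supports=False)] * 2
    + [CitationCheck(resolves=False, supports=False)]
)
print(citation_rates(toy_checks))
```

Keeping the two rates separate is the point of the sketch: a resolution check alone would score the two wrong-paper citations in the toy data as successes.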