SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
January 15, 2026
Authors: Yiming Ren, Junjie Wang, Yuxin Meng, Yihang Shi, Zhiqiang Lin, Ruihang Chu, Yiran Xu, Ziming Li, Yunfei Zhao, Zihan Wang, Yu Qiao, Ruiming Tang, Minghao Liu, Yujiu Yang
cs.AI
Abstract
Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace through the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce a "No Evidence, No Score" mechanism that awards credit only when predictions are grounded to verifiable anchors, and diagnoses evidence quality along matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between answer correctness and traceable evidential support.
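The abstract does not specify how the "No Evidence, No Score" gate combines the matching, relevance, and logic diagnostics with answer correctness. The sketch below is one plausible reading, not the paper's actual scoring code: the class name, field names, equal weights, and the multiplicative gating are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceJudgement:
    """Per-prediction evidence diagnostics (names and scales are illustrative)."""
    matching: float   # does the cited anchor exist verbatim in the document? [0, 1]
    relevance: float  # is the anchor relevant to the question? [0, 1]
    logic: float      # does the anchor logically support the answer? [0, 1]


def no_evidence_no_score(
    answer_score: float,
    ev: Optional[EvidenceJudgement],
    weights: tuple = (1 / 3, 1 / 3, 1 / 3),  # assumed equal weighting
) -> float:
    """Gate the answer score on grounded evidence.

    A prediction with no verifiable anchor scores zero regardless of answer
    correctness; otherwise the answer score is modulated by an aggregate
    evidence-quality term (multiplicative gating is an assumption).
    """
    if ev is None or ev.matching == 0.0:
        return 0.0  # "No Evidence, No Score": ungrounded answers earn nothing
    w_m, w_r, w_l = weights
    evidence_quality = w_m * ev.matching + w_r * ev.relevance + w_l * ev.logic
    return answer_score * evidence_quality


if __name__ == "__main__":
    ev = EvidenceJudgement(matching=1.0, relevance=0.8, logic=0.6)
    print(no_evidence_no_score(0.9, ev))    # grounded: 0.9 * 0.8 = 0.72
    print(no_evidence_no_score(0.9, None))  # ungrounded: 0.0
```

Under this reading, a model like GPT-5 in the reported results can score highly on raw answer accuracy while its gated overall score collapses whenever its answers are not anchored to verifiable evidence.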