SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions
May 31, 2025
Authors: Weijie Xu, Shixian Cui, Xi Fang, Chi Xue, Stephanie Eckman, Chandan Reddy
cs.AI
Abstract
Large language models (LLMs) are increasingly evaluated on single-answer
multiple-choice tasks, yet many real-world problems require identifying all
correct answers from a set of options. This capability remains underexplored.
We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on
Select All That Apply (SATA) questions across diverse domains, including
reading comprehension, law, and biomedicine. Our evaluation of 27 open-source
and proprietary models reveals a significant gap: even the strongest model
achieves only 41.8% exact match, exposing LLMs' inability to reliably identify
all correct answers. We find that this weakness stems from two core challenges:
selection bias, where models favor certain choices regardless of content, and
count bias, where models fail to predict the correct number of answers. To address these
issues, we propose Choice Funnel, a decoding strategy that combines token
debiasing with adaptive thresholding to guide models toward complete and
accurate selections. Choice Funnel achieves up to 29% higher exact match than
competitive baselines while reducing inference cost by over 64%. Our findings
expose fundamental limitations in current LLMs and introduce a new framework
for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and
Choice Funnel to promote LLM development for robust decision-making in
realistic, multi-answer applications.
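
To make the exact-match metric mentioned in the abstract concrete, here is a minimal sketch, assuming each answer is represented as a set of option labels. The function names and the toy data are hypothetical, and the signed count difference is only one illustrative way to look at count bias; it is not taken from the paper and may differ from the authors' definition.

```python
from typing import Sequence, Set


def exact_match_rate(predictions: Sequence[Set[str]], golds: Sequence[Set[str]]) -> float:
    """Fraction of questions where the predicted set equals the gold set exactly."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(1 for pred, gold in zip(predictions, golds) if pred == gold)
    return hits / len(golds)


def mean_count_difference(predictions: Sequence[Set[str]], golds: Sequence[Set[str]]) -> float:
    """Average of (number selected - number correct); negative means under-selection.

    Illustrative proxy for count bias (an assumption, not the paper's metric).
    """
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    total = sum(len(pred) - len(gold) for pred, gold in zip(predictions, golds))
    return total / len(golds)


# Toy data (made up for illustration only).
predictions = [{"A", "C"}, {"B"}, {"A", "B", "D"}, {"C"}]
golds = [{"A", "C"}, {"B", "D"}, {"A", "B", "D"}, {"A", "C", "D"}]

em = exact_match_rate(predictions, golds)
count_diff = mean_count_difference(predictions, golds)
print(f"exact match: {em:.2f}, mean count difference: {count_diff:+.2f}")
```

In this toy case a prediction that picks only some of the correct options counts as a miss under exact match, which is why partially correct answers can still leave the score low.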