SATA-BENCH: Select All That Apply Benchmark for Multiple Choice Questions

Abstract

Large language models (LLMs) are increasingly evaluated on single-answer multiple-choice tasks, yet many real-world problems require identifying all correct answers from a set of options. This capability remains underexplored. We introduce SATA-BENCH, the first dedicated benchmark for evaluating LLMs on Select All That Apply (SATA) questions across diverse domains, including reading comprehension, law, and biomedicine. Our evaluation of 27 open-source and proprietary models reveals a significant gap: even the strongest model achieves only 41.8% exact match, exposing LLMs' inability to reliably identify all correct answers. We find that this weakness stems from two core challenges: selection bias, where models favor certain choices regardless of content, and count bias, where models fail to predict the correct number of answers. To address these issues, we propose Choice Funnel, a decoding strategy that combines token debiasing with adaptive thresholding to guide models toward complete and accurate selections. Choice Funnel achieves up to 29% higher exact match than competitive baselines while reducing inference cost by over 64%. Our findings expose fundamental limitations in current LLMs and introduce a new framework for diagnosing and improving multi-answer reasoning. We release SATA-BENCH and Choice Funnel to promote LLM development for robust decision-making in realistic, multi-answer applications.
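The abstract does not spell out how exact match or count bias are computed, so the following is only a minimal sketch under stated assumptions: exact match is taken to mean that the predicted option set equals the gold option set, and a signed mean difference between predicted and gold answer counts is used as one possible way to surface count bias. The function names (`exact_match_rate`, `mean_count_difference`) are hypothetical and are not taken from the SATA-BENCH release.

```python
from typing import Sequence


def exact_match_rate(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Fraction of questions whose predicted option set equals the gold set.

    Assumption: order and duplicates are ignored, so options are compared as sets.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    hits = sum(set(p) == set(g) for p, g in zip(predictions, references))
    return hits / len(references)


def mean_count_difference(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Sequence[str]],
) -> float:
    """Mean of (number of predicted options - number of gold options).

    Assumption: this is an illustrative proxy for count bias. A negative value
    means the model tends to select too few options; a positive value, too many.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    diffs = [len(set(p)) - len(set(g)) for p, g in zip(predictions, references)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    preds = [["A", "C"], ["B"], ["A", "B", "D"]]
    golds = [["C", "A"], ["B", "D"], ["A", "B", "D"]]
    print(exact_match_rate(preds, golds))
    print(mean_count_difference(preds, golds))
```

In this toy example the second question is a partial answer: the model picks a correct option but misses another, so it scores zero under exact match and pulls the count difference below zero. That is the kind of under-selection the abstract attributes to count bias.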
