

Qworld: Question-Specific Evaluation Criteria for LLMs

March 6, 2026
Authors: Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik
cs.AI

Abstract

Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.
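The recursive expansion described in the abstract (question → scenarios → perspectives → fine-grained binary criteria, grown both vertically and horizontally) can be sketched as a tree-building routine. This is a minimal illustration, not the authors' implementation: the node levels and the `expand_fn` callback are hypothetical stand-ins for the LLM-driven expansion prompts the paper actually uses.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of Qworld's recursive expansion tree. A question is
# expanded vertically through fixed levels and horizontally into several
# siblings per level. In the real method each expansion step is an LLM
# call; here expand_fn is an arbitrary stand-in for that call.

LEVELS = ["question", "scenario", "perspective", "criterion"]

@dataclass
class Node:
    label: str
    level: str                       # one of LEVELS
    children: list = field(default_factory=list)

def expand(node, expand_fn, width=3):
    """Recursively grow the tree until the criterion level is reached."""
    next_idx = LEVELS.index(node.level) + 1
    if next_idx >= len(LEVELS):
        return node                  # leaves are fine-grained binary criteria
    child_level = LEVELS[next_idx]
    for label in expand_fn(node, child_level, width):   # horizontal expansion
        child = Node(label, child_level)
        node.children.append(child)
        expand(child, expand_fn, width)                 # vertical expansion
    return node

def collect_criteria(node):
    """Flatten the tree's leaves into a question-specific rubric."""
    if node.level == "criterion":
        return [node.label]
    return [c for child in node.children for c in collect_criteria(child)]
```

With a toy `expand_fn` that emits `width` labeled children per node, `expand(Node("Q", "question"), toy_expand, width=2)` yields 2 scenarios × 2 perspectives × 2 criteria = 8 leaf criteria, which `collect_criteria` flattens into the rubric used to score responses.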