Vibe Checker: Aligning Code Evaluation with Human Preference

October 8, 2025
Authors: Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun
cs.AI

Abstract

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. The vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying the vibe check, representing human preference in coding alongside functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed that assesses both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
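
To make the quantities named in the abstract concrete, the following is a minimal Python sketch, not the paper's implementation. `pass_at_k` uses the standard unbiased pass@k estimator. The two verifiers, the per-sample instruction-following rate, and the weight `alpha` in `composite_score` are hypothetical illustrations; the abstract does not specify VeriCode's 30 instructions or how the composite score is computed.

```python
from math import comb
from typing import Callable, Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimate: n samples drawn, c of them pass all tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical deterministic verifiers (assumed for illustration only).
def max_line_length_ok(code: str, limit: int = 79) -> bool:
    """True if no line in `code` is longer than `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def has_no_tabs(code: str) -> bool:
    """True if `code` contains no tab characters."""
    return "\t" not in code


def instruction_following_rate(
    code: str, verifiers: Sequence[Callable[[str], bool]]
) -> float:
    """Fraction of verifiers that `code` satisfies."""
    return sum(bool(v(code)) for v in verifiers) / len(verifiers)


def composite_score(functional: float, instruction: float, alpha: float = 0.5) -> float:
    """Assumed convex combination; alpha is an illustrative weight, not from the paper."""
    return alpha * functional + (1.0 - alpha) * instruction


if __name__ == "__main__":
    sample = "def add(a, b):\n\treturn a + b\n"  # uses a tab, so one verifier fails
    func = pass_at_k(n=10, c=3, k=1)
    instr = instruction_following_rate(sample, [max_line_length_ok, has_no_tabs])
    print(func, instr, composite_score(func, instr))
```

Because each verifier is a pure function of the code string, repeated runs give the same verdict, which is the property that "deterministic" refers to here.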