Vibe Checker: Aligning Code Evaluation with Human Preference
October 8, 2025
Authors: Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun
cs.AI
Abstract
Large Language Models (LLMs) have catalyzed vibe coding, where users leverage
LLMs to generate and iteratively refine code through natural language
interactions until it passes their vibe check. The vibe check is tied to real-world
human preference and goes beyond functionality: the solution should feel right,
read cleanly, preserve intent, and remain correct. However, current code
evaluation remains anchored to pass@k and captures only functional correctness,
overlooking the non-functional instructions that users routinely apply. In this
paper, we hypothesize that, beyond functional correctness, instruction
following is the missing piece underlying the vibe check that represents human
preference in coding. To quantify models' code instruction-following
capabilities with measurable signals, we present VeriCode, a taxonomy of 30
verifiable code instructions together with corresponding deterministic
verifiers. We use the taxonomy to augment established evaluation suites,
resulting in Vibe Checker, a testbed to assess both code instruction following
and functional correctness. Upon evaluating 31 leading LLMs, we show that even
the strongest models struggle to comply with multiple instructions and exhibit
clear functional regression. Most importantly, a composite score of functional
correctness and instruction following correlates best with human
preference, with the latter emerging as the primary differentiator on
real-world programming tasks. Our work identifies core factors of the vibe
check, providing a concrete path for benchmarking and developing models that
better align with user preferences in coding.
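
To make the idea of deterministic verifiers and a composite score concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the two example instructions (a line-length limit and required docstrings), every function name, and the equal-weight composite are hypothetical, since the abstract does not list the 30 instructions or say how the composite score is computed. The `pass_at_k` function is the commonly used unbiased pass@k estimator, included only to show the functional half of the score.

```python
import ast
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Commonly used unbiased pass@k estimator.

    n: samples generated, c: samples passing all unit tests, k: budget.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical verifiable instruction: "keep every line within `limit` characters".
def verify_max_line_length(code: str, limit: int = 79) -> bool:
    return all(len(line) <= limit for line in code.splitlines())


# Hypothetical verifiable instruction: "every function must have a docstring".
def verify_functions_have_docstrings(code: str) -> bool:
    tree = ast.parse(code)
    funcs = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(f) is not None for f in funcs)


def instruction_following_rate(code: str, verifiers) -> float:
    """Fraction of the given instructions that the code satisfies."""
    results = [verifier(code) for verifier in verifiers]
    return sum(results) / len(results)


def composite_score(functional: float, instruction: float, weight: float = 0.5) -> float:
    """Assumed weighting; the abstract does not specify how the two are combined."""
    return weight * functional + (1.0 - weight) * instruction


sample = 'def add(a, b):\n    """Return the sum of a and b."""\n    return a + b\n'
verifiers = [verify_max_line_length, verify_functions_have_docstrings]
if_rate = instruction_following_rate(sample, verifiers)
score = composite_score(pass_at_k(n=10, c=3, k=1), if_rate)
```

Each verifier is a pure function of the generated code, so the same output always receives the same verdict. That determinism is what makes an instruction "verifiable" in the sense the abstract describes.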