Vibe Checker: Aligning Code Evaluation with Human Preference

Abstract

Large Language Models (LLMs) have catalyzed vibe coding, in which users generate and iteratively refine code through natural language interactions until it passes their vibe check. The vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying the vibe check, representing human preference in coding alongside functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed that assesses both code instruction following and functional correctness. Evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
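To make the idea of a deterministic verifier concrete, the following is a minimal sketch of how two non-functional instructions could be checked mechanically and then combined with a functional score. Everything here is an assumption made for illustration: the two instructions (a line-length limit and mandatory docstrings), the function names, and the weighted-average composite are hypothetical and are not taken from VeriCode or Vibe Checker, whose actual instructions and scoring are not specified in the abstract.

```python
import ast


def verify_max_line_length(code: str, limit: int = 79) -> bool:
    """Hypothetical verifier: every line has at most `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def verify_has_docstrings(code: str) -> bool:
    """Hypothetical verifier: every function definition has a docstring."""
    tree = ast.parse(code)
    functions = [
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(fn) is not None for fn in functions)


def instruction_following_score(code: str, verifiers) -> float:
    """Fraction of the given verifiers that the code satisfies."""
    results = [verifier(code) for verifier in verifiers]
    return sum(results) / len(results)


def composite_score(functional: float, instruction: float,
                    weight: float = 0.5) -> float:
    """Assumed composite: a weighted average of the two signals."""
    return weight * functional + (1 - weight) * instruction


# A solution that is functionally fine but ignores the docstring instruction.
sample = "def add(a, b):\n    return a + b\n"
verifiers = [verify_max_line_length, verify_has_docstrings]
if_score = instruction_following_score(sample, verifiers)
overall = composite_score(functional=1.0, instruction=if_score)
```

Because each check is a pure function of the source text, the same code always receives the same verdict, which is the property that makes such instructions verifiable. The equal weighting in `composite_score` is only a placeholder; how the two signals should be balanced is an empirical question.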
