Reasons to Reject? Aligning Language Models with Judgments

Abstract

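As humans, we constantly interact with our peers and receive feedback in the form of natural language. This language feedback lets us reflect on our actions, maintain appropriate behavior, and correct our mistakes. A natural question follows: can language feedback be used to align large language models (LLMs)? In contrast to prior work that aligns LLMs with reward or preference data, we present the first systematic exploration of alignment through language feedback (i.e., judgments). We begin with an in-depth investigation of existing methods that can be adapted for aligning LLMs with judgments and find that they cannot take full advantage of judgments. To use judgments more effectively, we propose Contrastive Unlikelihood Training (CUT), a novel framework that enables fine-grained detection and correction of inappropriate content based on judgments. Our offline alignment results show that, with only 1317 off-the-shelf judgment examples, CUT (LLaMA2-13b) beats the 175B DaVinci003 and surpasses the best baseline by 52.34 points on AlpacaEval. Our online alignment results demonstrate that CUT can align LLMs (LLaMA2-chat-13b) iteratively using model-specific judgment data, with performance on AlpacaEval improving steadily from 81.09 to 91.36 points. Our analysis further suggests that judgments hold greater potential than rewards for LLM alignment and warrant future research.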

