Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Abstract

Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt quickly to human preferences without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine the model's responses. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics show that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass its aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both search width and search depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
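The loop described in the abstract (score candidate responses, turn the reward signal into a textual critique, revise, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation in the linked repository: the function `tpo_sketch`, its parameters `width` and `depth`, the four callables, and the choice to contrast the highest- and lowest-scored candidates are all hypothetical. The `toy_*` functions are deterministic stand-ins for an LLM and a reward model so that the example runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the authors' API):
#   sample(prompt, n)                      -> n candidate responses
#   score(prompt, response)                -> numerical reward
#   critique(prompt, chosen, rejected)     -> textual feedback
#   revise(prompt, chosen, feedback, n)    -> n revised responses
Sample = Callable[[str, int], List[str]]
Score = Callable[[str, str], float]
Critique = Callable[[str, str, str], str]
Revise = Callable[[str, str, str, int], List[str]]


def tpo_sketch(prompt: str, sample: Sample, score: Score,
               critique: Critique, revise: Revise,
               width: int = 3, depth: int = 2) -> Tuple[str, float]:
    """Return the best (response, reward) pair found; no weights are updated."""
    # Search width: number of candidates drawn per step.
    pool = [(r, score(prompt, r)) for r in sample(prompt, width)]
    # Search depth: number of critique-and-revise iterations.
    for _ in range(depth):
        pool.sort(key=lambda pair: pair[1], reverse=True)
        chosen, rejected = pool[0][0], pool[-1][0]
        # Turn the numerical reward signal into a textual critique.
        feedback = critique(prompt, chosen, rejected)
        # Use the critique to produce revised candidates, then score them.
        revised = revise(prompt, chosen, feedback, width)
        pool.extend((r, score(prompt, r)) for r in revised)
    return max(pool, key=lambda pair: pair[1])


# Toy stand-ins (purely illustrative): the "reward" counts the word "detail".
def toy_sample(prompt: str, n: int) -> List[str]:
    return [f"draft {i}" for i in range(n)]


def toy_score(prompt: str, response: str) -> float:
    return float(response.count("detail"))


def toy_critique(prompt: str, chosen: str, rejected: str) -> str:
    return "The better response is more specific; add more detail."


def toy_revise(prompt: str, chosen: str, feedback: str, n: int) -> List[str]:
    return [chosen + " detail" * (i + 1) for i in range(n)]


best_response, best_reward = tpo_sketch(
    "Explain TPO.", toy_sample, toy_score, toy_critique, toy_revise)
```

In a real setting the four callables would wrap calls to a policy LLM and a reward model; keeping them as parameters makes the width and depth trade-off mentioned in the abstract easy to explore without changing the loop itself.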

