Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

Abstract

The widespread applicability and growing ubiquity of large language models (LLMs) have created a need to align LLM responses with user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to degrade model performance on many tasks. Moreover, keeping up with shifting user preferences is difficult in this setting. Decoding-time alignment with reward model guidance addresses these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of reward, often because they conflate the two aspects in a single formulation, and so fail to produce well-aligned responses. To remedy this, we decouple the two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is realized as the periodic replacement of poorly rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
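To make the decoupling concrete, the following is a minimal sketch of the loop the abstract describes, not the released implementation. Everything in it is an assumption made for illustration: the function names (`evolve`, `toy_mutate`, `toy_extend`, `toy_reward`), the population size, the replacement schedule, and the toy stand-ins for the LLM decoder and the reward model.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical toy data; a real system would use an LLM and a reward model.
HINTS = ["helpful", "clear", "brief"]
VOCAB = ["helpful", "vague", "clear", "rambling"]
GOOD = {"helpful", "clear"}


def toy_mutate(instruction: str, rng: random.Random) -> str:
    """Stand-in for instruction mutation: append a random style hint."""
    return f"{instruction} {rng.choice(HINTS)}"


def toy_extend(prompt: str, partial: str, rng: random.Random) -> str:
    """Stand-in for decoding one more chunk, loosely conditioned on the prompt."""
    word = rng.choice(VOCAB + [prompt.split()[-1]])
    return f"{partial} {word}".strip()


def toy_reward(text: str) -> float:
    """Stand-in for a reward model: count the 'good' words."""
    return float(sum(w in GOOD for w in text.split()))


def evolve(
    instruction: str,
    mutate: Callable[[str, random.Random], str],
    extend: Callable[[str, str, random.Random], str],
    reward: Callable[[str], float],
    population: int = 4,
    rounds: int = 3,
    keep: int = 2,
    seed: int = 0,
) -> Tuple[str, float]:
    rng = random.Random(seed)
    # Exploration: every slot decodes from its own (mutated) instruction.
    prompts: List[str] = [instruction] + [
        mutate(instruction, rng) for _ in range(population - 1)
    ]
    partials: List[str] = ["" for _ in prompts]
    for _ in range(rounds):
        partials = [extend(p, s, rng) for p, s in zip(prompts, partials)]
        # Exploitation: replace poorly rewarded partial generations with
        # copies of the best-rewarded ones before decoding continues.
        survivors = sorted(partials, key=reward, reverse=True)[:keep]
        partials = [survivors[i % keep] for i in range(population)]
    best = max(partials, key=reward)
    return best, reward(best)


if __name__ == "__main__":
    print(evolve("Explain photosynthesis", toy_mutate, toy_extend, toy_reward))
```

In this sketch the two aspects never share a formula: mutation alone determines which prompts are decoded from, and the reward alone determines which partial generations survive each round. Here replacement happens after every decoding step; the abstract only says it is periodic, so the actual interval is left open.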
