Reward Steering with Evolutionary Heuristics for Decoding-time Alignment

June 21, 2024
Authors: Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, Soujanya Poria
cs.AI

Abstract

The widespread applicability and increasing omnipresence of LLMs have created a need to align LLM responses with user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks, and keeping up with shifting user preferences is difficult in such a setting. Decoding-time alignment with reward model guidance addresses these issues at the cost of increased inference time. However, most such methods fail to strike the right balance between exploration and exploitation of reward, often because these two aspects are conflated in a single formulation, and thus fail to give well-aligned responses. To remedy this, we decouple the two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions, and exploitation is realized as the periodic replacement of poorly rewarded generations with well-rewarded ones. Empirical evidence indicates that this strategy outperforms many preference optimization and decoding-time alignment approaches on two widely accepted alignment benchmarks, AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
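To make the abstract's evolutionary loop concrete, below is a minimal sketch of reward-guided decoding with decoupled exploration and exploitation. It is not the authors' DARWIN implementation: the function names (`generate_response`, `reward`, `mutate`) and the population/generation parameters are hypothetical stand-ins, and the stub bodies only produce toy outputs so the script runs as written; in practice they would call an LLM sampler, a reward model, and an instruction-paraphrasing prompt.

```python
import random

# Hypothetical stubs -- replace with a real LLM sampler, reward model,
# and instruction-mutation prompt.
def generate_response(instruction: str) -> str:
    """Sample one response for an instruction (toy stand-in)."""
    return f"response to: {instruction} [{random.randint(0, 9999)}]"

def reward(instruction: str, response: str) -> float:
    """Score a response with a reward model (toy stand-in)."""
    return random.random()

def mutate(instruction: str) -> str:
    """Produce a paraphrased/mutated variant of the instruction (toy stand-in)."""
    return f"{instruction} (variant {random.randint(0, 9999)})"

def evolutionary_decode(instruction: str, population_size: int = 8,
                        num_generations: int = 3, keep_top: int = 4) -> str:
    """Decoding-time alignment sketch: exploration via mutated instructions,
    exploitation via periodic replacement of poorly rewarded candidates."""
    # Exploration: decode from the original instruction and several mutations.
    prompts = [instruction] + [mutate(instruction) for _ in range(population_size - 1)]
    population = [generate_response(p) for p in prompts]

    for _ in range(num_generations):
        # Score every candidate against the original instruction.
        ranked = sorted(population, key=lambda r: reward(instruction, r), reverse=True)
        # Exploitation: keep the well-rewarded candidates...
        survivors = ranked[:keep_top]
        # ...and refill the rest by decoding from freshly mutated instructions.
        refills = [generate_response(mutate(instruction))
                   for _ in range(population_size - keep_top)]
        population = survivors + refills

    # Return the highest-reward response found.
    return max(population, key=lambda r: reward(instruction, r))

if __name__ == "__main__":
    print(evolutionary_decode("Explain decoding-time alignment in one paragraph."))
```

The point of the decoupling is visible in the loop: mutation-driven decoding keeps injecting diverse candidates (exploration), while the periodic rank-and-replace step steers the population toward high-reward responses (exploitation), rather than folding both pressures into a single scoring rule.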
