Adaptive Decoding via Latent Preference Optimization

Abstract

During language model decoding, higher-temperature sampling is known to give more creative responses, while lower temperatures give more factually accurate ones. However, such models are commonly applied to general instruction following, which involves both creative and fact-seeking tasks, using a single fixed temperature across all examples and tokens. In this work, we introduce Adaptive Decoding, a layer added to the model that selects the sampling temperature dynamically at inference time, at either the token or example level, in order to optimize performance. To learn its parameters, we introduce Latent Preference Optimization (LPO), a general approach to training discrete latent variables such as choices of temperature. Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures, including UltraFeedback, Creative Story Writing, and GSM8K.
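To make the idea of per-step temperature selection concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes a small discrete set of candidate temperatures (`TEMPERATURES`), a toy linear selector over a hidden vector (`select_temperature`), and hand-picked weights; all of these names and values are hypothetical. The LPO training procedure for the selector is not shown.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate temperatures; the abstract does not specify a set.
TEMPERATURES = [0.2, 0.6, 1.0]

def select_temperature(hidden, weights, rng):
    """Toy stand-in for the adaptive layer (an assumption): a linear map
    from a hidden vector to one score per candidate temperature, followed
    by sampling one candidate from the resulting distribution."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    probs = softmax(scores)
    idx = rng.choices(range(len(TEMPERATURES)), weights=probs, k=1)[0]
    return TEMPERATURES[idx]

def sample_token(logits, hidden, weights, rng):
    """Choose a temperature for this step, then sample a token with it."""
    temperature = select_temperature(hidden, weights, rng)
    probs = softmax(logits, temperature)
    token = rng.choices(range(len(logits)), weights=probs, k=1)[0]
    return token, temperature

# Usage with made-up values: three-token vocabulary, two-dimensional hidden state.
rng = random.Random(0)
logits = [2.0, 1.0, 0.1]
hidden = [0.5, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
token, temperature = sample_token(logits, hidden, weights, rng)
```

Calling `select_temperature` once per token corresponds to the token-level variant described above; calling it once per prompt and reusing the result for every step would correspond to the example-level variant.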
