
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

February 19, 2025
Authors: Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference data reveals capabilities and potential of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales.
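The abstract combines a preference objective over self-generated short-to-long data with a short-to-long KL constraint. The PyTorch sketch below illustrates one plausible way such an objective could be assembled; it is inferred from the abstract alone, and the function name, tensor inputs, and hyperparameters (beta, kl_weight) are illustrative assumptions rather than the paper's actual implementation.

import torch.nn.functional as F

def longpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                policy_short_logits, ref_short_logits,
                beta=0.1, kl_weight=0.1):
    """Hypothetical LongPO-style objective, sketched from the abstract only.

    "Chosen" responses are those the short-context-aligned model generated
    from the compressed (short) input; "rejected" responses are those it
    generated from the full long-context input. The log-probabilities are
    assumed to be evaluated with the long-context input during training.
    """
    # DPO-style preference term on the self-generated short-to-long pairs.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    pref_loss = -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()

    # Short-to-long KL constraint: penalise the policy's short-context token
    # distribution for drifting away from the reference (original
    # short-context) model. The direction is chosen here as
    # KL(policy || reference); the paper's exact formulation may differ.
    kl = F.kl_div(
        F.log_softmax(ref_short_logits, dim=-1),     # input: reference log-probs
        F.log_softmax(policy_short_logits, dim=-1),  # target: policy log-probs
        log_target=True,
        reduction="batchmean",
    )
    return pref_loss + kl_weight * kl

In practice, the preference log-probabilities would come from scoring the self-generated response pairs under the current policy and a frozen reference model, while the short-context logits would be computed on the compressed inputs.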
