WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

April 16, 2026
Authors: Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xueyi Pu, Fan Zhuo, Zhou Zhao
cs.AI

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to apply preference optimization directly to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on this analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.
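The abstract does not spell out the recipe's losses, but the description suggests a concrete shape: a preference loss restricted to semantic-token likelihoods, an explicit anchoring loss for the acoustic stream, and a mixing weight driven by rollout reward statistics. The PyTorch sketch below is a minimal illustration of that idea under stated assumptions; all names, the chosen/rejected pairing scheme, and the spread-based gating rule are hypothetical, not the paper's actual formulation.

```python
# Hypothetical sketch of one modality-aware adaptive post-training step,
# assuming a model that emits separate semantic (text) and acoustic (speech)
# token streams from a shared backbone. Every name below is illustrative.

import torch

def adaptive_hybrid_loss(
    policy_semantic_logp: torch.Tensor,  # log p_theta(semantic tokens) per rollout, shape [N]
    ref_semantic_logp: torch.Tensor,     # same quantity under a frozen reference model, shape [N]
    anchor_acoustic_loss: torch.Tensor,  # e.g. teacher-forced NLL of acoustic tokens on anchor speech
    rewards: torch.Tensor,               # scalar preference rewards for the N sampled rollouts
    beta: float = 0.1,                   # DPO-style temperature on the semantic channel (assumed)
    eps: float = 1e-6,
) -> torch.Tensor:
    # Pair rollouts into chosen/rejected halves by reward rank
    # (a simple pairing scheme, assumed for illustration).
    order = torch.argsort(rewards, descending=True)
    half = rewards.numel() // 2
    chosen, rejected = order[:half], order[half:2 * half]

    # Preference loss applied ONLY to semantic-token log-probs, so the sparse
    # preference gradient never touches acoustic predictions directly.
    margin = beta * (
        (policy_semantic_logp[chosen] - ref_semantic_logp[chosen])
        - (policy_semantic_logp[rejected] - ref_semantic_logp[rejected])
    )
    pref_loss = -torch.nn.functional.logsigmoid(margin).mean()

    # Adaptive mixture from rollout statistics: a collapsed reward
    # distribution (low spread) signals an unreliable preference gradient,
    # so alpha -> 0 and the update falls back to acoustic anchoring.
    spread = rewards.std()
    alpha = (spread / (spread + rewards.abs().mean() + eps)).clamp(0.0, 1.0)

    # Acoustic behavior is shaped by the explicit anchor term, not by RL.
    return alpha * pref_loss + (1.0 - alpha) * anchor_acoustic_loss
```

Gating on reward spread is only one plausible reading of "dynamically regulating their mixture from rollout statistics"; other rollout signals, such as reward-model agreement, would slot into the same alpha.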