WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training
April 16, 2026
Authors: Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xueyi Pu, Fan Zhuo, Zhou Zhao
cs.AI
Abstract
End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to apply preference optimization directly to spoken dialogue models, yet this transfer is non-trivial. We analyze the obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on this analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating the mixture of the two based on rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in both semantic quality and speech expressiveness.
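The abstract only gestures at how the semantic preference term and the acoustic anchor are mixed. The sketch below is one plausible reading of that mechanism, not the authors' released code: the function name `adaptive_hybrid_loss`, the use of reward standard deviation over a rollout group as the gating statistic, and the `std_floor` threshold are all illustrative assumptions.

```python
import torch

def adaptive_hybrid_loss(
    semantic_pref_loss: torch.Tensor,   # preference (e.g., DPO-style) loss on semantic tokens only
    acoustic_anchor_loss: torch.Tensor, # anchor (e.g., KL-to-reference) loss on acoustic tokens
    rollout_rewards: torch.Tensor,      # rewards for the sampled rollouts of one prompt
    std_floor: float = 0.05,            # assumed threshold below which rewards are too flat to trust
) -> torch.Tensor:
    """Blend a semantic-channel preference objective with an acoustic anchor,
    gating the preference term by how discriminative the rollout rewards are."""
    reward_std = rollout_rewards.std().item()
    # Gate in [0, 1]: it vanishes when all rollouts score alike (an unreliable
    # preference gradient) and saturates once the reward spread clears the floor.
    gate = min(reward_std / std_floor, 1.0)
    return gate * semantic_pref_loss + acoustic_anchor_loss
```

In this reading, the acoustic anchor stays on at all times, keeping speech behavior pinned to a reference, while the sparse preference gradient is admitted only when the rollout group actually distinguishes good responses from bad ones.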