

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

May 2, 2026
Authors: Xiaoyu Yang, En Yu, Wei Duan, Jie Lu
cs.AI

Abstract

This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
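The multi-negative Plackett-Luce objective mentioned above ranks one preferred response above several negatives. As an illustration of the standard Plackett-Luce model (not the authors' implementation, which scores responses with the target model's log-probabilities), a minimal sketch of the ranking negative log-likelihood:

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` is ordered from most- to least-preferred: scores[0] would be the
    consensus response, and the remaining entries the negatives (e.g. drifting
    trajectories ranked below it). The likelihood is a product of sequential
    softmaxes: at each step, the top remaining item competes against the
    rest of the pool.
    """
    nll = 0.0
    for k in range(len(scores) - 1):
        pool = scores[k:]
        # log-partition over the remaining pool, then the k-th item's log-softmax
        log_z = math.log(sum(math.exp(s) for s in pool))
        nll -= scores[k] - log_z
    return nll
```

Minimizing this loss pushes the preferred response's score up relative to every negative simultaneously, which is what lets a single objective suppress multiple drifting trajectories at once; with one negative it reduces to the familiar pairwise Bradley-Terry (DPO-style) loss.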