

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

May 2, 2026
Authors: Xiaoyu Yang, En Yu, Wei Duan, Jie Lu
cs.AI

Abstract

This paper identifies a critical yet underexplored challenge in reasoning alignment from multiple multi-modal large language models (MLLMs): In non-stationary environments, the diverse reasoning distributions of source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise, but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
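The multi-negative Plackett-Luce objective mentioned above ranks one preferred response above several negatives. As an illustration of the standard Plackett-Luce model (not the authors' implementation, which scores responses with the target model's log-probabilities), a minimal sketch of the ranking negative log-likelihood:

```python
import math

def plackett_luce_nll(scores):
    """Negative log-likelihood of a ranking under the Plackett-Luce model.

    `scores` is ordered from most- to least-preferred: scores[0] would be the
    consensus response, and the remaining entries the negatives (e.g. drifting
    trajectories ranked below it). The likelihood is a product of sequential
    softmaxes: at each step, the top remaining item competes against the
    rest of the pool.
    """
    nll = 0.0
    for k in range(len(scores) - 1):
        pool = scores[k:]
        # log-partition over the remaining pool, then the k-th item's log-softmax
        log_z = math.log(sum(math.exp(s) for s in pool))
        nll -= scores[k] - log_z
    return nll
```

Minimizing this loss pushes the preferred response's score up relative to every negative simultaneously, which is what lets a single objective suppress multiple drifting trajectories at once; with one negative it reduces to the familiar pairwise Bradley-Terry (DPO-style) loss.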