Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

Abstract

This paper identifies a critical yet underexplored challenge in aligning a target model's reasoning with that of multiple multi-modal large language models (MLLMs): in non-stationary environments, the diverse reasoning distributions of the source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of the source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs, to facilitate research on reasoning alignment under drift. Code and data are available at https://github.com/XiaoyuYoung/APO
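As an illustration of what a multi-negative Plackett-Luce objective can look like, the sketch below computes the negative log-probability that one preferred trajectory is ranked first among itself and several rejected (drifting) trajectories. This is a generic top-1 Plackett-Luce form, not necessarily the exact loss used by APO. The function names, the DPO-style implicit reward, and the `beta` parameter are assumptions introduced here for illustration; none of them come from the abstract.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward (an assumption, not taken from the paper).

    logp_policy / logp_ref are sequence log-probabilities of one trajectory
    under the target model and a frozen reference model.
    """
    return beta * (logp_policy - logp_ref)


def multi_negative_pl_loss(pos_score, neg_scores):
    """Negative log-probability that the preferred trajectory ranks first.

    Top-1 Plackett-Luce likelihood:
        P(pos first) = exp(pos) / (exp(pos) + sum_j exp(neg_j))
    """
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - pos_score


if __name__ == "__main__":
    # Hypothetical log-probabilities: one consensus trajectory, three drifting ones.
    pos = implicit_reward(-12.0, -14.0)
    negs = [implicit_reward(lp, lr) for lp, lr in [(-15.0, -14.5), (-13.0, -12.0), (-16.0, -15.0)]]
    print(multi_negative_pl_loss(pos, negs))
```

Under this formulation every additional negative enlarges the normalizer, so the loss falls only when the preferred trajectory's score rises relative to all of the rejected ones at once, rather than relative to a single paired rejection.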
