Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Environments

Abstract

This paper identifies a critical yet underexplored challenge in aligning a target model's reasoning with that of multiple multi-modal large language models (MLLMs): in non-stationary environments, the diverse reasoning distributions of the source models often evolve unpredictably, transmitting systematic biases and drift to the target model. To address this, we formulate multi-source reasoning alignment as a constraint satisfaction problem under concept drift theory. We propose Autonomous Preference Optimization (APO), a novel framework that treats inter-model divergences not as noise but as dynamic negative constraints. APO operates via a two-stage protocol: first, supervised bootstrapping projects the target model into the capability union of the source models; second, constraint-aware optimization synthesizes a consistent consensus manifold by explicitly suppressing drifting trajectories via a multi-negative Plackett-Luce objective. Extensive experiments on chest X-ray interpretation demonstrate that our 7B model achieves superior robustness, outperforming even proprietary source models in average accuracy. Furthermore, we release CXR-MAX, a large-scale benchmark comprising 170,982 reasoning trajectories from seven large-scale MLLMs, to facilitate research on reasoning alignment under drift. Code and data are available at: https://github.com/XiaoyuYoung/APO.
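As a rough illustration of what a multi-negative Plackett-Luce objective can look like, the sketch below computes the negative log-likelihood that one preferred trajectory is ranked first ahead of several rejected ones. It is not taken from the APO code: the function name `multi_negative_pl_loss` and the scalar scores are hypothetical, and how APO actually scores trajectories (for example, via scaled log-probability ratios against a reference model) is an assumption left outside this sketch.

```python
import math


def multi_negative_pl_loss(pos_score, neg_scores):
    """Top-1 Plackett-Luce negative log-likelihood.

    pos_score:  scalar score of the preferred (consensus) trajectory.
    neg_scores: scores of the rejected (drifting) trajectories.
    Both are hypothetical inputs; the scoring function is an assumption.
    """
    scores = [pos_score] + list(neg_scores)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    # -log( exp(pos) / sum_j exp(score_j) )
    return log_z - pos_score


if __name__ == "__main__":
    # The loss shrinks as the preferred trajectory's score pulls away
    # from the scores of the rejected trajectories.
    print(multi_negative_pl_loss(0.0, [0.0, 0.0, 0.0]))
    print(multi_negative_pl_loss(3.0, [0.0, 0.0, 0.0]))
```

With a single negative, this reduces to the familiar pairwise logistic (Bradley-Terry) loss; adding more negatives penalizes every rejected trajectory in one normalized term rather than through separate pairwise comparisons.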
