TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs
July 29, 2025
Authors: Kejia Zhang, Keda Tao, Zhiming Luo, Chang Liu, Jiasheng Tang, Huan Wang
cs.AI
Abstract
Multimodal large language models (MLLMs) enable vision-language reasoning,
yet often generate plausible outputs that are factually incorrect or visually
ungrounded, thereby compromising their reliability. Direct preference
optimization (DPO) is a common strategy for correcting hallucinations by
aligning model outputs with human preferences. Existing DPO strategies
typically treat hallucination-related preferences as fixed targets, relying on
static supervision signals during training. This approach tends to overfit to
superficial linguistic cues in preference data, leading to distributional
rigidity and spurious correlations that impair grounding in causally relevant
visual information. To overcome this limitation, we propose TARS, a
token-adaptive preference strategy that reformulates DPO as a min-max
optimization problem. TARS maximizes token-level distributional shifts under
semantic constraints to simulate alignment uncertainty, and simultaneously
minimizes the expected preference loss under these controlled perturbations.
This joint objective preserves causal grounding while mitigating overfitting to
preference patterns, thereby reducing hallucinations in multimodal reasoning.
We evaluate TARS on multiple hallucination benchmarks and find consistently
strong performance. Using only 4.8k preference samples and no expert feedback,
TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition
value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on
several key metrics.
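The min-max objective described in the abstract can be sketched as follows. This is an illustrative toy, not the paper's implementation: the function names, the box-constraint perturbation standing in for "token-level distributional shifts under semantic constraints", and the closed-form inner maximizer are all assumptions made for clarity.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))

def worst_case_dpo_loss(token_logps_chosen, token_logps_rejected,
                        ref_chosen, ref_rejected, eps=0.05, beta=0.1):
    """Inner maximization (hypothetical simplification): shift each token
    log-prob within an eps-box in the direction that increases the DPO loss.
    Lowering chosen log-probs and raising rejected ones worsens the margin,
    so the worst case under a box constraint sits at the corners; the outer
    step would then minimize this worst-case loss."""
    pert_chosen = token_logps_chosen - eps    # adversarially suppress chosen tokens
    pert_rejected = token_logps_rejected + eps  # adversarially boost rejected tokens
    return dpo_loss(pert_chosen.sum(), pert_rejected.sum(),
                    ref_chosen, ref_rejected, beta)

# Per-token log-probabilities for a chosen and a rejected response.
chosen = np.array([-0.5, -1.0, -0.2])
rejected = np.array([-0.6, -1.2, -0.9])
nominal = dpo_loss(chosen.sum(), rejected.sum(), chosen.sum(), rejected.sum())
worst = worst_case_dpo_loss(chosen, rejected, chosen.sum(), rejected.sum())
```

By construction the worst-case loss upper-bounds the nominal loss, which is the quantity the outer minimization targets; training against it is what discourages reliance on brittle token-level preference patterns.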