TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

Abstract

Multimodal large language models (MLLMs) enable vision-language reasoning, yet often generate plausible outputs that are factually incorrect or visually ungrounded, thereby compromising their reliability. Direct preference optimization (DPO) is a common strategy for correcting hallucinations by aligning model outputs with human preferences. Existing DPO strategies typically treat hallucination-related preferences as fixed targets, relying on static supervision signals during training. This approach tends to overfit to superficial linguistic cues in preference data, leading to distributional rigidity and spurious correlations that impair grounding in causally relevant visual information. To overcome this limitation, we propose TARS, a token-adaptive preference strategy that reformulates DPO as a min-max optimization problem. TARS maximizes token-level distributional shifts under semantic constraints to simulate alignment uncertainty, and simultaneously minimizes the expected preference loss under these controlled perturbations. This joint objective preserves causal grounding while mitigating overfitting to preference patterns, thereby reducing hallucinations in multimodal reasoning. We evaluate TARS on multiple hallucination benchmarks and find consistently strong performance. Using only 4.8k preference samples and no expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and decreases cognition value from 2.5 to 0.4. It outperforms standard DPO and matches GPT-4o on several key metrics.
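
To make the min-max framing concrete, the following is a minimal sketch and not the TARS implementation. It computes the standard DPO loss for one preference pair, then takes the worst case over a finite set of perturbed candidates as a stand-in for the inner maximization. The function names, the `candidates` list, and the default `beta` are illustrative assumptions; the abstract does not specify how the token-level perturbations or the semantic constraints are constructed.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair.

    All arguments are sequence log-probabilities: the policy's and the
    frozen reference model's scores for the chosen and rejected responses.
    """
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) is the same quantity as softplus(-margin).
    return softplus(-margin)


def worst_case_dpo_loss(candidates, ref_chosen: float, ref_rejected: float,
                        beta: float = 0.1) -> float:
    """Inner maximization over a finite candidate set (illustrative only).

    `candidates` is a hypothetical list of (pol_chosen, pol_rejected)
    log-probabilities, one pair per admissible token-level perturbation.
    How such perturbations are generated and constrained is an assumption
    left outside this sketch.
    """
    return max(dpo_loss(c, r, ref_chosen, ref_rejected, beta)
               for c, r in candidates)


# Hypothetical log-probabilities under three perturbed inputs.
candidates = [(-10.0, -14.0), (-11.5, -13.0), (-12.0, -12.5)]
loss = worst_case_dpo_loss(candidates, ref_chosen=-12.0, ref_rejected=-12.0)
print(f"worst-case DPO loss: {loss:.4f}")
```

In a training loop, the outer minimization would update the policy parameters to reduce this worst-case value, so the model is penalized on the perturbation where its preference margin is weakest rather than only on the unperturbed pair.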
