Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization

Abstract

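Despite recent advances in Large Video Language Models (LVLMs), these models still struggle with fine-grained temporal understanding, hallucinate, and often make basic mistakes even on simple video question-answering tasks. These problems pose significant challenges to the safe and reliable deployment of LVLMs in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. The framework first constructs a training dataset of preferred and non-preferred response pairs. The non-preferred responses are generated to reflect common error patterns, such as inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the visual modality. To facilitate self-alignment of LVLMs with these response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method. RRPO addresses the limitations of Direct Preference Optimization (DPO) by using sub-sequence-level refined rewards and token-wise KL regularization. We demonstrate that RRPO achieves more precise alignment and more stable training than DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.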

English

Despite recent advances in Large Video Language Models (LVLMs), they still struggle with fine-grained temporal understanding, hallucinate, and often make simple mistakes on even simple video question-answering tasks, all of which pose significant challenges to their safe and reliable deployment in real-world applications. To address these limitations, we propose a self-alignment framework that enables LVLMs to learn from their own errors. Our proposed framework first obtains a training set of preferred and non-preferred response pairs, where non-preferred responses are generated by incorporating common error patterns that often occur due to inadequate spatio-temporal understanding, spurious correlations between co-occurring concepts, and over-reliance on linguistic cues while neglecting the vision modality, among others. To facilitate self-alignment of LVLMs with the constructed preferred and non-preferred response pairs, we introduce Refined Regularized Preference Optimization (RRPO), a novel preference optimization method that utilizes sub-sequence-level refined rewards and token-wise KL regularization to address the limitations of Direct Preference Optimization (DPO). We demonstrate that RRPO achieves more precise alignment and more stable training compared to DPO. Our experiments and analysis validate the effectiveness of our approach across diverse video tasks, including video hallucination, short- and long-video understanding, and fine-grained temporal reasoning.
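The abstract contrasts RRPO with DPO but does not state either objective. As a rough illustration only, the sketch below computes the standard DPO loss from per-token log-probabilities, and then a hypothetical variant that (a) restricts the reward to masked sub-sequences and (b) adds a token-wise KL penalty. The function names, the mask convention, the choice to regularize the preferred response, and the weights `beta` and `alpha` are all assumptions made for this example; the paper's exact formulation may differ.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(pol_w: Sequence[float], ref_w: Sequence[float],
             pol_l: Sequence[float], ref_l: Sequence[float],
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair.

    pol_* / ref_* are per-token log-probs under the policy and the frozen
    reference model; w = preferred response, l = non-preferred response.
    """
    reward_w = beta * (sum(pol_w) - sum(ref_w))
    reward_l = beta * (sum(pol_l) - sum(ref_l))
    return -log_sigmoid(reward_w - reward_l)


def masked_reward(pol: Sequence[float], ref: Sequence[float],
                  mask: Sequence[int], beta: float) -> float:
    """Implicit reward summed only over tokens where mask == 1.

    Assumption: the mask marks the sub-sequences in which the preferred and
    non-preferred responses actually differ.
    """
    return beta * sum(m * (p - r) for p, r, m in zip(pol, ref, mask))


def token_kl(ref_dists: Sequence[Sequence[float]],
             pol_dists: Sequence[Sequence[float]]) -> float:
    """Mean over token positions of KL(reference || policy)."""
    total = 0.0
    for q, p in zip(ref_dists, pol_dists):
        total += sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)
    return total / len(ref_dists)


def refined_regularized_loss(pol_w, ref_w, mask_w, pol_l, ref_l, mask_l,
                             ref_dists_w, pol_dists_w,
                             beta: float = 0.1, alpha: float = 0.01) -> float:
    """Hypothetical RRPO-style loss: sub-sequence rewards plus a token-wise KL term."""
    margin = (masked_reward(pol_w, ref_w, mask_w, beta)
              - masked_reward(pol_l, ref_l, mask_l, beta))
    return -log_sigmoid(margin) + alpha * token_kl(ref_dists_w, pol_dists_w)


if __name__ == "__main__":
    # Toy pair: the responses differ only in their last token.
    ref = [-1.0, -1.0, -2.0]
    pol_w = [-1.0, -1.0, -1.5]   # policy raises the preferred token
    pol_l = [-1.0, -1.0, -2.5]   # policy lowers the non-preferred token
    mask = [0, 0, 1]
    dists = [[0.5, 0.5]] * 3     # identical distributions, so the KL term is zero
    print(dpo_loss(pol_w, ref, pol_l, ref))
    print(refined_regularized_loss(pol_w, ref, mask, pol_l, ref, mask, dists, dists))
```

In this sketch the masked reward ignores tokens shared by both responses, so the gradient signal concentrates on the differing span, while the KL term penalizes drift from the reference model at every token position rather than only through the sequence-level log-ratio.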

Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
