Learning Human-Perceived Fakeness in AI-Generated Videos via Multimodal LLMs

Abstract

Can humans identify AI-generated (fake) videos and provide grounded reasons? While video generation models have advanced rapidly, a critical dimension has been largely overlooked: whether humans can detect deepfake traces within a generated video, i.e., spatiotemporally grounded visual artifacts that reveal a video as machine-generated. We introduce DeeptraceReward, the first fine-grained, spatially and temporally aware benchmark that annotates human-perceived fake traces for video generation reward. The dataset comprises 4.3K detailed annotations across 3.3K high-quality generated videos. Each annotation provides a natural-language explanation, pinpoints a bounding-box region containing the perceived trace, and marks precise onset and offset timestamps. We consolidate these annotations into 9 major categories of deepfake traces that lead humans to identify a video as AI-generated, and train multimodal language models (LMs) as reward models to mimic human judgments and localizations. On DeeptraceReward, our 7B reward model outperforms GPT-5 by 34.7% on average across fake clue identification, grounding, and explanation. Interestingly, we observe a consistent difficulty gradient: binary fake vs. real classification is substantially easier than fine-grained deepfake trace detection; within the latter, performance degrades from natural-language explanations (easiest), to spatial grounding, to temporal labeling (hardest). By foregrounding human-perceived deepfake traces, DeeptraceReward provides a rigorous testbed and training signal for socially aware and trustworthy video generation.
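To make the annotation structure concrete, the following is a minimal sketch of how one annotation (explanation, bounding box, onset and offset timestamps) might be represented, and how a model's predicted localization could be compared against a human one. The class name `TraceAnnotation`, its field names, the pixel and second units, and the use of intersection-over-union as the comparison are all assumptions made for illustration; the abstract does not specify the dataset's file format or the evaluation metrics.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class TraceAnnotation:
    """One human-perceived fake trace (illustrative schema, not the dataset's own)."""
    explanation: str  # natural-language explanation of the trace
    box: Box          # region containing the perceived trace
    onset_s: float    # time the trace first appears, in seconds
    offset_s: float   # time the trace stops being visible, in seconds


def box_iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two boxes."""
    overlap_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_w * overlap_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def temporal_iou(a: TraceAnnotation, b: TraceAnnotation) -> float:
    """Intersection-over-union of the two [onset, offset] intervals."""
    inter = max(0.0, min(a.offset_s, b.offset_s) - max(a.onset_s, b.onset_s))
    union = (a.offset_s - a.onset_s) + (b.offset_s - b.onset_s) - inter
    return inter / union if union > 0 else 0.0


# Made-up example values, not taken from the dataset.
human = TraceAnnotation("Hand merges into the cup", (100, 100, 200, 200), 1.0, 3.0)
model = TraceAnnotation("Fingers fuse with the cup", (150, 100, 250, 200), 2.0, 4.0)
spatial_score = box_iou(human.box, model.box)
temporal_score = temporal_iou(human, model)
```

Scoring the spatial and temporal overlaps separately mirrors the abstract's distinction between spatial grounding and temporal labeling, which it reports as different difficulty levels.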
