Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Abstract

The rapid advancement of image generation technologies has intensified the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes and provide no human-understandable justification for their decisions. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then fine-tune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model significantly outperforms baseline methods in both detecting AI-generated images and localizing visual flaws.
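
As an illustration of what a grounded annotation might look like, the following is a minimal sketch only. The abstract does not specify the dataset schema or the localization metric, so the class names, field names, and the use of intersection-over-union (IoU) as a box-overlap score are all assumptions made for this example, not the authors' format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ArtifactAnnotation:
    """One synthesis artifact: where it is and a caption describing it."""
    box: Box
    caption: str


@dataclass
class AnnotatedImage:
    """Hypothetical record pairing an image with its grounded annotations."""
    image_path: str
    is_ai_generated: bool
    artifacts: List[ArtifactAnnotation]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes.

    IoU is a common way to score predicted boxes against annotated ones;
    it is assumed here and is not stated in the abstract.
    """
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Illustrative record; the path and caption are made up for this sketch.
sample = AnnotatedImage(
    image_path="example.png",
    is_ai_generated=True,
    artifacts=[ArtifactAnnotation((40.0, 60.0, 120.0, 140.0), "Hand has six fingers")],
)

# Compare a hypothetical model prediction with the annotated box.
predicted_box: Box = (50.0, 60.0, 130.0, 140.0)
overlap = iou(predicted_box, sample.artifacts[0].box)
```

Pairing each box with a caption in a single record keeps the visual evidence and its textual explanation together, which is the kind of alignment the abstract describes between localization and explanation.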