A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Abstract

Recent advancements in Large Multimodal Models (LMMs) have significantly improved multimodal understanding and generation. However, these models still struggle to generate tightly interleaved image-text outputs, primarily because current training datasets are limited in scale, quality, and instructional richness. To address this, we introduce InterSyn, a large-scale multimodal dataset constructed using our Self-Evaluation with Iterative Refinement (SEIR) method. InterSyn features multi-turn, instruction-driven dialogues with tightly interleaved image-text responses, and it provides rich object diversity and rigorous automated quality refinement, making it well suited for training next-generation instruction-following LMMs. Furthermore, to address the lack of reliable evaluation tools capable of assessing interleaved multimodal outputs, we introduce SynJudge, an automatic evaluation model designed to quantitatively assess multimodal outputs along four dimensions: text content, image content, image quality, and image-text synergy. Experimental studies show that the SEIR method yields substantially higher dataset quality than an otherwise identical process without refinement. Moreover, LMMs trained on InterSyn achieve uniform performance gains across all evaluation metrics, confirming InterSyn's utility for advancing multimodal systems.