

Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling

August 7, 2024
作者: Zilyu Ye, Jinxiu Liu, Ruotian Peng, Jinjin Cao, Zhiyang Chen, Yiyang Zhang, Ziwei Xuan, Mingyuan Zhou, Xiaoqian Shen, Mohamed Elhoseiny, Qi Liu, Guo-Jun Qi
cs.AI

Abstract
Recent image generation models excel at creating high-quality images from brief captions. However, they fail to maintain consistency of multiple instances across images when encountering lengthy contexts. This inconsistency is largely due to the absence of granular instance feature labeling in existing training datasets. To tackle these issues, we introduce Openstory++, a large-scale dataset combining additional instance-level annotations with both images and text. Furthermore, we develop a training methodology that emphasizes entity-centric image-text generation, ensuring that models learn to effectively interweave visual and textual information. Specifically, Openstory++ streamlines the process of keyframe extraction from open-domain videos, employing vision-language models to generate captions that are then polished by a large language model for narrative continuity. It surpasses previous datasets by offering a more expansive open-domain resource, which incorporates automated captioning, high-resolution imagery tailored for instance count, and extensive frame sequences for temporal consistency. Additionally, we present Cohere-Bench, a pioneering benchmark framework for evaluating image generation tasks when a long multimodal context is provided, including the ability to keep the background, style, and instances in the given context coherent. Compared to existing benchmarks, our work fills critical gaps in multimodal generation, propelling the development of models that can adeptly generate and interpret complex narratives in open-domain environments. Experiments conducted within Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality visual storytelling models, enhancing their ability to address open-domain generation tasks. More details can be found at https://openstorypp.github.io/
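The curation pipeline the abstract describes (extract keyframes from open-domain videos, caption each with a vision-language model, then polish the caption sequence with a large language model for narrative continuity) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the frame representation, the difference threshold, and the `vlm_caption`/`llm_polish` callables are all hypothetical stand-ins for real models.

```python
def select_keyframes(frames, threshold=0.3):
    """Keep a frame when it differs enough from the last kept keyframe.

    `frames` is a list of equal-length pixel-intensity vectors (one per
    frame); the difference is the mean absolute pixel change in [0, 1].
    Real pipelines would use a proper video decoder and perceptual metric.
    """
    if not frames:
        return []
    keyframe_ids = [0]  # always keep the first frame
    last = frames[0]
    for i, frame in enumerate(frames[1:], start=1):
        diff = sum(abs(a - b) for a, b in zip(last, frame)) / len(frame)
        if diff > threshold:
            keyframe_ids.append(i)
            last = frame
    return keyframe_ids


def caption_keyframes(frames, keyframe_ids, vlm_caption, llm_polish):
    """Caption each keyframe with a VLM, then have an LLM polish the
    whole caption sequence for narrative continuity."""
    raw_captions = [vlm_caption(frames[i]) for i in keyframe_ids]
    return llm_polish(raw_captions)


# Toy usage with stand-in models: three 4-pixel "frames"; the middle
# frame barely changes and should be skipped as a near-duplicate.
frames = [
    [0.0, 0.0, 0.0, 0.0],
    [0.05, 0.0, 0.0, 0.0],  # mean change 0.0125 < 0.3 -> skipped
    [1.0, 1.0, 1.0, 1.0],   # mean change 1.0 > 0.3 -> keyframe
]
ids = select_keyframes(frames, threshold=0.3)
story = caption_keyframes(
    frames, ids,
    vlm_caption=lambda f: f"scene with mean brightness {sum(f) / len(f):.1f}",
    llm_polish=lambda caps: " Then, ".join(caps),
)
print(ids)    # [0, 2]
print(story)
```

In a real system, `vlm_caption` and `llm_polish` would be calls to actual vision-language and language models; the structure above only shows how the three stages compose.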

