Openstory++: A Large-scale Dataset and Benchmark for Instance-aware Open-domain Visual Storytelling
August 7, 2024
作者: Zilyu Ye, Jinxiu Liu, Ruotian Peng, Jinjin Cao, Zhiyang Chen, Yiyang Zhang, Ziwei Xuan, Mingyuan Zhou, Xiaoqian Shen, Mohamed Elhoseiny, Qi Liu, Guo-Jun Qi
cs.AI
Abstract
Recent image generation models excel at creating high-quality images from
brief captions. However, they fail to maintain consistency of multiple
instances across images when encountering lengthy contexts. This inconsistency
is largely due to the absence of granular instance feature labeling in
existing training datasets. To tackle these
issues, we introduce Openstory++, a large-scale dataset combining additional
instance-level annotations with both images and text. Furthermore, we develop a
training methodology that emphasizes entity-centric image-text generation,
ensuring that the models learn to effectively interweave visual and textual
information. Specifically, Openstory++ streamlines the process of keyframe
extraction from open-domain videos, employing vision-language models to
generate captions that are then polished by a large language model for
narrative continuity. It surpasses previous datasets by offering a more
expansive open-domain resource, which incorporates automated captioning,
high-resolution imagery tailored for instance count, and extensive frame
sequences for temporal consistency. Additionally, we present Cohere-Bench, a
pioneering benchmark framework for evaluating the image generation tasks when
long multimodal context is provided, including the ability to keep the
background, style, and instances in the given context coherent. Compared to
existing benchmarks, our work fills critical gaps in multi-modal generation,
propelling the development of models that can adeptly generate and interpret
complex narratives in open-domain environments. Experiments conducted within
Cohere-Bench confirm the superiority of Openstory++ in nurturing high-quality
visual storytelling models, enhancing their ability to address open-domain
generation tasks. More details can be found at https://openstorypp.github.io/
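The data-curation pipeline the abstract describes (keyframe extraction from open-domain videos, vision-language-model captioning, then LLM polishing for narrative continuity) can be sketched as follows. This is an illustrative sketch, not the authors' code: frames are stubbed as grayscale histograms, the keyframe rule is a simple shot-change heuristic, and `caption_frame` / `polish_captions` are hypothetical placeholders for real VLM and LLM calls.

```python
def histogram_diff(h1, h2):
    """L1 distance between two normalized grayscale histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def select_keyframes(histograms, threshold=0.5):
    """Keep frame 0 plus any frame whose histogram differs enough from
    the last selected keyframe (a simple shot-change cue)."""
    keyframes = [0]
    for i in range(1, len(histograms)):
        if histogram_diff(histograms[keyframes[-1]], histograms[i]) > threshold:
            keyframes.append(i)
    return keyframes

def caption_frame(frame_id):
    """Placeholder for a vision-language model call."""
    return f"frame {frame_id}: <VLM caption>"

def polish_captions(captions):
    """Placeholder for an LLM pass that rewrites per-frame captions
    into one coherent narrative (e.g., resolving entity references)."""
    return " Then ".join(captions)

# Toy video: three near-identical frames (shot A), then a scene change (shot B).
hists = [
    [0.9, 0.1], [0.88, 0.12], [0.9, 0.1],  # shot A
    [0.1, 0.9], [0.12, 0.88],              # shot B
]
keys = select_keyframes(hists, threshold=0.5)
story = polish_captions([caption_frame(i) for i in keys])
print(keys)   # → [0, 3]
print(story)
```

In a real pipeline the histogram stub would be replaced by actual frame features, and the two placeholders by model inference; the structure (select, caption, polish) mirrors the three stages named in the abstract.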