AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
July 17, 2025
Authors: Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
cs.AI
Abstract
Controllable captioning is essential for precise multimodal alignment and
instruction following, yet existing models often lack fine-grained control and
reliable evaluation protocols. To address this gap, we present the AnyCap
Project, an integrated solution spanning model, dataset, and evaluation. We
introduce AnyCapModel (ACM), a lightweight plug-and-play framework that
enhances the controllability of existing foundation models for omni-modal
captioning without retraining the base model. ACM reuses the original captions
from base models while incorporating user instructions and modality features to
generate improved captions. To remedy the data scarcity in controllable
multimodal captioning, we build AnyCapDataset (ACD), covering three modalities,
28 user-instruction types, and 300k high-quality data entries. We further
propose AnyCapEval, a new benchmark that provides more reliable evaluation
metrics for controllable captioning by decoupling content accuracy and
stylistic fidelity. ACM markedly improves caption quality across a diverse set
of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores
by 45% and style scores by 12%, and it also achieves substantial gains on
widely used benchmarks such as MIA-Bench and VidCapBench.
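
The plug-and-play idea described in the abstract (a frozen base model produces a draft caption, and a separate module refines it using the user instruction and modality features) might be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: `CaptionRequest`, `controllable_caption`, `toy_base`, and `toy_refiner` are hypothetical names, and the toy refiner is a trivial string rule standing in for a learned model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CaptionRequest:
    """Hypothetical container for one captioning request."""
    instruction: str            # user instruction, e.g. a style or length constraint
    features: Sequence[float]   # placeholder for image/video/audio features


# Both roles are modeled as plain callables so any base model can be plugged in.
BaseCaptioner = Callable[[CaptionRequest], str]
Refiner = Callable[[str, CaptionRequest], str]


def controllable_caption(request: CaptionRequest,
                         base: BaseCaptioner,
                         refiner: Refiner) -> str:
    """Get a draft from the untouched base model, then refine it."""
    draft = base(request)            # the base model is used as-is, never retrained
    return refiner(draft, request)   # refiner sees draft, instruction, and features


# Toy stand-ins (assumptions for illustration only).
def toy_base(request: CaptionRequest) -> str:
    return "A dog runs across a grassy field near a fence."


def toy_refiner(draft: str, request: CaptionRequest) -> str:
    # Trivial rule: honor a "brief" instruction by keeping the first four words.
    if "brief" in request.instruction.lower():
        return " ".join(draft.split()[:4]).rstrip(".,") + "."
    return draft


if __name__ == "__main__":
    req = CaptionRequest(instruction="Give a brief caption", features=[0.1, 0.2])
    print(controllable_caption(req, toy_base, toy_refiner))
```

Keeping the base captioner and the refiner as separate callables mirrors the claim that the base model needs no retraining: swapping in a different base model changes only the `base` argument.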