

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

July 17, 2025
作者: Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
cs.AI

Abstract

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.