AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

Abstract

Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
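
The plug-and-play arrangement described in the abstract can be pictured with a minimal sketch. It assumes a frozen base captioner that produces a draft and a separate lightweight refiner that receives the media, the user instruction, and that draft. All names here (`PlugAndPlayCaptioner`, `toy_base`, `toy_refiner`) are hypothetical and chosen only for illustration; they are not taken from the AnyCap code, and the toy refiner is a hand-written rule standing in for a learned model.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces, for illustration only (not the AnyCap API).
# Media is passed as raw bytes; real modality feature extraction is abstracted away.
BaseCaptioner = Callable[[bytes, str], str]   # (media, instruction) -> draft caption
Refiner = Callable[[bytes, str, str], str]    # (media, instruction, draft) -> refined caption


@dataclass
class PlugAndPlayCaptioner:
    base: BaseCaptioner   # existing foundation model, used as-is and never retrained
    refiner: Refiner      # lightweight module standing in for ACM

    def caption(self, media: bytes, instruction: str) -> str:
        # Step 1: reuse the base model's original caption as a draft.
        draft = self.base(media, instruction)
        # Step 2: refine the draft using the instruction and the media.
        return self.refiner(media, instruction, draft)


def toy_base(media: bytes, instruction: str) -> str:
    # Toy base model that ignores the length constraint in the instruction.
    return "A dog runs across a field. The sky is blue."


def toy_refiner(media: bytes, instruction: str, draft: str) -> str:
    # Toy rule: enforce a "one sentence" instruction by keeping the first sentence.
    if "one sentence" in instruction.lower():
        return draft.split(". ")[0].rstrip(".") + "."
    return draft


pipeline = PlugAndPlayCaptioner(base=toy_base, refiner=toy_refiner)
print(pipeline.caption(b"", "Describe the image in one sentence."))
```

Because the base model is only called, never modified, the same refiner could in principle be placed behind different base captioners, which is the property the abstract refers to as plug-and-play.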
