OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Abstract

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the novel video-to-script (V2S) task, which aims to generate hierarchical, scene-by-scene scripts encompassing character actions, dialogues, expressions, and audio cues. To facilitate this, we construct a first-of-its-kind human-annotated benchmark and propose a temporally aware hierarchical evaluation framework. Furthermore, we present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline: chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning with temporally segmented rewards. Extensive experiments demonstrate that, despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
