OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video

Abstract

Current multimodal large language models (MLLMs) have demonstrated remarkable capabilities in short-form video understanding, yet translating long-form cinematic videos into detailed, temporally grounded scripts remains a significant challenge. This paper introduces the video-to-script (V2S) task, which aims to generate hierarchical, scene-by-scene scripts covering character actions, dialogues, expressions, and audio cues. To support this task, we construct a first-of-its-kind human-annotated benchmark and propose a temporally aware hierarchical evaluation framework. We also present OmniScript, an 8B-parameter omni-modal (audio-visual) language model tailored for long-form narrative comprehension. OmniScript is trained via a progressive pipeline: chain-of-thought supervised fine-tuning for plot and character reasoning, followed by reinforcement learning with temporally segmented rewards. Extensive experiments demonstrate that, despite its parameter efficiency, OmniScript significantly outperforms larger open-source models and achieves performance comparable to state-of-the-art proprietary models, including Gemini 3-Pro, in both temporal localization and multi-field semantic accuracy.
