CutClaw: Agentic Hours-Long Video Editing via Music Synchronization

March 31, 2026
Authors: Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun
cs.AI

Abstract

Editing video content in alignment with audio has become a form of digital craft on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that edits hours-long raw footage into meaningful short videos by using multiple Multimodal Language Models (MLLMs) as an agent system. It produces videos that are music-synchronized, instruction-following, and visually appealing. In detail, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structure across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the overall storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
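The idea of anchoring visual scenes to musical shifts can be illustrated with a minimal sketch. It is not taken from the CutClaw code base: the function name `snap_to_beats` and its inputs are hypothetical, and it assumes that beat timestamps (in seconds) have already been extracted from the music track by some other tool. The sketch only moves each proposed cut time to the nearest beat.

```python
from bisect import bisect_left


def snap_to_beats(cut_times, beat_times):
    """Move each proposed cut time (seconds) to the nearest beat timestamp.

    Hypothetical helper for illustration only; not the authors' method.
    """
    if not beat_times:
        raise ValueError("beat_times must not be empty")
    beats = sorted(beat_times)
    snapped = []
    for t in cut_times:
        i = bisect_left(beats, t)
        # Candidates: the beat just before and just after t, when they exist.
        candidates = beats[max(i - 1, 0): i + 1]
        snapped.append(min(candidates, key=lambda b: abs(b - t)))
    return snapped


if __name__ == "__main__":
    beats = [0.0, 0.5, 1.0, 1.5]  # assumed beat grid at 120 BPM
    print(snap_to_beats([0.6, 1.3, 2.0], beats))
```

A binary search keeps each lookup logarithmic in the number of beats, which matters little for a short clip but is convenient when the beat grid covers a long track.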