CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
March 31, 2026
Authors: Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun
cs.AI
Abstract
Editing video content in alignment with audio has become a form of digital human craftsmanship on today's social media. However, the time-consuming and repetitive nature of manual video editing has long challenged filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that edits hours-long raw footage into meaningful short videos by leveraging multiple Multimodal Language Models (MLLMs) as an agent system. It produces music-synchronized, instruction-following, and visually appealing videos. In detail, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structure across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the overall storytelling flow, structures the long-term narrative, and anchors visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content according to rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
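The three-stage flow described above (decompose footage, plan scenes against musical shifts, then edit and review cuts) can be sketched minimally as follows. All class names, function names, and policies here are illustrative assumptions for exposition, not the authors' actual implementation; in the real system each stage is driven by MLLM agents rather than fixed rules.

```python
from dataclasses import dataclass

# Hypothetical sketch of the Playwriter -> Editor -> Reviewer pipeline.
# Names and data shapes are illustrative assumptions, not CutClaw's code.

@dataclass
class Scene:
    start: float   # seconds into the raw footage
    end: float
    caption: str   # e.g., an MLLM-generated description of the clip

def playwriter(scenes, beat_times):
    """Anchor one scene to each musical shift, in order (toy policy)."""
    return [(beat, scenes[i % len(scenes)])
            for i, beat in enumerate(beat_times)]

def editor(plan, clip_len=2.0):
    """Trim each planned scene to a fixed-length cut at its beat (toy policy)."""
    return [(beat, Scene(s.start, min(s.end, s.start + clip_len), s.caption))
            for beat, s in plan]

def reviewer(cuts, min_len=0.5):
    """Reject cuts too short to be visually meaningful (toy criterion)."""
    return [(b, s) for b, s in cuts if s.end - s.start >= min_len]

scenes = [Scene(0.0, 5.0, "opening drone shot"),
          Scene(5.0, 5.3, "blurry transition"),
          Scene(10.0, 18.0, "street interview")]
beats = [0.0, 1.9, 3.8]  # musical shift timestamps, e.g. from beat tracking

final_cut = reviewer(editor(playwriter(scenes, beats)))
print(len(final_cut))  # the 0.3 s "blurry transition" cut is filtered out
```

The key structural point the sketch illustrates is the division of labor: the Playwriter commits to a beat-anchored narrative plan before any frames are trimmed, while the Editor and Reviewer iterate only over fine-grained cut selection.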