UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
November 11, 2025
Authors: Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei
cs.AI
Abstract
While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intent and decomposes it into structured video-processing steps, while executor agents carry these out through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative, any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
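To make the Plan-and-Act loop concrete, below is a minimal Python sketch of how a planner agent, an executor agent, and the three-level memory described in the abstract might fit together. All names here (Memory, Step, PlannerAgent, ExecutorAgent) are illustrative stand-ins rather than the UniVA API, and the lambda "tool servers" are placeholders for the MCP-based services the paper describes.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hierarchical multi-level memory: global knowledge, task context,
# and user-specific preferences (mirroring the abstract's description).
@dataclass
class Memory:
    global_knowledge: Dict[str, str] = field(default_factory=dict)
    task_context: List[str] = field(default_factory=list)
    user_preferences: Dict[str, str] = field(default_factory=dict)

@dataclass
class Step:
    tool: str          # name of the tool server to invoke
    instruction: str   # natural-language instruction for that tool

class PlannerAgent:
    """Decomposes a user request into structured video-processing steps."""
    def plan(self, request: str, memory: Memory) -> List[Step]:
        # A real planner would query an LLM conditioned on memory;
        # this fixed three-step workflow is for illustration only.
        grade = memory.user_preferences.get("color_grade", "neutral")
        return [
            Step("generation", f"generate a clip for: {request}"),
            Step("editing", f"apply a {grade} color grade"),
            Step("segmentation", "segment the main object for compositing"),
        ]

class ExecutorAgent:
    """Routes each step to a modular tool server and records the result."""
    def __init__(self, tool_servers: Dict[str, Callable[[str], str]]):
        self.tool_servers = tool_servers

    def execute(self, steps: List[Step], memory: Memory) -> List[str]:
        results = []
        for step in steps:
            result = self.tool_servers[step.tool](step.instruction)
            memory.task_context.append(result)  # full traceability
            results.append(result)
        return results

# Stand-in tool servers; in UniVA these would be MCP-based services.
tools = {
    "generation": lambda x: f"[video generated: {x}]",
    "editing": lambda x: f"[video edited: {x}]",
    "segmentation": lambda x: f"[masks produced: {x}]",
}

memory = Memory(user_preferences={"color_grade": "warm"})
steps = PlannerAgent().plan("a cat surfing at sunset", memory)
print(ExecutorAgent(tools).execute(steps, memory))

Because every tool result is appended to the task context, a later round of editing or self-reflection can read back the full history of the workflow, which is one plausible way the "interactive and self-reflective video creation with full traceability" property could be realized.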