UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
November 11, 2025
Authors: Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei
cs.AI
Abstract
While specialized AI models excel at isolated video tasks such as generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intent and decomposes it into structured video-processing steps, while executor agents carry those steps out through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative, any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
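To make the Plan-and-Act architecture concrete, below is a minimal, hypothetical Python sketch of the planner/executor loop with a hierarchical memory and pluggable tool servers. All class and method names (`Planner`, `Executor`, `Memory`, the stub tool callables) are illustrative assumptions, not the actual UniVA API; plain callables stand in for the paper's MCP-based tool servers.

```python
# Hypothetical sketch of a Plan-and-Act loop in the spirit of UniVA.
# Names and interfaces are assumptions for illustration, not the real codebase;
# plain callables stand in for MCP-based tool servers.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hierarchical multi-level memory: global knowledge, task context, user prefs."""
    global_knowledge: dict = field(default_factory=dict)
    task_context: list = field(default_factory=list)      # per-step trace for traceability
    user_preferences: dict = field(default_factory=dict)  # e.g. preferred visual style


@dataclass
class Step:
    tool: str   # name of a tool server, e.g. "generate", "edit", "segment"
    args: dict  # structured arguments produced by the planner


class Planner:
    """Decomposes a user request into structured video-processing steps."""
    def plan(self, request: str, memory: Memory) -> list[Step]:
        # A real planner would query an LLM; here one example flow is hard-coded:
        # text-conditioned generation -> multi-round editing -> object segmentation.
        return [
            Step("generate", {"prompt": request}),
            Step("edit", {"instruction": "apply user style", **memory.user_preferences}),
            Step("segment", {"target": "foreground objects"}),
        ]


class Executor:
    """Dispatches each step to a modular tool server and records the trace."""
    def __init__(self, tool_servers: dict):
        self.tool_servers = tool_servers  # tool name -> callable (MCP stand-in)

    def act(self, steps: list[Step], memory: Memory):
        artifact = None
        for step in steps:
            artifact = self.tool_servers[step.tool](artifact, step.args)
            memory.task_context.append((step.tool, step.args))  # full traceability
        return artifact


# Usage: wire up stub tool servers and run one request end to end.
if __name__ == "__main__":
    servers = {
        "generate": lambda video, args: f"video<{args['prompt']}>",
        "edit": lambda video, args: f"edited({video})",
        "segment": lambda video, args: f"masks({video})",
    }
    memory = Memory(user_preferences={"style": "cinematic"})
    steps = Planner().plan("a dog surfing at sunset", memory)
    result = Executor(servers).act(steps, memory)
    print(result)               # masks(edited(video<a dog surfing at sunset>))
    print(memory.task_context)  # step-by-step trace of the workflow
```

The design choice sketched here mirrors the abstract's separation of concerns: the planner owns intent decomposition, the executor owns tool dispatch, and the shared memory carries task context across steps so each stage can condition on earlier outputs.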