Aurora: Unified Video Editing with a Tool-Using Agent

Abstract

Recent video editing models have converged on a unified conditioning design: a single diffusion transformer jointly consumes text, source video, and reference images, and one set of weights covers replacement, removal, style transfer, and reference-driven insertion. The design is flexible, but it assumes that the user already provides model-ready text, reference images, and spatial grounding for local edits, all of which real requests often omit. We present Aurora, an agentic video editing framework that pairs a tool-augmented vision-language model (VLM) agent with a unified video diffusion transformer. The VLM agent maps a raw user request to a structured edit plan aligned with the transformer's conditioning channels, thereby resolving textual and visual underspecification before generation. We train the VLM agent with supervised data for complete edit planning and reference-image selection, together with preference pairs for robust tool use and instruction refinement. We introduce AgentEdit-Bench to evaluate agent-enhanced video editing under textual and visual underspecification. Experiments on AgentEdit-Bench and two existing video editing benchmarks show that Aurora improves over instruction-only baselines and that the VLM agent transfers to compatible frozen video editing models. Project page: https://yeates.github.io/Aurora-Page
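As a rough illustration of what a structured edit plan aligned with the conditioning channels might look like, the sketch below models a plan as a small record and checks which channels a raw request leaves unfilled. Every name here (`EditPlan`, `missing_fields`, the field names, the bounding-box form of spatial grounding, and the choice of which edit types count as local) is an assumption made for illustration; the abstract does not specify Aurora's actual plan schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Edit types named in the abstract; the identifiers themselves are hypothetical.
EDIT_TYPES = {"replacement", "removal", "style_transfer", "reference_insertion"}
# Assumption: these edit types are local and therefore need spatial grounding.
LOCAL_EDITS = {"replacement", "removal", "reference_insertion"}


@dataclass
class EditPlan:
    """Hypothetical plan with one field per conditioning channel."""
    edit_type: str
    prompt: str                                    # model-ready text
    source_video: str                              # path or id of the source video
    reference_images: list[str] = field(default_factory=list)
    region: Optional[tuple[int, int, int, int]] = None  # assumed box (x0, y0, x1, y1)


def missing_fields(plan: EditPlan) -> list[str]:
    """Return the channels that are still underspecified for this plan."""
    if plan.edit_type not in EDIT_TYPES:
        raise ValueError(f"unknown edit type: {plan.edit_type}")
    missing = []
    if not plan.prompt.strip():
        missing.append("prompt")
    if plan.edit_type == "reference_insertion" and not plan.reference_images:
        missing.append("reference_images")
    if plan.edit_type in LOCAL_EDITS and plan.region is None:
        missing.append("region")
    return missing
```

In a design like this, an agent would keep calling tools (for example, to pick a reference image or localize a region) until `missing_fields` returns an empty list, and only then hand the plan to the editing model.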