

Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

August 1, 2025
Author: Alexia Jolicoeur-Martineau
cs.AI

Abstract

While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but they lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi-shot, multi-agent) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system.

We propose AVR-Eval, a relative metric for multimedia content quality based on Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two pieces of content, and a text model reviews the evaluations to determine which is superior. We show that AVR-Eval reliably distinguishes good content from broken or mismatched content.

We built AVR-Agent, a multi-agent system that generates JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial code versions, uses AVR-Eval to identify the best one, and iteratively improves it through omni-modal agent feedback derived from the AVR.

We run experiments on games and animations, evaluated with AVR-Eval (the win rate of content A against content B). We find that content generated by AVR-Agent has a significantly higher win rate against content produced through one-shot generation. However, the models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate with either. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine approaches to content creation.
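The AVR-Agent loop described above (sample several initial versions, keep the AVR-Eval winner, then revise it under AVR feedback) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `avr_eval_wins`, `record_avr`, and `avr_agent` are hypothetical, and a scalar `quality` field stands in for real JavaScript content and its omni-modal judgment.

```python
# Hypothetical sketch of the AVR-Agent generate/select/iterate loop.
# A dict with a toy "quality" score stands in for generated JavaScript
# content; the real system records an AVR of the running content and
# has an omni-modal model judge it.

def record_avr(content):
    """Stand-in for running the content and capturing its audio-visual recording."""
    return {"quality": content["quality"]}

def avr_eval_wins(avr_a, avr_b):
    """Toy stand-in for AVR-Eval: in the paper, an omni-modal model compares
    two AVRs and a text model reviews the verdict; here we compare a proxy score."""
    return avr_a["quality"] > avr_b["quality"]

def avr_agent(generate, revise, n_initial=4, n_rounds=3):
    """Generate several initial versions, keep the AVR-Eval winner,
    then iteratively revise the winner using feedback from its AVR."""
    candidates = [generate() for _ in range(n_initial)]
    best = candidates[0]
    for cand in candidates[1:]:  # pairwise selection of the best initial version
        if avr_eval_wins(record_avr(cand), record_avr(best)):
            best = cand
    for _ in range(n_rounds):  # feedback-driven refinement, keep only improvements
        revised = revise(best, record_avr(best))
        if avr_eval_wins(record_avr(revised), record_avr(best)):
            best = revised
    return best
```

Note the accept-only-if-better step: a revision is kept only when AVR-Eval judges it to beat the current best, so a bad revision round cannot degrade the output.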
August 4, 2025