

Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

August 1, 2025
Author: Alexia Jolicoeur-Martineau
cs.AI

Abstract

While AI excels at generating text, audio, images, and videos, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but they lack automated evaluation metrics and struggle with complex content that normally requires teams of humans working for many months (multi-shot, multi-agent) using artist-made assets. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality based on Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two pieces of content, and a text model reviews the evaluations to determine which is superior. We show that AVR-Eval reliably distinguishes good content from broken or mismatched content. We built AVR-Agent, a multi-agent system that generates JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial code versions, uses AVR-Eval to identify the best one, and iteratively improves it through omni-modal agent feedback on the AVR. We run experiments on games and animations, evaluated with AVR-Eval (the win rate of content A against content B). We find that content generated by AVR-Agent has a significantly higher win rate than content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, showing no higher win rate. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to utilize these resources as effectively, highlighting fundamental differences between human and machine approaches to content creation.
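The AVR-Agent loop described above — generate several initial versions, pick the best via pairwise AVR-Eval comparisons, then iteratively refine while keeping a revision only if it wins against the current best — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`avr_eval`, `best_of`, `avr_agent`) are hypothetical, and a toy numeric "quality" score stands in for the omni-modal model's judgment of an actual audio-visual recording.

```python
def avr_eval(content_a, content_b):
    """Stand-in for AVR-Eval: returns True if content A beats content B.
    In the paper this would record an AVR of each content, have an
    omni-modal model compare them, and have a text model review the
    verdict; here we just compare a toy 'quality' score."""
    return content_a["quality"] > content_b["quality"]

def best_of(candidates):
    """Select the best candidate by round-robin pairwise wins,
    mirroring how AVR-Agent picks among initial code versions."""
    wins = [sum(avr_eval(a, b) for b in candidates if b is not a)
            for a in candidates]
    return candidates[max(range(len(candidates)), key=wins.__getitem__)]

def avr_agent(generate, improve, n_initial=4, n_rounds=3):
    """Sketch of the AVR-Agent loop: sample several initial versions,
    keep the pairwise winner, then iterate improvement with feedback,
    accepting a revision only if it beats the current best."""
    best = best_of([generate() for _ in range(n_initial)])
    for _ in range(n_rounds):
        revised = improve(best)  # would use omni-modal AVR feedback
        if avr_eval(revised, best):
            best = revised
    return best
```

Usage with deterministic stubs: if `generate` yields candidates of quality 3, 1, 4, 2 and `improve` raises quality by 1 each round, the loop starts from the quality-4 candidate and accepts each improvement. The accept-only-if-it-wins step is what makes the loop monotone under the (relative, not absolute) AVR-Eval metric.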