Multi-Agent Game Generation and Evaluation via Audio-Visual Recordings

Abstract

While AI excels at generating text, audio, images, and video, creating interactive audio-visual content such as video games remains challenging. Current LLMs can generate JavaScript games and animations, but there are no automated metrics for evaluating them, and the models struggle with complex content that normally takes teams of humans many months of work (multi-shot, multi-agent) using assets made by artists. To tackle these issues, we built a new metric and a multi-agent system. We propose AVR-Eval, a relative metric for multimedia content quality based on Audio-Visual Recordings (AVRs). An omni-modal model (processing text, video, and audio) compares the AVRs of two pieces of content, and a text model reviews the evaluations to decide which is better. We show that AVR-Eval correctly distinguishes good content from broken or mismatched content. We also built AVR-Agent, a multi-agent system that generates JavaScript code from a bank of multimedia assets (audio, images, 3D models). The coding agent selects relevant assets, generates multiple initial code candidates, uses AVR-Eval to identify the best one, and iteratively improves it using feedback from an omni-modal agent that reviews the AVR. We ran experiments on games and animations, scoring them with AVR-Eval as the win rate of content A against content B. Content generated by AVR-Agent has a significantly higher win rate against content made through one-shot generation. However, models struggle to leverage custom assets and AVR feedback effectively, and neither raised the win rate. This reveals a critical gap: while humans benefit from high-quality assets and audio-visual feedback, current coding models do not seem to use these resources as effectively, which highlights fundamental differences between human and machine approaches to content creation.
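
The pairwise win rate and the select-then-refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Judge`, `win_rate`, `select_best`, and `refine` are hypothetical, the judge is an arbitrary callable standing in for the omni-modal comparison plus text review, and both the sequential tournament and the rule of keeping a revision only when the judge prefers it are assumptions of this sketch.

```python
from typing import Callable, Sequence

# Hypothetical judge: returns True when content `a` is preferred over content `b`.
# In the abstract this role is played by an omni-modal model comparing two AVRs,
# with a text model reviewing the evaluation; here it is any caller-supplied callable.
Judge = Callable[[str, str], bool]


def win_rate(a_items: Sequence[str], b_items: Sequence[str], judge: Judge) -> float:
    """Fraction of paired comparisons in which the A item beats the B item."""
    if len(a_items) != len(b_items) or not a_items:
        raise ValueError("need two non-empty sequences of equal length")
    wins = sum(1 for a, b in zip(a_items, b_items) if judge(a, b))
    return wins / len(a_items)


def select_best(candidates: Sequence[str], judge: Judge) -> str:
    """Pick one candidate by sequential pairwise comparison (assumed scheme)."""
    if not candidates:
        raise ValueError("no candidates")
    best = candidates[0]
    for challenger in candidates[1:]:
        if judge(challenger, best):
            best = challenger
    return best


def refine(
    initial: Sequence[str],
    improve: Callable[[str], str],
    judge: Judge,
    rounds: int,
) -> str:
    """Choose the best initial candidate, then revise it for a fixed number of rounds.

    `improve` stands in for a feedback-driven rewrite of the code. A revision is
    kept only if the judge prefers it over the current best (an assumption).
    """
    best = select_best(initial, judge)
    for _ in range(rounds):
        revised = improve(best)
        if judge(revised, best):
            best = revised
    return best
```

With a toy judge that simply prefers the longer string, `win_rate(["aaa", "b"], ["aa", "bb"], judge)` gives 0.5, since the first pair is a win for A and the second is a loss. A real judge would record and compare audio-visual captures instead of string lengths.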
