PodAgent: A Comprehensive Framework for Podcast Generation

Abstract

Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in generating in-depth content and producing appropriate, expressive voices. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching, and 3) utilizes an LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness: it significantly surpasses direct GPT-4 generation in topic-discussion dialogue content, achieves an 87.4% voice-matching accuracy, and produces more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.
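To make the voice-role matching idea more concrete, the following is a minimal sketch of how roles might be paired with entries from a voice pool. It is not taken from the PodAgent source code: the `Voice` and `Role` classes, their attributes, the scoring weights, and the greedy assignment are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Voice:
    """One entry in a hypothetical voice pool (attributes are assumptions)."""
    voice_id: str
    gender: str
    age_group: str
    traits: frozenset = field(default_factory=frozenset)


@dataclass(frozen=True)
class Role:
    """A hypothetical speaker profile, e.g. for a host or a guest."""
    name: str
    gender: str
    age_group: str
    traits: frozenset = field(default_factory=frozenset)


def match_score(role: Role, voice: Voice) -> int:
    """Count matching attributes; the weights are an illustrative heuristic."""
    score = 0
    score += 2 if role.gender == voice.gender else 0
    score += 1 if role.age_group == voice.age_group else 0
    score += len(role.traits & voice.traits)  # number of shared style traits
    return score


def assign_voices(roles, pool):
    """Greedily give each role the best-scoring voice that is still unused."""
    available = list(pool)
    assignment = {}
    for role in roles:
        if not available:
            raise ValueError("voice pool is smaller than the number of roles")
        best = max(available, key=lambda v: match_score(role, v))
        assignment[role.name] = best.voice_id
        available.remove(best)  # each voice is used by at most one role
    return assignment


# Toy data, invented for this example.
pool = [
    Voice("v1", "female", "adult", frozenset({"calm", "warm"})),
    Voice("v2", "male", "senior", frozenset({"authoritative"})),
    Voice("v3", "male", "adult", frozenset({"energetic"})),
]
roles = [
    Role("Host", "male", "adult", frozenset({"energetic"})),
    Role("Guest", "female", "adult", frozenset({"calm"})),
]
print(assign_voices(roles, pool))
```

With this toy data the host is paired with `v3` and the guest with `v1`. A real system would likely rely on richer voice descriptions and on an LLM or learned model to judge suitability, so the attribute counting here should be read as a stand-in for that step.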
