AAD-LLM: Neural Attention-Driven Auditory Scene Understanding

Abstract

Auditory foundation models, including auditory large language models (LLMs), process all sound inputs equally, independent of listener perception. However, human auditory perception is inherently selective: listeners focus on specific speakers while ignoring others in complex auditory scenes. Existing models do not incorporate this selectivity, limiting their ability to generate perception-aligned responses. To address this, we introduce Intention-Informed Auditory Scene Understanding (II-ASU) and present Auditory Attention-Driven LLM (AAD-LLM), a prototype system that integrates brain signals to infer listener attention. AAD-LLM extends an auditory LLM by incorporating intracranial electroencephalography (iEEG) recordings to decode which speaker a listener is attending to and refine responses accordingly. The model first predicts the attended speaker from neural activity, then conditions response generation on this inferred attentional state. We evaluate AAD-LLM on speaker description, speech transcription and extraction, and question answering in multitalker scenarios, with both objective and subjective ratings showing improved alignment with listener intention. By taking a first step toward intention-aware auditory AI, this work explores a new paradigm where listener perception informs machine listening, paving the way for future listener-centered auditory systems. Demo and code available: https://aad-llm.github.io.
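The two-stage flow described above (first infer the attended speaker from neural activity, then condition response generation on that inference) can be illustrated with a toy sketch. This is not the AAD-LLM implementation: it swaps in a simple envelope-correlation decoder for the attention-decoding stage and a plain text prompt for the conditioning stage. All function names, the speaker labels, and the numeric envelopes are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def decode_attended_speaker(neural_envelope, speaker_envelopes):
    """Stage 1 (toy stand-in): choose the speaker whose speech envelope
    correlates best with an envelope reconstructed from neural activity."""
    scores = {
        name: pearson(neural_envelope, env)
        for name, env in speaker_envelopes.items()
    }
    return max(scores, key=scores.get), scores


def build_prompt(attended, question):
    """Stage 2 (toy stand-in): condition the request on the inferred
    attended speaker by stating it in the prompt text."""
    return (
        f"The listener is attending to {attended}. "
        f"Answer about that speaker only. Question: {question}"
    )


# Hypothetical data: two speakers with roughly opposite envelopes, and a
# neurally reconstructed envelope that tracks the first speaker.
speakers = {
    "speaker_1": [0.1, 0.9, 0.2, 0.8, 0.1, 0.7],
    "speaker_2": [0.8, 0.1, 0.9, 0.2, 0.7, 0.1],
}
neural = [0.2, 0.8, 0.3, 0.7, 0.2, 0.6]

attended, scores = decode_attended_speaker(neural, speakers)
prompt = build_prompt(attended, "What is being said?")
print(attended)
print(prompt)
```

In this sketch the decoded label is passed downstream as text; how the actual system represents and injects the attentional state is not specified in the abstract, so that part should be read as an assumption.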
