Mind the Gap! Static and Interactive Evaluations of Large Audio Models

Abstract

As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication of both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences in order to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluating LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify the primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how well static benchmarks predict interactive performance. Our analysis reveals that no individual benchmark correlates strongly with interactive results (τ ≤ 0.33 for all benchmarks). While combining multiple coarse-grained features yields modest predictive power (R^2 = 0.30), only two of the twenty datasets, covering spoken question answering and age prediction, show significantly positive correlations. This suggests a clear need to develop LAM evaluations that correlate better with user preferences.
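To make the reported rank correlation concrete, the following is a minimal sketch of how a tau statistic between static benchmark scores and interactive preference scores could be computed. It assumes that tau refers to Kendall's rank correlation and uses the simple tau-a variant with no tie correction; the abstract does not state which variant the authors used. The function name `kendall_tau` and all scores below are hypothetical and are not taken from the study.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: no tie correction. Tied pairs count as neither
    concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    concordant = 0
    discordant = 0
    # Compare every pair of models: do both measures order them the same way?
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical scores for five models (illustrative only).
    benchmark_accuracy = [0.62, 0.71, 0.55, 0.80, 0.67]
    user_preference = [0.48, 0.52, 0.51, 0.50, 0.45]
    print(kendall_tau(benchmark_accuracy, user_preference))
```

A value near 1 would mean the benchmark ranks models almost exactly as users do, while a value near 0 means the benchmark ordering carries little information about user preference.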
