Mind the Gap! Static and Interactive Evaluations of Large Audio Models

February 21, 2025
Authors: Minzhi Li, William Barr Held, Michael J Ryan, Kunat Pipatanakul, Potsawee Manakul, Hao Zhu, Diyi Yang
cs.AI

Abstract

As AI chatbots become ubiquitous, voice interaction presents a compelling way to enable rapid, high-bandwidth communication for both semantic and social signals. This has driven research into Large Audio Models (LAMs) to power voice-native experiences. However, aligning LAM development with user goals requires a clear understanding of user needs and preferences in order to establish reliable progress metrics. This study addresses these challenges by introducing an interactive approach to evaluating LAMs and collecting 7,500 LAM interactions from 484 participants. Through topic modeling of user queries, we identify primary use cases for audio interfaces. We then analyze user preference rankings and qualitative feedback to determine which models best align with user needs. Finally, we evaluate how well static benchmarks predict interactive performance: our analysis reveals that no individual benchmark strongly correlates with interactive results (τ ≤ 0.33 for all benchmarks). While combining multiple coarse-grained features yields modest predictive power (R^2 = 0.30), only two out of twenty datasets on spoken question answering and age prediction show significant positive correlations. This suggests a clear need to develop LAM evaluations that better correlate with user preferences.
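
To make the correlation figure concrete, the following is a minimal sketch of how a rank correlation between one static benchmark and interactive results could be computed. The abstract does not say which tau statistic is used, so this assumes Kendall's tau (the tau-a variant, which ignores tie corrections). The function name `kendall_tau`, the variable names, and all numbers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Assumption: tau-a is used for simplicity; tied pairs count as neither
    concordant nor discordant, and no tie correction is applied.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both scores order this model pair the same way
        elif sign < 0:
            discordant += 1  # the two scores disagree on this model pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for five models (illustrative only, not from the paper).
benchmark_accuracy = [0.62, 0.71, 0.55, 0.80, 0.67]    # one static benchmark
interactive_win_rate = [0.48, 0.50, 0.52, 0.58, 0.41]  # user preference rate

tau = kendall_tau(benchmark_accuracy, interactive_win_rate)
print(f"Kendall tau-a = {tau:.2f}")
```

A value near 1 would mean the benchmark ranks models almost exactly as users do, while a value near 0 means the benchmark ranking carries little information about user preference.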
