SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
February 13, 2026
Authors: Yuejie Li, Ke Yang, Yueying Hua, Berlin Chen, Jianhao Nie, Yueping He, Caixin Kang
cs.AI
Abstract
Spoken query retrieval is an important interaction mode in modern information retrieval. However, existing evaluation datasets are often limited to simple queries under constrained noise conditions, which makes them inadequate for assessing the robustness of spoken query retrieval systems under complex acoustic perturbations. To address this limitation, we present SQuTR, a robustness benchmark for spoken query retrieval that comprises a large-scale dataset and a unified evaluation protocol. SQuTR aggregates 37,317 unique queries from six commonly used English and Chinese text retrieval datasets, spanning multiple domains and diverse query types. We synthesize speech using voice profiles from 200 real speakers and mix in 17 categories of real-world environmental noise at controlled signal-to-noise ratio (SNR) levels, enabling reproducible robustness evaluation from quiet to highly noisy conditions. Under the unified protocol, we conduct large-scale evaluations of representative cascaded and end-to-end retrieval systems. Experimental results show that retrieval performance decreases as noise increases, and that the size of the drop differs substantially across systems. Even large-scale retrieval models struggle under extreme noise, indicating that robustness remains a critical bottleneck. Overall, SQuTR provides a reproducible testbed for benchmarking and diagnostic analysis, and facilitates future research on robustness in spoken query to text retrieval.
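The abstract does not spell out how noise is mixed at a controlled SNR, so the following is only a minimal sketch of the standard approach, not the authors' pipeline. The function names (`mix_at_snr`, `measured_snr_db`), the synthetic sine "speech", the Gaussian "noise", and the SNR levels in the loop are all assumptions made for illustration. The noise is scaled so that 10 * log10(P_speech / P_noise) equals the requested value in dB.

```python
import math
import random


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB.

    Hypothetical helper: illustrates generic SNR-controlled mixing only.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Repeat the noise clip until it covers the utterance, then trim it.
    reps = math.ceil(len(speech) / len(noise))
    noise = (list(noise) * reps)[: len(speech)]
    speech_power = mean_power(speech)
    noise_power = mean_power(noise)
    if speech_power == 0.0 or noise_power == 0.0:
        raise ValueError("speech and noise must have non-zero power")
    # Solve snr_db = 10 * log10(speech_power / (scale**2 * noise_power)) for scale.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Synthetic stand-ins: one second of a 220 Hz tone at 16 kHz and a short
# Gaussian noise clip. Real use would load recorded speech and noise instead.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

for snr_db in (20, 10, 0, -5):  # illustrative levels, not taken from the paper
    mixture = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, mixture), 3))
```

Because the scale factor is computed from the trimmed noise segment that is actually added, the measured SNR matches the requested one up to floating-point error, which is what makes a sweep over noise levels reproducible.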