Convex Low-Resource Accent-Robust Language Detection in Speech Recognition

Abstract

Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language and causing cascading errors in downstream dialogue tasks. Addressing this dialectal variation under low-resource constraints remains an open challenge, because standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue system pipeline. Our method is implemented efficiently with a multi-GPU Alternating Direction Method of Multipliers (ADMM) solver in JAX, providing global optimality guarantees and fast, polynomial-time training. Theoretically, we prove that our convex objective induces certified margin stability and provide guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to dialectal variation in the input, achieving 97-98% accuracy in challenging low-resource regimes. Our open-source package is available at https://pypi.org/project/jaxcld/.
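
To make the ADMM ingredient concrete, here is a minimal single-device JAX sketch. It is not the CLD objective and not the jaxcld API, neither of which is specified in this abstract. As a stand-in convex problem it assumes an L1-regularized least-squares fit of a linear scorer to ±1 language labels. The function names (`soft_threshold`, `admm_lasso`), the synthetic features, and the values of `lam` and `rho` are illustrative assumptions.

```python
import jax
import jax.numpy as jnp
from jax.scipy.linalg import cho_factor, cho_solve


def soft_threshold(v, kappa):
    # Proximal operator of kappa * ||.||_1 (elementwise shrinkage).
    return jnp.sign(v) * jnp.maximum(jnp.abs(v) - kappa, 0.0)


def admm_lasso(A, b, lam=0.05, rho=1.0, num_iters=200):
    """Scaled-form ADMM for min_x (1/2n)*||A x - b||^2 + lam*||x||_1.

    Hypothetical stand-in objective, chosen only because it is convex
    and has closed-form ADMM updates.
    """
    n, d = A.shape
    # The x-update solves the same linear system each iteration, so factor once.
    chol = cho_factor(A.T @ A / n + rho * jnp.eye(d))
    Atb = A.T @ b / n

    def step(_, state):
        x, z, u = state
        x = cho_solve(chol, Atb + rho * (z - u))   # quadratic subproblem
        z = soft_threshold(x + u, lam / rho)       # L1 proximal step
        u = u + x - z                              # scaled dual update
        return x, z, u

    zeros = jnp.zeros(d)
    _, z, _ = jax.lax.fori_loop(0, num_iters, step, (zeros, zeros, zeros))
    return z  # z carries the exact zeros produced by the shrinkage


# Synthetic two-class "language" features (assumed data, for illustration only).
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
n, d = 400, 20
w_true = jnp.zeros(d).at[:3].set(jnp.array([2.0, -1.5, 1.0]))
X = jax.random.normal(k1, (n, d))
y = jnp.sign(X @ w_true + 0.1 * jax.random.normal(k2, (n,)))

w = admm_lasso(X, y)
acc = jnp.mean(jnp.sign(X @ w) == y)
print("training accuracy:", float(acc))
print("nonzero weights:", int(jnp.sum(w != 0)))
```

Because the stand-in objective is convex, ADMM converges to a global minimizer from any starting point, which is the property the abstract relies on. A multi-GPU variant would typically split the rows of `X` across devices and add a consensus step, which this sketch does not attempt.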