音声認識における凸最適化に基づく低リソース・アクセント頑健言語検出

要旨

グローバル化と多文化主義の進展により、言語変種はますます多様化している。しかし、現在の音声対話システムは、代表性の低い方言やアクセントに対してしばしば失敗し、入力言語を誤認することで、下流の対話タスクに連鎖的な障害を引き起こす。このような方言変異に対処することは、低リソース条件下では依然として未解決の課題であり、標準的なファインチューニングは計算コストが高く、高次元の音声データに対して過学習しやすい。本稿では、音声対話システムのパイプラインに理論的に基づいた凸最適化手法を統合する、新たなフレームワーク「凸言語検出（CLD）」を提案する。本手法は、JAXにおけるマルチGPU向け交互方向乗数法（ADMM）により効率的に実装され、大域的最適性の保証と多項式時間での高速な学習を実現する。理論的には、凸目的関数が認証されたマージン安定性を導くことを証明し、特徴量摂動に対する保証を提供する。実証的には、サンプル効率と入力方言変異に対する頑健性を示し、困難な低リソース環境下で97～98%の精度を達成する。オープンソースパッケージは https://pypi.org/project/jaxcld/ で入手可能である。

English

Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language and causing cascading failures in downstream dialogue tasks. Addressing this dialectal variance under low-resource constraints remains an open challenge, as standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue systems pipeline. Our method is efficiently implemented via multi-GPU Alternating Direction Method of Multipliers (ADMM) in JAX, thus providing global optimality guarantees and fast training in polynomial time. Theoretically, we prove that our convex objective induces certified margin stability and provide guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to input dialectical variation, achieving 97-98% accuracy in challenging low-resource regimes. Our open-source package is available at https://pypi.org/project/jaxcld/