Convex Low-Resource Accent-Robust Language Detection in Speech Recognition

Abstract

Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language and causing cascading failures in downstream dialogue tasks. Addressing this dialectal variation under low-resource constraints remains an open challenge, because standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue system pipeline. Our method is implemented efficiently in JAX via a multi-GPU Alternating Direction Method of Multipliers (ADMM), providing global optimality guarantees and fast, polynomial-time training. Theoretically, we prove that our convex objective induces certified margin stability and provides guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to dialectal variation in the input, achieving 97-98% accuracy in challenging low-resource regimes. Our open-source package is available at https://pypi.org/project/jaxcld/.