Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

Abstract

Converting spoken mathematical expressions is a challenging task: speech must be transcribed into a strictly structured symbolic representation while resolving the ambiguity inherent in how equations are pronounced. Although automatic speech recognition (ASR) and language models (LMs) have advanced significantly, converting spoken mathematics into LaTeX remains underexplored. The task applies directly to educational and research settings, such as lecture transcription and note-taking. Prior work, based on ASR post-correction, requires two transcriptions, focuses only on isolated equations, uses a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in English and Russian, drawn from diverse scientific domains. In addition to ASR post-correction models and few-shot prompting, we apply audio language models, which achieve comparable character error rate (CER) results on the MathSpeech benchmark for equation conversion (28% vs. 30%). In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 40 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We also establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.
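Since every result above is reported as a character error rate, a minimal sketch of the standard CER definition may help: the character-level Levenshtein distance between the predicted and reference LaTeX strings, divided by the reference length. The function names `edit_distance` and `cer` are illustrative assumptions, and the abstract does not specify the normalization used (for example, whitespace handling or how LaTeX formatting artifacts are treated), so this should not be read as the authors' evaluation code.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example: one wrong character in an 11-character reference.
    reference = r"\frac{a}{b}"
    hypothesis = r"\frac{a}{c}"
    print(f"CER = {cer(reference, hypothesis):.3f}")
```

Under this definition a CER can exceed 100% when the hypothesis is much longer than the reference, and two LaTeX strings that render identically (such as `x^2` and `x^{2}`) still count as different, which is one reason formatting artifacts matter when comparing models.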
