Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

Abstract

Converting spoken mathematical expressions is a challenging task: speech must be transcribed into a strictly structured symbolic representation while resolving the ambiguity inherent in how equations are pronounced. Although automatic speech recognition (ASR) and language models (LMs) have advanced significantly, the problem of converting spoken mathematics into LaTeX remains underexplored. The task applies directly to educational and research settings, such as lecture transcription and note-taking. Prior work, which is based on ASR post-correction, requires two transcriptions, focuses only on isolated equations, uses a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. In addition to ASR post-correction models and few-shot prompting, we apply audio language models, which achieve a comparable character error rate (CER) on the MathSpeech benchmark for equation conversion (28% vs. 30%). In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 40 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We also establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.
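
The results above are reported as character error rate (CER). As a rough illustration of that metric, the following minimal sketch computes CER as the character-level Levenshtein distance divided by the reference length. The function names are illustrative, and the whitespace-stripping `normalize_latex` helper is only an assumed, crude stand-in for "accounting for LaTeX formatting artifacts"; it is not the normalization used by the authors.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions
    needed to turn `ref` into `hyp` (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r from the reference
                curr[j - 1] + 1,         # insert h into the reference
                prev[j - 1] + (r != h),  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def normalize_latex(s: str) -> str:
    """Hypothetical normalization (an assumption, not the paper's procedure):
    drop all whitespace so that spacing differences are not counted as errors.
    This is crude; it would also merge tokens such as '\\alpha x'."""
    return "".join(s.split())


def cer(ref: str, hyp: str, normalize: bool = False) -> float:
    """Character error rate: edit distance divided by reference length."""
    if normalize:
        ref, hyp = normalize_latex(ref), normalize_latex(hyp)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)
```

For example, `cer(r"\frac{a}{b}", r"\frac{a} {b}")` counts the extra space as one error over eleven reference characters, while the same call with `normalize=True` returns `0.0`. Because the denominator is the reference length, CER can exceed 100% when the hypothesis is much longer than the reference.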

