Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

Abstract

Converting spoken mathematical expressions is a challenging task: speech must be transcribed into a strictly structured symbolic representation while resolving the ambiguity inherent in how equations are pronounced. Although automatic speech recognition (ASR) and language models (LMs) have advanced significantly, the problem of converting spoken mathematics into LaTeX remains underexplored. The task applies directly to educational and research settings, such as lecture transcription and note creation. Prior work, which is based on ASR post-correction, requires two transcriptions, focuses only on isolated equations, uses a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. In addition to ASR post-correction models and few-shot prompting, we apply audio language models, which achieve a comparable character error rate (CER) on the MathSpeech benchmark for equation conversion (28% vs. 30%). In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 40 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We also establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.
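Since every result above is reported as a character error rate, a minimal sketch of how such a metric is commonly computed may help. It assumes the standard definition (Levenshtein edit distance divided by the reference length) and applies no LaTeX normalization. The exact preprocessing used for the reported numbers is not stated here, and the function names `levenshtein` and `cer` are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    # Keep the shorter string in the inner loop so the row stays small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One wrong symbol in a fraction.
    print(cer(r"\frac{a}{b}", r"\frac{a}{c}"))
    # Same math, different LaTeX spelling: braces alone raise the CER.
    print(cer(r"x^2", r"x^{2}"))
```

The second call illustrates why LaTeX formatting artifacts matter for this metric: `x^2` and `x^{2}` render identically, yet an unnormalized character-level comparison counts the two added braces as errors.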