
Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

August 5, 2025
Authors: Dmitrii Korzh, Dmitrii Tarasov, Artyom Iudin, Elvir Karimov, Matvey Skripkin, Nikita Kuzmin, Andrey Kuznetsov, Oleg Y. Rogov, Ivan Oseledets
cs.AI

Abstract

Conversion of spoken mathematical expressions is a challenging task that involves transcribing speech into a strictly structured symbolic representation while addressing the ambiguity inherent in the pronunciation of equations. Although significant progress has been achieved in automatic speech recognition (ASR) and language models (LMs), the problem of converting spoken mathematics into LaTeX remains underexplored. This task has direct applications in educational and research settings, such as lecture transcription and note creation. Prior work, based on ASR post-correction, requires two transcriptions, focuses only on isolated equations, uses a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. In addition to ASR post-correction models and few-shot prompting, we apply audio language models, demonstrating comparable character error rate (CER) results on the MathSpeech benchmark (28% vs. 30%) for equation conversion. In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 40 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.
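
All results above are reported as character error rate (CER), i.e., character-level edit distance between the predicted LaTeX string and the reference, divided by the reference length. The sketch below is a minimal, illustrative implementation of that metric; the `cer` function and the example strings are not taken from the paper.

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate = Levenshtein distance / number of reference characters."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

if __name__ == "__main__":
    ref = r"\frac{a+b}{2}"
    hyp = r"\frac{a + b}{2}"   # extra spaces count as character-level errors
    print(f"CER = {cer(ref, hyp):.2%}")
```

Note that, as the example shows, purely cosmetic LaTeX differences (spacing, equivalent macros) inflate CER, which is why the abstract mentions accounting for LaTeX formatting artifacts when comparing systems.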
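
The ASR post-correction baseline mentioned above amounts to rewriting an ASR transcript of spoken mathematics into LaTeX with a language model, optionally guided by a few in-context examples. The snippet below is a hypothetical sketch of how such a few-shot prompt could be assembled; the instruction text, the `FEW_SHOT` pairs, and the `build_prompt` helper are illustrative assumptions, not the prompts used in the paper.

```python
# Hypothetical few-shot prompt construction for ASR post-correction to LaTeX.
FEW_SHOT = [
    ("x squared plus two x plus one equals zero", r"x^2 + 2x + 1 = 0"),
    ("the integral from zero to one of x d x", r"\int_0^1 x \, dx"),
]

def build_prompt(asr_transcript: str) -> str:
    """Assemble an instruction, in-context examples, and the new transcript."""
    parts = ["Convert the spoken mathematics into LaTeX."]
    for spoken, latex in FEW_SHOT:
        parts.append(f"Spoken: {spoken}\nLaTeX: {latex}")
    parts.append(f"Spoken: {asr_transcript}\nLaTeX:")
    return "\n\n".join(parts)

print(build_prompt("a over b equals c over d"))
```

The resulting prompt would then be passed to a language model of choice; audio language models, by contrast, consume the waveform directly and skip the intermediate transcript.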