Sentence-wise Speech Summarization: Task, Datasets, and End-to-End Modeling with LM Knowledge Distillation

Abstract

This paper introduces sentence-wise speech summarization (Sen-SSum), a novel approach that generates text summaries from a spoken document sentence by sentence. Sen-SSum combines the real-time processing of automatic speech recognition (ASR) with the conciseness of speech summarization. To explore this approach, we present two datasets for Sen-SSum: Mega-SSum and CSJ-SSum. Using these datasets, we evaluate two types of Transformer-based models: 1) cascade models that combine ASR with strong text summarization models, and 2) end-to-end (E2E) models that directly convert speech into a text summary. Although E2E models are appealing for building compute-efficient systems, they perform worse than cascade models. We therefore propose knowledge distillation for E2E models using pseudo-summaries generated by the cascade models. Our experiments show that the proposed knowledge distillation effectively improves the performance of the E2E model on both datasets.
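
The data flow behind the proposed distillation can be illustrated with a minimal sketch: a cascade teacher (ASR followed by text summarization) labels each spoken sentence with a pseudo-summary, and the resulting pairs would serve as training targets for the E2E student. Everything named here (TrainingPair, build_pseudo_summary_pairs, the toy ASR and summarizer) is a hypothetical stand-in for illustration and is not taken from the paper, which may construct or combine its training targets differently.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TrainingPair:
    audio_id: str        # hypothetical identifier for one spoken sentence
    pseudo_summary: str  # teacher output used as the E2E training target


def build_pseudo_summary_pairs(
    audio_ids: Iterable[str],
    asr: Callable[[str], str],
    summarizer: Callable[[str], str],
) -> List[TrainingPair]:
    """Run the cascade teacher (ASR, then text summarization) on each sentence."""
    pairs = []
    for audio_id in audio_ids:
        transcript = asr(audio_id)  # step 1: speech -> transcript
        pairs.append(TrainingPair(audio_id, summarizer(transcript)))  # step 2
    return pairs


# Toy stand-ins so the sketch runs without any model (assumptions, not real systems).
TOY_TRANSCRIPTS = {
    "utt1": "um so today we will talk about speech summarization",
    "utt2": "well the results are basically quite good",
}
FILLERS = {"um", "so", "well", "basically"}


def toy_asr(audio_id: str) -> str:
    # Looks up a canned transcript instead of decoding audio.
    return TOY_TRANSCRIPTS[audio_id]


def toy_summarizer(transcript: str) -> str:
    # Drops filler words; a real teacher would be a trained summarization model.
    return " ".join(w for w in transcript.split() if w not in FILLERS)


pairs = build_pseudo_summary_pairs(["utt1", "utt2"], toy_asr, toy_summarizer)
for pair in pairs:
    print(pair.audio_id, "->", pair.pseudo_summary)
```

In an actual setup, the two callables would wrap a trained ASR model and a text summarization model, and the E2E model would then be trained on the audio paired with each pseudo-summary.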

