Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Abstract

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved with rule-based or statistical methods that rely on lexical features such as punctuation. Although some recent works no longer rely exclusively on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model, Segment any Text (SaT), to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and remove spurious reliance on context far in the future. Finally, we introduce a variant of our model fine-tuned on a diverse, multilingual mixture of sentence-segmented data, which acts as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines, including strong LLMs, across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
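To make the robustness problem concrete, the following is a minimal sketch of the kind of punctuation-based rule the abstract describes as the common baseline. It is not the SaT model or the authors' code; the names `naive_split` and `strip_punctuation` are hypothetical and exist only for illustration. The sketch shows that a splitter keyed on terminal punctuation works on clean text but returns a single segment once punctuation and casing are removed.

```python
import re

# Hypothetical rule-based baseline (not SaT): a sentence boundary is any
# whitespace that directly follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> list[str]:
    """Split text into sentences using terminal punctuation only."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


def strip_punctuation(text: str) -> str:
    """Simulate poorly formatted input: drop '.', '!', '?' and lowercase."""
    return re.sub(r"[.!?]", "", text).lower()


clean = "This is a test. It has two sentences."
noisy = strip_punctuation(clean)

print(naive_split(clean))  # ['This is a test.', 'It has two sentences.']
print(naive_split(noisy))  # ['this is a test it has two sentences']
```

On the noisy input the rule finds no boundary at all, which is the failure mode that motivates a model with less reliance on punctuation.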
