Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Abstract

Segmenting text into sentences is an early and crucial step in many NLP systems. It is commonly done with rule-based or statistical methods that rely on lexical features such as punctuation. Although some recent work no longer relies exclusively on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model, Segment any Text (SaT), to solve this problem. To enhance robustness, we propose a new pretraining scheme that reduces reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that yield a threefold gain in speed over the previous state of the art and remove a spurious reliance on context far in the future. Finally, we introduce a variant of our model fine-tuned on a diverse, multilingual mixture of sentence-segmented data, which acts as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines, including strong LLMs, across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.

