Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
June 24, 2024
作者: Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, Markus Schedl
cs.AI
Abstract
Segmenting text into sentences plays an early and crucial role in many NLP
systems. This is commonly achieved by using rule-based or statistical methods
relying on lexical features such as punctuation. Although some recent works no
longer exclusively rely on punctuation, we find that no prior method achieves
all of (i) robustness to missing punctuation, (ii) effective adaptability to
new domains, and (iii) high efficiency. We introduce a new model - Segment any
Text (SaT) - to solve this problem. To enhance robustness, we propose a new
pretraining scheme that ensures less reliance on punctuation. To address
adaptability, we introduce an extra stage of parameter-efficient fine-tuning,
establishing state-of-the-art performance in distinct domains such as verses
from lyrics and legal documents. Along the way, we introduce architectural
modifications that result in a threefold gain in speed over the previous state
of the art and solve spurious reliance on context far in the future. Finally,
we introduce a variant of our model with fine-tuning on a diverse, multilingual
mixture of sentence-segmented data, acting as a drop-in replacement and
enhancement for existing segmentation tools. Overall, our contributions provide
a universal approach for segmenting any text. Our method outperforms all
baselines - including strong LLMs - across 8 corpora spanning diverse domains
and languages, especially in practically relevant situations where text is
poorly formatted. Our models and code, including documentation, are available
at https://huggingface.co/segment-any-text under the MIT license.