Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation

Abstract

Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved with rule-based or statistical methods that rely on lexical features such as punctuation. Although some recent works no longer rely exclusively on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model, Segment any Text (SaT), to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines, including strong LLMs, across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
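
To make the robustness problem concrete, the following is a minimal sketch of a punctuation-based splitter of the kind the abstract refers to as rule-based. It is not the SaT method and not any existing tool's implementation; `naive_split` and the two sample strings are hypothetical, chosen only to show how such a rule behaves once punctuation and casing are missing.

```python
import re
from typing import List


def naive_split(text: str) -> List[str]:
    """Toy rule-based baseline (hypothetical): split wherever '.', '!' or '?'
    is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


# Well-formatted input: the punctuation rule finds every boundary.
clean = "This is one sentence. Here is another! Is this a third?"

# Same content without punctuation or casing: the rule has nothing to match.
noisy = "this is one sentence here is another is this a third"

print(naive_split(clean))  # three segments
print(naive_split(noisy))  # one segment: the whole input
```

On the unpunctuated input the rule returns the text as a single segment, which is the failure mode that criterion (i) above is concerned with.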
