German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German

Abstract

The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored to diverse reader groups. To this end, we introduce German4All, the first large-scale German dataset of aligned, readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.