German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
August 25, 2025
Authors: Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah, Maximilian Müller, Cristian-George Craciun, Georg Groh
cs.AI
Abstract
The ability to paraphrase texts across different complexity levels is
essential for creating accessible texts that can be tailored toward diverse
reader groups. Thus, we introduce German4All, the first large-scale German
dataset of aligned readability-controlled, paragraph-level paraphrases. It
spans five readability levels and comprises over 25,000 samples. The dataset is
automatically synthesized using GPT-4 and rigorously evaluated through both
human and LLM-based judgments. Using German4All, we train an open-source,
readability-controlled paraphrasing model that achieves state-of-the-art
performance in German text simplification, enabling more nuanced and
reader-specific adaptations. We open-source both the dataset and the model to
encourage further research on multi-level paraphrasing.