German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German
August 25, 2025
Authors: Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah, Maximilian Müller, Cristian-George Craciun, Georg Groh
cs.AI
Abstract
The ability to paraphrase texts across different complexity levels is
essential for creating accessible texts that can be tailored toward diverse
reader groups. Thus, we introduce German4All, the first large-scale German
dataset of aligned readability-controlled, paragraph-level paraphrases. It
spans five readability levels and comprises over 25,000 samples. The dataset is
automatically synthesized using GPT-4 and rigorously evaluated through both
human and LLM-based judgments. Using German4All, we train an open-source,
readability-controlled paraphrasing model that achieves state-of-the-art
performance in German text simplification, enabling more nuanced and
reader-specific adaptations. We open-source both the dataset and the model to
encourage further research on multi-level paraphrasing.