TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

April 29, 2025
Authors: Mihai Nadas, Laura Diosan, Andrei Piscoran, Andreea Tomescu
cs.AI

Abstract

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
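
The six-slot scaffold described in the abstract can be illustrated with a minimal sketch of a combinatorial prompt builder. This is not the authors' released generation code: the slot values, the `TEMPLATE` wording, and the function names `all_prompts` and `sample_prompts` are hypothetical, chosen only to show how a small set of values per slot multiplies into a large prompt space.

```python
import itertools
import random

# Hypothetical slot values for illustration only; the actual lists used for
# TF1-EN-3M are part of the authors' released generation code.
SLOTS = {
    "character": ["a patient tortoise", "a boastful crow"],
    "trait": ["humble", "greedy"],
    "setting": ["a dry meadow", "a crowded riverbank"],
    "conflict": ["a dispute over the last water", "a race nobody wants to lose"],
    "resolution": ["sharing solves the problem", "an honest apology ends the quarrel"],
    "moral": ["Generosity outlasts scarcity.", "Pride blinds the proud."],
}

# Hypothetical prompt wording; the slot order follows the abstract's scaffold.
TEMPLATE = (
    "Write a short fable. Character: {character}. Trait: {trait}. "
    "Setting: {setting}. Conflict: {conflict}. Resolution: {resolution}. "
    "Moral: {moral}"
)


def all_prompts(slots=SLOTS):
    """Yield one prompt for every combination of slot values."""
    keys = list(slots)
    for combo in itertools.product(*(slots[k] for k in keys)):
        yield TEMPLATE.format(**dict(zip(keys, combo)))


def sample_prompts(n, seed=0, slots=SLOTS):
    """Draw n prompts by picking each slot independently, reproducibly."""
    rng = random.Random(seed)
    return [
        TEMPLATE.format(**{k: rng.choice(v) for k, v in slots.items()})
        for _ in range(n)
    ]
```

With two values in each of the six slots, `all_prompts()` already enumerates 2^6 = 64 distinct prompts; larger slot lists grow the space multiplicatively, which is why sampling with a fixed seed is a practical way to draw a reproducible subset.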
