TF1-EN-3M: Three Million Synthetic Moral Fables for Training Small, Open Language Models

Abstract

Moral stories are a time-tested vehicle for transmitting values, yet modern NLP lacks a large, structured corpus that couples coherent narratives with explicit ethical lessons. We close this gap with TF1-EN-3M, the first open dataset of three million English-language fables generated exclusively by instruction-tuned models no larger than 8B parameters. Each story follows a six-slot scaffold (character -> trait -> setting -> conflict -> resolution -> moral), produced through a combinatorial prompt engine that guarantees genre fidelity while covering a broad thematic space. A hybrid evaluation pipeline blends (i) a GPT-based critic that scores grammar, creativity, moral clarity, and template adherence with (ii) reference-free diversity and readability metrics. Among ten open-weight candidates, an 8B-parameter Llama-3 variant delivers the best quality-speed trade-off, producing high-scoring fables on a single consumer GPU (<24 GB VRAM) at approximately 13.5 cents per 1,000 fables. We release the dataset, generation code, evaluation scripts, and full metadata under a permissive license, enabling exact reproducibility and cost benchmarking. TF1-EN-3M opens avenues for research in instruction following, narrative intelligence, value alignment, and child-friendly educational AI, demonstrating that large-scale moral storytelling no longer requires proprietary giant models.
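The combinatorial prompt engine behind the six-slot scaffold can be illustrated with a short sketch. This is not the TF1-EN-3M generator. The slot vocabularies, the prompt wording, and the helper names (`SLOTS`, `PROMPT_TEMPLATE`, `space_size`, `decode_index`, `sample_prompts`) are all hypothetical. The sketch shows two things. First, the number of possible prompts is the product of the slot-list sizes, so modest lists already span a large thematic space. Second, any combination can be recovered from a single integer by mixed-radix decoding, which lets distinct prompts be sampled without ever building the full Cartesian product.

```python
import itertools
import random

# Hypothetical slot vocabularies; the actual TF1-EN-3M lists are not given in the abstract.
SLOTS = {
    "character": ["a fox", "a tortoise", "an old owl"],
    "trait": ["greedy", "patient", "boastful"],
    "setting": ["a misty forest", "a busy market"],
    "conflict": ["a stolen harvest", "a broken promise"],
    "resolution": ["an honest apology", "shared work"],
    "moral": ["honesty builds trust", "patience is rewarded"],
}
# Scaffold order: character -> trait -> setting -> conflict -> resolution -> moral.
SLOT_ORDER = ["character", "trait", "setting", "conflict", "resolution", "moral"]

# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Write a short fable for children. "
    "Main character: {character}, who is {trait}. "
    "Setting: {setting}. Conflict: {conflict}. "
    "Resolution: {resolution}. "
    'End with the moral: "{moral}".'
)


def space_size(slots):
    """Number of distinct slot combinations: the product of the list lengths."""
    n = 1
    for name in SLOT_ORDER:
        n *= len(slots[name])
    return n


def all_combinations(slots):
    """Enumerate every combination explicitly (only practical for small lists)."""
    for values in itertools.product(*(slots[name] for name in SLOT_ORDER)):
        yield dict(zip(SLOT_ORDER, values))


def decode_index(idx, slots):
    """Map a flat index to one value per slot via mixed-radix decoding.

    The last slot is the least significant digit, which matches the order
    produced by itertools.product, so the full product is never materialized.
    """
    combo = {}
    for name in reversed(SLOT_ORDER):
        values = slots[name]
        idx, r = divmod(idx, len(values))
        combo[name] = values[r]
    return combo


def sample_prompts(slots, k, seed=0):
    """Render k prompts from k distinct combinations drawn uniformly at random."""
    total = space_size(slots)
    if k > total:
        raise ValueError(f"requested {k} prompts but only {total} combinations exist")
    rng = random.Random(seed)
    indices = rng.sample(range(total), k)  # distinct indices, no replacement
    return [PROMPT_TEMPLATE.format(**decode_index(i, slots)) for i in indices]


if __name__ == "__main__":
    for prompt in sample_prompts(SLOTS, 2, seed=42):
        print(prompt)
```

With these toy lists the space holds 3 × 3 × 2 × 2 × 2 × 2 = 144 combinations. Enlarging each list multiplies that count, which is how a combinatorial engine can keep millions of prompts distinct while every story still follows the same six-slot structure.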