SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

February 10, 2025
作者: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko
cs.AI

Abstract

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic data outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM also outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
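
To illustrate the kind of few-shot rewriting step the abstract describes, here is a minimal sketch using the Hugging Face transformers chat pipeline. The model name, system prompt, and exemplar pairs are illustrative assumptions, not the authors' exact configuration or prompts.

```python
# Minimal sketch of few-shot detoxification data generation with an open LLM.
# The model, prompt wording, and example pairs are illustrative assumptions,
# not the exact setup used in the paper.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder: any instruction-tuned open LLM
)

# Few-shot exemplars: (toxic sentence, detoxified rewrite) pairs.
FEW_SHOT_PAIRS = [
    ("This is complete garbage, you idiot.",
     "I strongly disagree with this."),
    ("Shut up, nobody cares about your stupid opinion.",
     "I don't find your argument convincing."),
]


def build_messages(toxic_sentence: str) -> list[dict]:
    """Assemble a chat prompt: instructions, exemplar pairs, then the new input."""
    messages = [{
        "role": "system",
        "content": (
            "Rewrite the given sentence so it is polite and non-toxic "
            "while preserving its meaning. Reply with the rewrite only."
        ),
    }]
    for toxic, neutral in FEW_SHOT_PAIRS:
        messages.append({"role": "user", "content": toxic})
        messages.append({"role": "assistant", "content": neutral})
    messages.append({"role": "user", "content": toxic_sentence})
    return messages


def detoxify(toxic_sentence: str) -> str:
    # The chat-format pipeline returns the full message list; the last
    # message is the model's generated rewrite.
    out = generator(build_messages(toxic_sentence), max_new_tokens=128)
    return out[0]["generated_text"][-1]["content"].strip()


if __name__ == "__main__":
    print(detoxify("Get lost, this plan of yours is moronic."))
```

In the paper's pipeline, rewrites like these would then be filtered for quality and combined across nine LLMs and four languages to assemble the parallel dataset; the filtering criteria are not part of this sketch.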

