MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
October 30, 2025
Authors: Mykhailo Poliakov, Nadiya Shvai
cs.AI
Abstract
Health-related misinformation is highly prevalent and potentially harmful. It
is especially difficult to identify when claims distort or misinterpret
scientific findings. We investigate the impact of synthetic data generation and
lightweight fine-tuning techniques on the ability of large language models
(LLMs) to recognize fallacious arguments using the MISSCI dataset and
framework. In this work, we propose MisSynth, a pipeline that applies
retrieval-augmented generation (RAG) to produce synthetic fallacy samples,
which are then used to fine-tune an LLM. Our results show substantial
accuracy gains with fine-tuned models compared to vanilla baselines. For
instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score
improvement of over 35% on the MISSCI test split over its vanilla baseline. We
demonstrate that introducing synthetic fallacy data to augment limited
annotated resources can significantly enhance zero-shot LLM classification
performance on real-world scientific misinformation tasks, even with limited
computational resources. The code and synthetic dataset are available at
https://github.com/mxpoliakov/MisSynth.
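
To make the RAG step of such a pipeline concrete, here is a minimal sketch of a retrieve-then-prompt loop for generating synthetic fallacy samples. All function names, the toy token-overlap retriever, and the example data are hypothetical illustrations, not the authors' actual implementation (which is in the linked repository); a real pipeline would use a proper retriever and an LLM to complete the prompt.

```python
def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank passages by token overlap with the query.

    A toy stand-in for a real retriever (e.g. dense embeddings or BM25).
    """
    q_tokens = set(query.lower().split())
    scored = sorted(
        passages,
        key=lambda p: -len(q_tokens & set(p.lower().split())),
    )
    return scored[:k]


def build_generation_prompt(claim: str, context: list[str], fallacy_class: str) -> str:
    """Assemble a prompt asking an LLM to write a fallacious argument.

    The completed prompt would be sent to an LLM; its output becomes a
    synthetic training sample labeled with `fallacy_class`.
    """
    ctx = "\n".join(f"- {c}" for c in context)
    return (
        f"Scientific context:\n{ctx}\n\n"
        f"Claim: {claim}\n"
        f"Write an argument for the claim that commits the fallacy: {fallacy_class}."
    )


# Toy corpus and usage (illustrative data only).
passages = [
    "Vitamin C supports immune function in cell cultures.",
    "The trial found no effect of vitamin C on cold duration.",
    "Unrelated note on solar panel efficiency.",
]
prompt = build_generation_prompt(
    "Vitamin C cures the common cold",
    retrieve("vitamin C common cold", passages),
    "Hasty Generalization",
)
```

Pairing each generated argument with its known fallacy class is what yields labeled synthetic data for the lightweight fine-tuning stage.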