MisSynth: Improving MISSCI Logical Fallacies Classification with Synthetic Data
October 30, 2025
Authors: Mykhailo Poliakov, Nadiya Shvai
cs.AI
Abstract
Health-related misinformation is widespread and potentially harmful, and it
is difficult to identify, especially when claims distort or misinterpret
scientific findings. We investigate the impact of synthetic data generation and
lightweight fine-tuning techniques on the ability of large language models
(LLMs) to recognize fallacious arguments using the MISSCI dataset and
framework. In this work, we propose MisSynth, a pipeline that applies
retrieval-augmented generation (RAG) to produce synthetic fallacy samples,
which are then used to fine-tune an LLM. Our results show substantial
accuracy gains with fine-tuned models compared to vanilla baselines. For
instance, the fine-tuned LLaMA 3.1 8B model achieved an absolute F1-score
improvement of over 35% on the MISSCI test split over its vanilla baseline. We
demonstrate that introducing synthetic fallacy data to augment limited
annotated resources can significantly enhance zero-shot LLM classification
performance on real-world scientific misinformation tasks, even with limited
computational resources. The code and synthetic dataset are available at
https://github.com/mxpoliakov/MisSynth.
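To make the described pipeline concrete, the following is a minimal, hypothetical sketch of a MisSynth-style generation step: retrieve supporting context for a claim, then prompt a generator once per fallacy class to produce synthetic fallacious premises for fine-tuning. The function names, the keyword-overlap retriever, and the fallacy label subset are all illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a RAG-based synthetic fallacy generation step.
# The retriever, label set, and generator interface are illustrative only.
from dataclasses import dataclass

# Illustrative subset of fallacy classes (the MISSCI taxonomy is larger).
FALLACY_CLASSES = [
    "Causal Oversimplification",
    "False Equivalence",
    "Hasty Generalization",
]

@dataclass
class FallacySample:
    claim: str
    fallacious_premise: str
    fallacy_class: str

def retrieve_context(claim: str, corpus: list[str], k: int = 2) -> list[str]:
    # Naive keyword-overlap scoring standing in for a real RAG retriever.
    claim_words = set(claim.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: -len(claim_words & set(doc.lower().split())))
    return scored[:k]

def generate_synthetic_samples(claim, corpus, llm_generate, k=2):
    # Produce one synthetic sample per fallacy class, conditioned on the
    # retrieved context. `llm_generate` is any callable wrapping an LLM.
    context = retrieve_context(claim, corpus, k)
    return [
        FallacySample(claim, llm_generate(claim, context, fallacy), fallacy)
        for fallacy in FALLACY_CLASSES
    ]
```

The resulting samples would then be formatted as instruction/response pairs and used for lightweight fine-tuning (e.g., a parameter-efficient method such as LoRA), though the exact fine-tuning setup here is an assumption rather than the paper's stated configuration.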