CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis
November 11, 2025
Authors: Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu
cs.AI
Abstract
Sentiments about the reproducibility of cited papers expressed in downstream literature offer community perspectives and have been shown to be a promising signal of the actual reproducibility of published findings. To train models that effectively predict reproducibility-oriented sentiments and to systematically study their correlation with reproducibility, we introduce the CC30k dataset, comprising 30,734 citation contexts from machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper's perceived reproducibility or replicability. Of these, 25,829 were labeled through crowdsourcing, supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on reproducibility-oriented sentiments, addressing a resource gap in computational reproducibility research. The dataset was created through a pipeline that includes robust data cleansing, careful crowd selection, and thorough validation, achieving a labeling accuracy of 94%. We then demonstrate that the performance of three large language models on reproducibility-oriented sentiment classification improves significantly after fine-tuning on our dataset. The dataset lays the foundation for large-scale assessments of the reproducibility of machine learning papers. The CC30k dataset and the Jupyter notebooks used to produce and analyze it are publicly available at https://github.com/lamps-lab/CC30k.
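To make the labeling scheme concrete, the sketch below shows how citation contexts carrying the three reproducibility-oriented sentiment labels might be represented and summarized. The example rows and the field names (`citation_context`, `label`) are hypothetical illustrations, not the actual CC30k file layout; consult the repository for the real schema.

```python
# Illustrative sketch of the CC30k labeling scheme: each citation context
# carries one of three reproducibility-oriented sentiment labels.
# The rows and field names here are hypothetical, not the actual CC30k schema.
from collections import Counter

LABELS = ("Positive", "Negative", "Neutral")

# Hypothetical citation contexts and their sentiment labels.
sample = [
    {"citation_context": "We successfully reproduced the results of [12] "
                         "using the released code.",
     "label": "Positive"},
    {"citation_context": "Despite following the described setup, we could "
                         "not replicate the accuracy reported in [7].",
     "label": "Negative"},
    {"citation_context": "Method [3] applies attention to citation graphs.",
     "label": "Neutral"},
]

def label_distribution(rows):
    """Count how many citation contexts carry each sentiment label."""
    counts = Counter(row["label"] for row in rows)
    return {label: counts.get(label, 0) for label in LABELS}

print(label_distribution(sample))  # {'Positive': 1, 'Negative': 1, 'Neutral': 1}
```

A distribution check like this is the kind of step a class-imbalance analysis would start from; the paper's controlled generation of Negative examples responds to exactly this scarcity in naturally occurring citation contexts.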