FineVision: Open Data Is All You Need

October 20, 2025
Authors: Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
cs.AI

Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples, the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
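
To make the de-duplication and decontamination step concrete, here is a minimal sketch. The abstract does not say how duplicates or benchmark overlaps are detected, so this sketch assumes the simplest possible criterion, an exact SHA-256 match on raw image bytes. The names `fingerprint` and `dedup_and_decontaminate` are hypothetical and are not taken from the FineVision tooling.

```python
import hashlib


def fingerprint(image_bytes: bytes) -> str:
    """Exact-content fingerprint (assumption: byte-identical images only)."""
    return hashlib.sha256(image_bytes).hexdigest()


def dedup_and_decontaminate(samples, benchmark_images):
    """Return ids of samples that are neither duplicates nor benchmark images.

    samples: iterable of (sample_id, image_bytes) pairs from all sources.
    benchmark_images: iterable of image_bytes from evaluation benchmarks.
    """
    # Fingerprints of benchmark images that must not appear in training data.
    blocked = {fingerprint(b) for b in benchmark_images}
    seen = set()  # fingerprints already kept, shared across all sources
    kept = []
    for sample_id, image_bytes in samples:
        fp = fingerprint(image_bytes)
        if fp in blocked or fp in seen:
            continue  # drop benchmark overlaps and repeated images
        seen.add(fp)
        kept.append(sample_id)
    return kept


# Hypothetical toy data: "b" repeats "a", and "c" matches a benchmark image.
toy_samples = [("a", b"x"), ("b", b"x"), ("c", b"y"), ("d", b"z")]
toy_benchmarks = [b"y"]
kept_ids = dedup_and_decontaminate(toy_samples, toy_benchmarks)
```

Exact hashing only catches byte-identical files; resized or re-encoded copies would pass through, so a real pipeline at this scale would more plausibly rely on some form of near-duplicate detection.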