FineVision: Open Data Is All You Need
October 20, 2025
Authors: Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
cs.AI
Abstract
The advancement of vision-language models (VLMs) is hampered by a fragmented
landscape of inconsistent and contaminated public datasets. We introduce
FineVision, a meticulously collected, curated, and unified corpus of 24 million
samples, the largest open resource of its kind. We unify more than 200 sources
into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation
performs bulk ingestion and schema mapping, while reviewers audit mappings and
spot-check outputs to verify faithful consumption of annotations, appropriate
formatting and diversity, and safety; issues trigger targeted fixes and
re-runs. The workflow further applies rigorous de-duplication within and across
sources and decontamination against 66 public benchmarks. FineVision also
encompasses agentic/GUI tasks with a unified action space; reviewers validate
schemas and inspect a sample of trajectories to confirm executable fidelity.
Models trained on FineVision consistently outperform those trained on existing
open mixtures across a broad evaluation suite, underscoring the benefits of
scale, data hygiene, and balanced automation with human oversight. We release
the corpus and curation tools to accelerate data-centric VLM research.
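
The de-duplication and decontamination steps described above can be illustrated with a minimal sketch. The abstract does not say how duplicates are detected, so this example assumes exact content hashing as a stand-in for the actual similarity method; the function names, the sample layout, and the toy data are all hypothetical.

```python
import hashlib


def fingerprint(image_bytes: bytes) -> str:
    """Exact-content fingerprint (assumption: a stand-in for a real similarity method)."""
    return hashlib.sha256(image_bytes).hexdigest()


def dedup_and_decontaminate(samples, benchmark_images):
    """Keep the first copy of each image and drop any image that appears in a benchmark."""
    blocked = {fingerprint(b) for b in benchmark_images}
    seen = set()
    kept = []
    for sample in samples:
        fp = fingerprint(sample["image"])
        # Skip benchmark images (decontamination) and repeats (de-duplication).
        if fp in blocked or fp in seen:
            continue
        seen.add(fp)
        kept.append(sample)
    return kept


# Hypothetical toy data: two sources with one repeated image and one benchmark leak.
samples = [
    {"source": "a", "image": b"img-1"},
    {"source": "a", "image": b"img-1"},  # duplicate within a source
    {"source": "b", "image": b"img-1"},  # duplicate across sources
    {"source": "b", "image": b"img-2"},
    {"source": "b", "image": b"img-3"},  # also present in a benchmark
]
benchmark_images = [b"img-3"]

clean = dedup_and_decontaminate(samples, benchmark_images)
print([(s["source"], s["image"]) for s in clean])
```

Exact hashing only catches byte-identical copies. Catching resized or re-encoded images would need a perceptual or embedding-based comparison, which this sketch does not attempt.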