FineVision: Open Data Is All You Need

Abstract

The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples, the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
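
The de-duplication and decontamination steps can be pictured with a minimal sketch. The abstract does not say how matches are detected, so the code below assumes exact matching on SHA-256 hashes of raw image bytes as a simple stand-in; catching near-duplicates would require a perceptual or embedding-based similarity measure instead. The function names and the `id`/`image` sample layout are hypothetical and are not taken from the released curation tools.

```python
import hashlib
from typing import Dict, Iterable, List


def content_hash(image_bytes: bytes) -> str:
    """Return a hex SHA-256 digest of the raw image bytes (exact-match key)."""
    return hashlib.sha256(image_bytes).hexdigest()


def dedup_and_decontaminate(
    samples: Iterable[Dict],
    benchmark_images: Iterable[bytes],
) -> List[Dict]:
    """Keep the first occurrence of each image and drop benchmark matches.

    Assumption: each sample is a dict with an "image" field holding raw bytes.
    Only byte-identical images are treated as matches.
    """
    # Hashes of images that appear in evaluation benchmarks (must be excluded).
    blocked = {content_hash(b) for b in benchmark_images}
    seen = set()
    kept = []
    for sample in samples:
        h = content_hash(sample["image"])
        if h in blocked or h in seen:
            continue  # contaminated, or a duplicate of an earlier sample
        seen.add(h)
        kept.append(sample)
    return kept


# Toy usage: byte strings stand in for image files from two sources.
training_samples = [
    {"id": "src_a/0", "image": b"cat"},
    {"id": "src_a/1", "image": b"dog"},
    {"id": "src_b/0", "image": b"cat"},   # duplicate across sources
    {"id": "src_b/1", "image": b"test"},  # also present in a benchmark
]
clean = dedup_and_decontaminate(training_samples, benchmark_images=[b"test"])
```

Running de-duplication and decontamination in one pass over a shared hash set is a convenience of this sketch; a real pipeline could equally run them as separate stages.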