DataClaw0: Agentic Tailoring of Multimodal Data from Raw Streams

Abstract

Massive unstructured multimodal streams suffer from high "data entropy," impeding both efficient human knowledge acquisition and high-quality AI post-training. Existing passive annotation paradigms rely heavily on heuristic rules or general-purpose VLMs; they are costly and monotonous, and they fail to unlock the deep procedural logic embedded in raw data. We elevate data processing to a learnable capability and propose a paradigm shift towards Agentic Data Tailoring, which actively refines and structures data to align with diverse user and downstream intents. To overcome the data scarcity bottleneck in training such high-order capabilities, we design a two-stage pipeline that grounds generative semantic synthesis in deterministic Factual Anchors, yielding a large-scale dataset spanning five core physical and digital domains. Building on this dataset, the DataClaw_0-9B model combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to achieve robust alignment with complex refinement and tailoring intents. To systematically quantify this capability, we construct DataClaw_0-val, the first benchmark dedicated to data refinement. Crucially, we adopt downstream post-training as the ultimate validation touchstone. Evaluations on video generation, real-world VQA, and GUI navigation confirm that DataClaw_0 delivers tailored data with high information density, facilitating efficient model adaptation to new tasks under limited training data regimes. Project page: https://czjdsg.github.io/MakeAnyData