cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning

Abstract

Computer-Aided Design (CAD) plays a central role in engineering and manufacturing, making it possible to create precise and editable 3D models. Using a variety of sensor or user-provided data as inputs for CAD reconstruction can democratize access to design applications. However, existing methods typically focus on a single input modality, such as point clouds, images, or text, which limits their generalizability and robustness. Leveraging recent advances in vision-language models (VLMs), we propose a multi-modal CAD reconstruction model that simultaneously processes all three input modalities. Inspired by large language model (LLM) training paradigms, we adopt a two-stage pipeline: supervised fine-tuning (SFT) on large-scale procedurally generated data, followed by reinforcement learning (RL) fine-tuning using online feedback obtained programmatically. Furthermore, we are the first to explore RL fine-tuning of LLMs for CAD tasks, demonstrating that online RL algorithms such as Group Relative Policy Optimization (GRPO) outperform offline alternatives. On the DeepCAD benchmark, our SFT model outperforms existing single-modal approaches in all three input modalities simultaneously. More importantly, after RL fine-tuning, cadrille sets a new state of the art on three challenging datasets, including a real-world one.
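
As a rough illustration of the online RL stage, the core idea behind GRPO-style methods is to score several sampled outputs for the same input and normalize each reward against its group. The sketch below shows only that normalization step. The function name, the `eps` constant, and the reward values are hypothetical; the abstract does not specify how cadrille computes its rewards, so this should not be read as the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples drawn for the same input.

    Each advantage is the reward's distance from the group mean, scaled by
    the group standard deviation. eps (an assumed constant) avoids division
    by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four CAD programs sampled for a single input,
# e.g. scores returned by some programmatic check of the reconstruction.
rewards = [0.9, 0.4, 0.0, 0.7]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged. Because the baseline comes from the group itself, no separate value model is needed for this step.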