LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Abstract

3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. Existing approaches often rely on extensive labeled data or struggle with complex language queries. We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder uses an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method requires no labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website: https://chat-with-nerf.github.io/
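
The three stages described above (decompose the query, propose candidate objects, then reason about spatial relations) can be illustrated with a toy sketch. This is not the authors' implementation: `decompose`, `propose`, `ground`, and the `Candidate` type are hypothetical names, the LLM is replaced by a trivial string parser, the visual grounder is replaced by a label lookup, and "near" is approximated by Euclidean distance between box centers.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Candidate:
    """Hypothetical object proposal: a label and a 3D box center in meters."""
    label: str
    center: Tuple[float, float, float]


def decompose(query: str) -> Dict[str, str]:
    """Stand-in for the LLM step (assumption): split a query into parts.

    A real system would prompt an LLM; this toy parser only handles
    queries of the form '<target> near <landmark>'.
    """
    target, _, landmark = query.partition(" near ")
    return {"target": target.strip(), "landmark": landmark.strip()}


def propose(scene: List[Candidate], label: str) -> List[Candidate]:
    """Stand-in for the visual grounder (assumption): match by label."""
    return [c for c in scene if c.label == label]


def ground(scene: List[Candidate], query: str) -> Optional[Candidate]:
    """Return the target candidate closest to any landmark candidate."""
    parts = decompose(query)
    targets = propose(scene, parts["target"])
    landmarks = propose(scene, parts["landmark"])
    if not targets:
        return None  # nothing in the scene matches the target phrase
    if not landmarks:
        return targets[0]  # no landmark to reason about; fall back
    return min(
        targets,
        key=lambda t: min(math.dist(t.center, lm.center) for lm in landmarks),
    )


# Two chairs and one window; the second chair sits next to the window.
scene = [
    Candidate("chair", (0.0, 0.0, 0.0)),
    Candidate("chair", (4.0, 1.0, 0.0)),
    Candidate("window", (5.0, 1.0, 1.0)),
]
result = ground(scene, "chair near window")
```

In this sketch the distance check plays the role that the abstract assigns to the LLM's spatial and commonsense evaluation; it only shows where such a decision would sit in the pipeline.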
