

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

September 21, 2023
Authors: Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai
cs.AI

Abstract

3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment. While existing approaches often rely on extensive labeled data or exhibit limitations in handling complex language queries, we propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline. LLM-Grounder utilizes an LLM to decompose complex natural language queries into semantic constituents and employs a visual grounding tool, such as OpenScene or LERF, to identify objects in a 3D scene. The LLM then evaluates the spatial and commonsense relations among the proposed objects to make a final grounding decision. Our method does not require any labeled training data and can generalize to novel 3D scenes and arbitrary text queries. We evaluate LLM-Grounder on the ScanRefer benchmark and demonstrate state-of-the-art zero-shot grounding accuracy. Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries, making LLM-Grounder an effective approach for 3D vision-language tasks in robotics. Videos and interactive demos can be found on the project website https://chat-with-nerf.github.io/ .
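To make the described pipeline concrete, below is a minimal Python sketch of the agent loop outlined in the abstract: decompose the query with an LLM, ground each noun phrase with an open-vocabulary 3D tool, then let the LLM arbitrate among the candidates. The function names, the mock scene data, and the distance heuristic standing in for the LLM's spatial/commonsense reasoning are all illustrative assumptions, not the authors' implementation; in the real system the two LLM steps would call a large language model and `ground_phrase` would call OpenScene or LERF.

```python
# Hypothetical sketch of the LLM-Grounder agent loop (not the authors' code).
from dataclasses import dataclass
import math


@dataclass
class Candidate:
    label: str
    center: tuple   # (x, y, z) centroid in scene coordinates
    score: float    # grounding-tool confidence for the noun phrase


def decompose_query(query: str) -> dict:
    # LLM step 1 (mocked): split the query into a target phrase and landmark phrases,
    # e.g. "the chair next to the window" -> target "chair", landmark "window".
    return {"target": "chair", "landmarks": ["window"]}


def ground_phrase(phrase: str) -> list:
    # Grounding-tool step (mocked): an open-vocabulary tool such as OpenScene or LERF
    # would return candidate 3D objects matching a single noun phrase.
    mock_scene = {
        "chair": [Candidate("chair", (1.0, 2.0, 0.0), 0.8),
                  Candidate("chair", (4.0, 0.5, 0.0), 0.7)],
        "window": [Candidate("window", (1.2, 2.5, 1.0), 0.9)],
    }
    return mock_scene.get(phrase, [])


def select_target(targets: list, landmarks: list) -> Candidate:
    # LLM step 2 (here replaced by a simple nearest-landmark heuristic): pick the
    # target candidate whose spatial relation to the landmarks best fits the query.
    def dist(a: Candidate, b: Candidate) -> float:
        return math.dist(a.center, b.center)
    return min(targets, key=lambda t: min(dist(t, l) for l in landmarks))


def llm_grounder(query: str) -> Candidate:
    # End-to-end zero-shot grounding: decompose, ground each constituent, arbitrate.
    parts = decompose_query(query)
    targets = ground_phrase(parts["target"])
    landmarks = [c for p in parts["landmarks"] for c in ground_phrase(p)]
    return select_target(targets, landmarks)


if __name__ == "__main__":
    print(llm_grounder("the chair next to the window"))
```

The key design point the sketch mirrors is that the LLM never sees raw point clouds: it only plans (query decomposition) and reasons over the structured proposals returned by the grounding tool, which is what lets the approach stay zero-shot and open-vocabulary.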