Follow Anything: Open-set detection, tracking, and following in real-time

Abstract

Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real time. Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model: it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, image, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage (https://github.com/alaamaalouf/FollowAnything). We also encourage the reader to watch our 5-minute explainer video (https://www.youtube.com/watch?v=6Mgt3EPytrw).
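
The query-matching step described in the abstract can be illustrated with a toy sketch. It is not taken from the FAn code base. It assumes, purely for illustration, that each image patch and the query are already represented by descriptor vectors, that matching is done by thresholded cosine similarity, and that the follower steers toward the centroid of the matched patches. The function names (`match_query`, `mask_centroid`) and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_query(query, patch_descriptors, threshold=0.8):
    """Return a boolean mask over a 2D grid of patch descriptors.

    A cell is True when its descriptor is at least `threshold`-similar
    to the query descriptor. The threshold is an assumed value.
    """
    return [
        [cosine_similarity(query, d) >= threshold for d in row]
        for row in patch_descriptors
    ]


def mask_centroid(mask):
    """Centroid (row, col) of the True cells, or None if nothing matched."""
    cells = [
        (r, c)
        for r, row in enumerate(mask)
        for c, hit in enumerate(row)
        if hit
    ]
    if not cells:
        return None  # object occluded or out of view in this frame
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


# Toy 2x3 grid of 2-D descriptors; real descriptors would be high-dimensional.
query = [1.0, 0.0]
grid = [
    [[0.0, 1.0], [0.9, 0.1], [1.0, 0.0]],
    [[0.0, 1.0], [0.1, 0.9], [0.8, 0.1]],
]
mask = match_query(query, grid)
centroid = mask_centroid(mask)
```

In this sketch a `None` centroid stands in for the occlusion case the abstract mentions: a follower would need some policy for those frames, such as holding position until the object re-emerges.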
