Sapiens: Foundation for Human Vision Models
August 22, 2024
Authors: Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito
cs.AI
Abstract
We present Sapiens, a family of models for four fundamental human-centric
vision tasks - 2D pose estimation, body-part segmentation, depth estimation,
and surface normal prediction. Our models natively support 1K high-resolution
inference and are extremely easy to adapt for individual tasks by simply
fine-tuning models pretrained on over 300 million in-the-wild human images. We
observe that, given the same computational budget, self-supervised pretraining
on a curated dataset of human images significantly boosts the performance for a
diverse set of human-centric tasks. The resulting models exhibit remarkable
generalization to in-the-wild data, even when labeled data is scarce or
entirely synthetic. Our simple model design also brings scalability - model
performance across tasks improves as we scale the number of parameters from 0.3
to 2 billion. Sapiens consistently surpasses existing baselines across various
human-centric benchmarks. We achieve significant improvements over the prior
state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1
mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5%
relative angular error.
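The adaptation recipe the abstract describes, fine-tuning a high-resolution pretrained vision transformer with a lightweight task-specific head, can be sketched as follows. This is a minimal PyTorch illustration, not the paper's released code: the ViTEncoder stand-in, the SegmentationHead decoder, the checkpoint filename, and the 28-class body-part label space are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTEncoder(nn.Module):
    """Stand-in for a pretrained high-resolution ViT backbone (hypothetical)."""
    def __init__(self, img_size=1024, patch=16, dim=1024, depth=2, heads=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.grid = img_size // patch  # 64x64 token grid at 1K input
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        tokens = self.blocks(tokens + self.pos)
        # Fold tokens back into a spatial map so a dense head can decode them.
        return tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)

class SegmentationHead(nn.Module):
    """Lightweight per-pixel decoder; the 28-class label space is illustrative."""
    def __init__(self, dim=1024, num_classes=28, upscale=16):
        super().__init__()
        self.proj = nn.Conv2d(dim, num_classes, kernel_size=1)
        self.upscale = upscale

    def forward(self, feats):
        return F.interpolate(self.proj(feats), scale_factor=self.upscale,
                             mode="bilinear", align_corners=False)

encoder, head = ViTEncoder(), SegmentationHead()
# encoder.load_state_dict(torch.load("sapiens_pretrained.pth"))  # hypothetical path
opt = torch.optim.AdamW(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

images = torch.randn(1, 3, 1024, 1024)           # dummy 1K input crop
labels = torch.randint(0, 28, (1, 1024, 1024))   # dummy part-segmentation target
loss = F.cross_entropy(head(encoder(images)), labels)
loss.backward()
opt.step()
```

The same pattern would carry over to the other three tasks by swapping the head (keypoint heatmaps for pose, a single-channel depth output, a three-channel normal output) while reusing the pretrained encoder weights.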
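The headline numbers are relative improvements in standard metrics. Below is a minimal sketch of how such metrics are commonly computed, assuming the usual unmasked formulations; the paper's evaluation protocol may differ (e.g., restricting to human pixels), and the baseline values in the example are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

def mean_angular_error_deg(pred, gt):
    """Mean angle (degrees) between predicted and ground-truth surface
    normals, both shaped (..., 3); vectors are normalized to unit length."""
    cos = (F.normalize(pred, dim=-1) * F.normalize(gt, dim=-1)).sum(-1)
    return torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0))).mean()

def rmse(pred, gt):
    """Root-mean-square error, e.g., over valid depth pixels."""
    return torch.sqrt(((pred - gt) ** 2).mean())

# "22.4% relative RMSE" reads as an improvement ratio between two models'
# errors; the concrete values here are invented for illustration only.
baseline, ours = 1.000, 0.776
print((baseline - ours) / baseline)  # -> 0.224, i.e., 22.4% lower RMSE
```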