
Sapiens: Foundation for Human Vision Models

August 22, 2024
Authors: Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito
cs.AI

Abstract

We present Sapiens, a family of models for four fundamental human-centric vision tasks - 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Our models natively support 1K high-resolution inference and are extremely easy to adapt for individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. We observe that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts the performance for a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. Our simple model design also brings scalability - model performance across tasks improves as we scale the number of parameters from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks. We achieve significant improvements over the prior state-of-the-art on Humans-5K (pose) by 7.6 mAP, Humans-2K (part-seg) by 17.1 mIoU, Hi4D (depth) by 22.4% relative RMSE, and THuman2 (normal) by 53.5% relative angular error.
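
The recipe the abstract describes, a large pretrained vision backbone adapted to each dense task by fine-tuning with a task-specific head, can be illustrated with a minimal sketch. Everything below is illustrative and not the released Sapiens API: `ToyEncoder` is a stand-in for the pretrained ViT backbone, and the head design, class count, and hyperparameters are assumptions.

```python
# Minimal sketch of the pretrain-then-fine-tune recipe from the abstract.
# All names are illustrative stand-ins, not the released Sapiens API; in
# practice the encoder would be initialized from the pretrained checkpoint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for the pretrained ViT backbone (patchify only, no attention)."""
    def __init__(self, embed_dim=1024, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        f = self.patchify(x)                 # (B, C, H/p, W/p)
        return f.flatten(2).transpose(1, 2)  # (B, N, C) patch tokens

class SegmentationHead(nn.Module):
    """Lightweight decoder: patch tokens -> per-pixel class logits."""
    def __init__(self, embed_dim, num_classes, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(embed_dim, num_classes, kernel_size=1)
        self.patch = patch

    def forward(self, tokens, hw):
        b, n, c = tokens.shape
        h, w = hw[0] // self.patch, hw[1] // self.patch
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        # Upsample coarse logits back to the input resolution
        return F.interpolate(self.proj(feat), size=hw,
                             mode="bilinear", align_corners=False)

encoder = ToyEncoder()                       # would load pretrained weights here
head = SegmentationHead(embed_dim=1024, num_classes=20)  # class count assumed
opt = torch.optim.AdamW(
    list(encoder.parameters()) + list(head.parameters()), lr=1e-4)

x = torch.randn(1, 3, 1024, 1024)            # native 1K-resolution input
y = torch.randint(0, 20, (1, 1024, 1024))    # dummy part-segmentation labels

logits = head(encoder(x), hw=(1024, 1024))
loss = F.cross_entropy(logits, y)
loss.backward()
opt.step()
```

Swapping only the head (keypoint heatmaps for pose, a single regression channel for depth, three channels for surface normals) while reusing the shared pretrained encoder is what makes adapting the family to individual tasks straightforward, as the abstract claims.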

