Sapiens2

Abstract

We present Sapiens2, a family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture both low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited to a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state of the art, improving over the first generation on pose estimation (+4 mAP), body-part segmentation (+24.3 mIoU), and normal estimation (45.6% lower angular error), and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2
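
To make the motivation for windowed attention at 4K concrete, the following is a minimal sketch of generic non-overlapping window partitioning and the resulting attention cost. It is not taken from the Sapiens2 code: the function names `partition_windows` and `attention_pairs`, the patch size of 16, and the window size of 16 tokens are all hypothetical choices made for illustration, since the abstract does not state them.

```python
def partition_windows(grid, window):
    """Split an H x W token grid (a list of rows) into non-overlapping
    window x window tiles, returned in row-major tile order."""
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([grid[r][left:left + window]
                          for r in range(top, top + window)])
    return tiles


def attention_pairs(height, width, window=None):
    """Count query-key pairs for global attention (window=None) or for
    attention restricted to non-overlapping window x window tiles."""
    tokens = height * width
    if window is None:
        return tokens * tokens  # every token attends to every token
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    return tokens * window * window  # every token attends within its tile only


# Hypothetical setting: a 4096 x 4096 image with 16 x 16 pixel patches
# gives a 256 x 256 token grid. Both numbers are assumptions.
side = 4096 // 16
global_cost = attention_pairs(side, side)
windowed_cost = attention_pairs(side, side, window=16)
print(global_cost // windowed_cost)  # reduction factor in query-key pairs
```

Under these assumed sizes, restricting attention to tiles lowers the number of query-key pairs from quadratic in the token count to linear in it, which is the general reason windowed schemes are used at very high resolutions. How Sapiens2 combines windows with its hierarchical design is not specified in the abstract.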