CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage

Abstract

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free viewpoint curator for equirectangular projection (ERP) panoramas. COVER projects the geometry observed from already-selected views into candidate ERP probes, scores each candidate by its incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and calibrated pose; COVER-produced indoor frames additionally include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
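The greedy selection described above (score incremental coverage, penalize depth conflicts, log each step) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `greedy_curate`, the per-candidate `conflict` values, the weight `lam`, and the stopping rule are all assumptions made for illustration. In particular, COVER derives conflicts by warping range depth from selected views into each candidate probe, whereas this toy version takes a precomputed penalty per candidate.

```python
from typing import Dict, List, Set


def greedy_curate(visible: Dict[str, Set[int]],
                  conflict: Dict[str, float],
                  budget: int,
                  lam: float = 1.0) -> List[dict]:
    """Toy greedy viewpoint selection (hypothetical, for illustration only).

    visible[c]  : ids of surface elements seen from candidate view c
    conflict[c] : assumed precomputed depth-conflict penalty for c
    budget      : maximum number of views to select
    lam         : assumed weight trading coverage gain against conflicts
    """
    covered: Set[int] = set()
    remaining = set(visible)
    log: List[dict] = []  # per-step provenance record

    while remaining and len(log) < budget:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining),
                   key=lambda c: len(visible[c] - covered) - lam * conflict[c])
        gain = len(visible[best] - covered)
        score = gain - lam * conflict[best]
        if score <= 0:
            break  # no candidate adds net value; stop early
        covered |= visible[best]
        remaining.remove(best)
        log.append({"step": len(log), "view": best, "gain": gain,
                    "score": score, "covered": len(covered)})
    return log


# Usage: view "d" sees everything but carries a large conflict penalty,
# so the greedy loop prefers the two conflict-free views "a" and "c".
visible = {"a": {0, 1, 2, 3}, "b": {2, 3, 4, 5},
           "c": {4, 5, 6}, "d": {0, 1, 2, 3, 4, 5, 6}}
conflict = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 5.0}
for entry in greedy_curate(visible, conflict, budget=4):
    print(entry)
```

The returned list plays the role of a provenance log: each entry records which view was chosen at which step and how much new coverage it contributed, so a selection can be audited after the fact.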