CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage

Abstract

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free viewpoint curator for equirectangular projection (ERP) views. COVER projects geometry observed from already-selected views into candidate ERP probes, scores the incremental coverage each candidate would add, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and a calibrated pose; COVER-produced indoor frames additionally include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
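
To make the selection loop described above concrete, the following is a minimal sketch of greedy, coverage-driven view selection with a conflict penalty and a per-step log. It is not the authors' implementation. The function name `greedy_curate`, the representation of visibility as sets of surface-element IDs, the pairwise `conflicts` table, and the linear `penalty` weight are all illustrative assumptions; the ERP warping that would produce these inputs in COVER is not modeled here.

```python
from typing import Dict, List, Set, Tuple


def greedy_curate(
    visible: Dict[str, Set[int]],
    conflicts: Dict[Tuple[str, str], int],
    budget: int,
    penalty: float = 1.0,
) -> Tuple[List[str], List[dict]]:
    """Greedily pick views by incremental coverage minus a conflict penalty.

    visible:   hypothetical map from candidate view ID to the surface
               elements it observes (a stand-in for warped ERP coverage).
    conflicts: hypothetical count of depth-inconsistent elements per
               unordered pair of views.
    """
    selected: List[str] = []
    covered: Set[int] = set()
    log: List[dict] = []  # one entry per selection step

    while len(selected) < budget:
        best, best_score, best_gain = None, 0.0, 0
        for cand in sorted(visible):  # sorted for deterministic tie-breaking
            if cand in selected:
                continue
            # Incremental coverage: elements not yet seen by any selected view.
            gain = len(visible[cand] - covered)
            # Total conflicts between this candidate and the selected views.
            clash = sum(
                conflicts.get((cand, s), conflicts.get((s, cand), 0))
                for s in selected
            )
            score = gain - penalty * clash
            if score > best_score:
                best, best_score, best_gain = cand, score, gain
        if best is None:  # no candidate has a positive score; stop early
            break
        selected.append(best)
        covered |= visible[best]
        log.append({
            "step": len(selected),
            "view": best,
            "gain": best_gain,
            "score": best_score,
            "covered": len(covered),
        })
    return selected, log


# Toy example: view "b" overlaps "a" and conflicts with it on two elements.
visible = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 2}}
conflicts = {("a", "b"): 2}
selected, log = greedy_curate(visible, conflicts, budget=3)
print(selected)  # ['a', 'c']
```

In this toy run the loop stops after two views even though the budget is three: all six elements are covered, and the remaining candidates add no coverage or only add conflicts. The per-step log records which view was chosen and why, which is the kind of information a provenance trail would need.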