HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
July 23, 2025
Authors: Li Jun, Wang Jinpeng, Tan Chaolei, Lian Niu, Chen Long, Zhang Min, Wang Yaowei, Xia Shu-Tao, Chen Bin
cs.AI
Abstract
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of
matching untrimmed videos with text queries describing only partial content.
Existing methods suffer from geometric distortion in Euclidean space that
sometimes misrepresents the intrinsic hierarchical structure of videos and
overlooks certain hierarchical semantics, ultimately leading to suboptimal
temporal modeling. To address this issue, we propose the first hyperbolic
modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space
learning to compensate for the suboptimal hierarchical modeling capabilities of
Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block
and Euclidean Attention Block to encode video embeddings in hybrid spaces,
using the Mean-Guided Adaptive Interaction Module to dynamically fuse features.
Additionally, we introduce a Partial Order Preservation Loss to enforce "text <
video" hierarchy through Lorentzian cone constraints. This approach further
enhances cross-modal matching by reinforcing partial relevance between video
content and text queries. Extensive experiments show that HLFormer outperforms
state-of-the-art methods. Code is released at
https://github.com/lijun2005/ICCV25-HLFormer.
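The "text < video" partial order enforced by the Partial Order Preservation Loss can be illustrated with a Lorentzian entailment-cone penalty: a text embedding should fall inside the cone anchored at its video embedding on the hyperboloid. The sketch below is illustrative only and follows a MERU-style cone formulation, not necessarily the paper's exact loss; the curvature `c`, cone constant `K`, and all function names are assumptions.

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: -x0*y0 + <x_space, y_space>
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift_to_hyperboloid(v, c=1.0):
    # Map a Euclidean "space" vector v onto the hyperboloid
    # {x : <x, x>_L = -1/c} by solving for the time component x0.
    x0 = np.sqrt(1.0 / c + np.dot(v, v))
    return np.concatenate(([x0], v))

def half_aperture(x, c=1.0, K=0.1):
    # Half-aperture of the entailment cone anchored at x
    # (K is an assumed constant controlling cone width near the origin).
    space_norm = np.linalg.norm(x[1:])
    ratio = 2.0 * K / (np.sqrt(c) * space_norm)
    return np.arcsin(np.clip(ratio, -1 + 1e-7, 1 - 1e-7))

def exterior_angle(x, y, c=1.0):
    # Angle at x between the geodesic toward the origin and the
    # geodesic toward y; y lies inside x's cone when this angle
    # is smaller than the half-aperture.
    inner = c * lorentz_inner(x, y)
    num = y[0] + x[0] * inner
    denom = np.linalg.norm(x[1:]) * np.sqrt(np.clip(inner**2 - 1, 1e-7, None))
    return np.arccos(np.clip(num / denom, -1 + 1e-7, 1 - 1e-7))

def partial_order_loss(video, text, c=1.0, K=0.1):
    # Zero when the text embedding lies inside the video's cone,
    # otherwise penalize by how far it falls outside.
    return max(0.0, exterior_angle(video, text, c) - half_aperture(video, c, K))
```

For example, a text embedding lifted from a space vector pointing in the same direction as (and farther out than) its video incurs zero loss, while one pointing the opposite way is penalized, which is the "text < video" ordering the abstract describes.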