HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
July 23, 2025
Authors: Li Jun, Wang Jinpeng, Tan Chaolei, Lian Niu, Chen Long, Zhang Min, Wang Yaowei, Xia Shu-Tao, Chen Bin
cs.AI
Abstract
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of
matching untrimmed videos with text queries describing only partial content.
Existing methods suffer from geometric distortion in Euclidean space that
sometimes misrepresents the intrinsic hierarchical structure of videos and
overlooks certain hierarchical semantics, ultimately leading to suboptimal
temporal modeling. To address this issue, we propose the first hyperbolic
modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space
learning to compensate for the suboptimal hierarchical modeling capabilities of
Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block
and Euclidean Attention Block to encode video embeddings in hybrid spaces,
using the Mean-Guided Adaptive Interaction Module to dynamically fuse features.
Additionally, we introduce a Partial Order Preservation Loss to enforce "text <
video" hierarchy through Lorentzian cone constraints. This approach further
enhances cross-modal matching by reinforcing partial relevance between video
content and text queries. Extensive experiments show that HLFormer outperforms
state-of-the-art methods. Code is released at
https://github.com/lijun2005/ICCV25-HLFormer.
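The "text < video" partial order enforced by the Partial Order Preservation Loss can be illustrated with a Lorentzian entailment-cone penalty: a text embedding should fall inside the cone anchored at its video embedding on the hyperboloid. The sketch below is illustrative only and follows a MERU-style cone formulation, not necessarily the paper's exact loss; the curvature `c`, cone constant `K`, and all function names are assumptions.

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: -x0*y0 + <x_space, y_space>
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lift_to_hyperboloid(v, c=1.0):
    # Map a Euclidean "space" vector v onto the hyperboloid
    # {x : <x, x>_L = -1/c} by solving for the time component x0.
    x0 = np.sqrt(1.0 / c + np.dot(v, v))
    return np.concatenate(([x0], v))

def half_aperture(x, c=1.0, K=0.1):
    # Half-aperture of the entailment cone anchored at x
    # (K is an assumed constant controlling cone width near the origin).
    space_norm = np.linalg.norm(x[1:])
    ratio = 2.0 * K / (np.sqrt(c) * space_norm)
    return np.arcsin(np.clip(ratio, -1 + 1e-7, 1 - 1e-7))

def exterior_angle(x, y, c=1.0):
    # Angle at x between the geodesic toward the origin and the
    # geodesic toward y; y lies inside x's cone when this angle
    # is smaller than the half-aperture.
    inner = c * lorentz_inner(x, y)
    num = y[0] + x[0] * inner
    denom = np.linalg.norm(x[1:]) * np.sqrt(np.clip(inner**2 - 1, 1e-7, None))
    return np.arccos(np.clip(num / denom, -1 + 1e-7, 1 - 1e-7))

def partial_order_loss(video, text, c=1.0, K=0.1):
    # Zero when the text embedding lies inside the video's cone,
    # otherwise penalize by how far it falls outside.
    return max(0.0, exterior_angle(video, text, c) - half_aperture(video, c, K))
```

For example, a text embedding lifted from a space vector pointing in the same direction as (and farther out than) its video incurs zero loss, while one pointing the opposite way is penalized, which is the "text < video" ordering the abstract describes.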