HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
July 23, 2025
Authors: Jun Li, Jinpeng Wang, Chaolei Tan, Niu Lian, Long Chen, Min Zhang, Yaowei Wang, Shu-Tao Xia, Bin Chen
cs.AI
Abstract
Partially Relevant Video Retrieval (PRVR) addresses the critical challenge of
matching untrimmed videos with text queries describing only partial content.
Existing methods suffer from geometric distortion in Euclidean space, which
sometimes misrepresents the intrinsic hierarchical structure of videos and
obscures certain hierarchical semantics, ultimately leading to suboptimal
temporal modeling. To address this issue, we propose the first hyperbolic
modeling framework for PRVR, namely HLFormer, which leverages hyperbolic space
learning to compensate for the suboptimal hierarchical modeling capabilities of
Euclidean space. Specifically, HLFormer integrates the Lorentz Attention Block
and Euclidean Attention Block to encode video embeddings in hybrid spaces,
using the Mean-Guided Adaptive Interaction Module to dynamically fuse features.
Additionally, we introduce a Partial Order Preservation Loss to enforce the
"text < video" hierarchy through Lorentzian cone constraints. This approach further
enhances cross-modal matching by reinforcing partial relevance between video
content and text queries. Extensive experiments show that HLFormer outperforms
state-of-the-art methods. Code is released at
https://github.com/lijun2005/ICCV25-HLFormer.
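
As a concrete illustration of the Lorentzian cone constraint described above, the following is a minimal PyTorch sketch of a partial-order penalty that keeps each text embedding inside the entailment cone rooted at its paired video embedding. It follows the standard entailment-cone formulation from hyperbolic representation learning; the function names (`half_aperture`, `oxy_angle`, `partial_order_loss`), the curvature, the margin, and the choice to root the cone at the video embedding are illustrative assumptions, not HLFormer's exact implementation.

```python
import torch

EPS = 1e-8

def lorentz_time(x_space: torch.Tensor, curv: float) -> torch.Tensor:
    # Time coordinate of a point on the hyperboloid <x, x>_L = -1/curv,
    # recovered from its space components (only these are stored).
    return torch.sqrt(1.0 / curv + (x_space ** 2).sum(dim=-1))

def half_aperture(x_space: torch.Tensor, curv: float,
                  min_radius: float = 0.1) -> torch.Tensor:
    # Half-aperture of the entailment cone rooted at x; cones are wider
    # near the origin and narrow as points move outward.
    sin_ap = 2.0 * min_radius / (curv ** 0.5 * x_space.norm(dim=-1) + EPS)
    return torch.asin(sin_ap.clamp(max=1.0 - EPS))

def oxy_angle(x_space: torch.Tensor, y_space: torch.Tensor,
              curv: float) -> torch.Tensor:
    # Exterior angle at x between the origin->x axis and the geodesic x->y.
    x_t = lorentz_time(x_space, curv)
    y_t = lorentz_time(y_space, curv)
    # Curvature-scaled Lorentzian inner product <x, y>_L.
    c_xyl = curv * ((x_space * y_space).sum(dim=-1) - x_t * y_t)
    num = y_t + c_xyl * x_t
    den = x_space.norm(dim=-1) * torch.sqrt((c_xyl ** 2 - 1.0).clamp(min=EPS))
    return torch.acos((num / (den + EPS)).clamp(-1.0 + EPS, 1.0 - EPS))

def partial_order_loss(video_space: torch.Tensor, text_space: torch.Tensor,
                       curv: float = 1.0, margin: float = 0.0) -> torch.Tensor:
    # Penalize text embeddings falling outside their paired video's cone,
    # enforcing the "text < video" partial order.
    angle = oxy_angle(video_space, text_space, curv)
    aperture = half_aperture(video_space, curv)
    return (angle - aperture + margin).clamp(min=0.0).mean()

# Toy usage with a batch of 4 paired (video, text) space components in R^128.
video = torch.randn(4, 128, requires_grad=True)
text = torch.randn(4, 128)
loss = partial_order_loss(video, text)
loss.backward()
print(loss.item())
```

Because the angle-based penalty is differentiable, a loss of this form can be added to a standard retrieval objective (e.g., a triplet or InfoNCE loss) with a weighting coefficient.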