Neural-Driven Image Editing
July 7, 2025
Authors: Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You
cs.AI
Abstract
Traditional image editing typically relies on manual prompting, making it
labor-intensive and inaccessible to individuals with limited motor control or
language abilities. Leveraging recent advances in brain-computer interfaces
(BCIs) and generative models, we propose LoongX, a hands-free image editing
approach driven by multimodal neurophysiological signals. LoongX utilizes
state-of-the-art diffusion models trained on a comprehensive dataset of 23,928
image editing pairs, each paired with synchronized electroencephalography
(EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography
(PPG), and head motion signals that capture user intent. To effectively address
the heterogeneity of these signals, LoongX integrates two key modules. The
cross-scale state space (CS3) module encodes informative modality-specific
features. The dynamic gated fusion (DGF) module further aggregates these
features into a unified latent space, which is then aligned with edit semantics
via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train
the encoders with contrastive learning to align cognitive states with the
semantic intent embedded in natural language. Extensive experiments demonstrate
that LoongX achieves performance comparable to text-driven methods (CLIP-I:
0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural
signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results
highlight the promise of neural-driven generative models in enabling
accessible, intuitive image editing and open new directions for
cognitive-driven creative technologies. Datasets and code will be released to
support future work and foster progress in this emerging area.
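The abstract describes the dynamic gated fusion (DGF) module only at a high level. As a rough illustration of gated multimodal fusion in general (not the paper's released implementation), the PyTorch sketch below combines per-modality embeddings with learned softmax gates into a single latent; the class name GatedFusion, the feature dimensions, and the gating design are assumptions made for this example.

```python
# Minimal sketch of gated multimodal fusion, assuming each modality encoder
# (EEG, fNIRS, PPG, head motion) already yields a fixed-size embedding.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse per-modality feature vectors into one latent via learned gates."""
    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        # One gating network scores every modality from the concatenated features.
        self.gate = nn.Sequential(
            nn.Linear(dim * num_modalities, num_modalities),
            nn.Softmax(dim=-1),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: list of (batch, dim) tensors, one per modality.
        stacked = torch.stack(feats, dim=1)               # (batch, M, dim)
        weights = self.gate(torch.cat(feats, dim=-1))     # (batch, M)
        fused = (weights.unsqueeze(-1) * stacked).sum(1)  # (batch, dim)
        return self.proj(fused)                           # unified latent for conditioning

if __name__ == "__main__":
    batch, dim = 4, 256
    modalities = [torch.randn(batch, dim) for _ in range(4)]  # EEG, fNIRS, PPG, motion
    fusion = GatedFusion(dim=dim, num_modalities=4)
    print(fusion(modalities).shape)  # torch.Size([4, 256])
```

The softmax gate lets the model weight modalities per sample, which is one plausible way to handle the signal heterogeneity the abstract mentions; the actual DGF design may differ.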
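Likewise, the contrastive pre-training step can be pictured with a standard CLIP-style symmetric InfoNCE objective between neural-signal embeddings and language embeddings of the paired edit instruction. The function name contrastive_alignment_loss and the temperature value are hypothetical; the paper's actual objective is not specified in the abstract.

```python
# Hypothetical CLIP-style contrastive objective for aligning neural-signal
# embeddings with language embeddings of the edit instruction (illustrative only).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(neural_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: matched (neural, text) pairs score high, others low."""
    neural_emb = F.normalize(neural_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = neural_emb @ text_emb.t() / temperature       # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_n2t = F.cross_entropy(logits, targets)            # neural -> text direction
    loss_t2n = F.cross_entropy(logits.t(), targets)        # text -> neural direction
    return 0.5 * (loss_n2t + loss_t2n)

if __name__ == "__main__":
    neural = torch.randn(8, 256)   # stand-in for fused neural-signal embeddings
    text = torch.randn(8, 256)     # stand-in for instruction embeddings
    print(contrastive_alignment_loss(neural, text).item())
```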