Neural-Driven Image Editing

July 7, 2025
Authors: Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You
cs.AI

Abstract

Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
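The dynamic gated fusion (DGF) described above aggregates heterogeneous modality features (EEG, fNIRS, PPG, head motion) into a single latent conditioning vector. The PyTorch sketch below illustrates one way such input-dependent gating could be wired; the small MLP encoders standing in for the CS3 module, the feature dimensions, and the softmax gate are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse per-modality embeddings (e.g. EEG, fNIRS, PPG, head motion)
    into one latent vector with learned, input-dependent gates."""

    def __init__(self, dim: int = 256, num_modalities: int = 4):
        super().__init__()
        # One small encoder per modality; stand-ins for the paper's CS3 encoders.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.LazyLinear(dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_modalities)
        )
        # The gate scores each modality from the concatenated features.
        self.gate = nn.Linear(dim * num_modalities, num_modalities)
        self.out = nn.Linear(dim, dim)

    def forward(self, signals):
        # signals: list of (batch, feature_dim_i) tensors, one per modality.
        feats = [enc(x) for enc, x in zip(self.encoders, signals)]       # each (B, dim)
        stacked = torch.stack(feats, dim=1)                              # (B, M, dim)
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), -1) # (B, M)
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)             # (B, dim)
        return self.out(fused)


if __name__ == "__main__":
    batch = 8
    # Toy pre-pooled feature vectors: EEG (64-d), fNIRS (32-d), PPG (8-d), motion (6-d).
    signals = [torch.randn(batch, d) for d in (64, 32, 8, 6)]
    latent = GatedFusion(dim=256, num_modalities=4)(signals)
    print(latent.shape)  # torch.Size([8, 256])
```

In this toy setup each modality is pre-pooled into a fixed-length vector, and the gate can down-weight modalities that carry little intent information for a given sample before the fused latent is used to condition the diffusion transformer.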
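The abstract also mentions contrastive pre-training of the encoders to align cognitive states with the semantics of natural-language edit instructions. A minimal CLIP-style symmetric InfoNCE objective for such alignment is sketched below, assuming paired batches of fused neural embeddings and text embeddings of the corresponding instructions; the symmetric form, temperature value, and embedding dimensions are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(neural_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th neural embedding should match the i-th
    instruction embedding and repel the other pairs in the batch."""
    neural_emb = F.normalize(neural_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = neural_emb @ text_emb.t() / temperature           # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_n2t = F.cross_entropy(logits, targets)                # neural -> text direction
    loss_t2n = F.cross_entropy(logits.t(), targets)            # text -> neural direction
    return 0.5 * (loss_n2t + loss_t2n)


# Toy usage: a batch of 8 paired 256-d embeddings.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```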