Neural-Driven Image Editing

Abstract

Traditional image editing typically relies on manual prompting, which makes it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX uses state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each accompanied by synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module then aggregates these features into a unified latent space, which is aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models for accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. Datasets and code will be released to support future work and foster progress in this emerging area.
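
As a rough illustration of two ideas named in the abstract, gated fusion of modality-specific features and contrastive alignment of signal embeddings with text embeddings, the following is a minimal sketch in plain Python. It is not the LoongX implementation: the abstract does not specify the gating function or the loss, so the softmax gate, the symmetric InfoNCE loss, the temperature of 0.07, and all names (`gated_fusion`, `info_nce`, `gate_weights`) are assumptions chosen for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def gated_fusion(features, gate_weights):
    """Fuse per-modality feature vectors with input-dependent gates.

    features:     dict mapping modality name -> list of floats (same length d)
    gate_weights: dict mapping modality name -> list of floats (length d);
                  hypothetical stand-ins for learned parameters.
    Returns (fused_vector, gates), where gates sum to 1 across modalities.
    """
    names = sorted(features)
    # One scalar score per modality: dot product of feature and gate weights.
    scores = [sum(w * x for w, x in zip(gate_weights[n], features[n])) for n in names]
    gates = [math.exp(v) for v in log_softmax(scores)]
    dim = len(features[names[0]])
    # Convex combination of the modality features, weighted by the gates.
    fused = [sum(g * features[n][i] for g, n in zip(gates, names)) for i in range(dim)]
    return fused, dict(zip(names, gates))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def info_nce(signal_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch; pair i is the positive for row i."""
    logits = [[cosine(s, t) / temperature for t in text_embs] for s in signal_embs]

    def mean_diag_nll(mat):
        # Average negative log-probability assigned to the matching pair.
        return -sum(log_softmax(row)[i] for i, row in enumerate(mat)) / len(mat)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (mean_diag_nll(logits) + mean_diag_nll(transposed))
```

In this sketch, a call such as `gated_fusion({"eeg": [...], "fnirs": [...]}, weights)` returns a single fused vector plus the per-modality gate values, and `info_nce` is lower when each signal embedding is closest to its own text embedding than when the pairs are mismatched.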
