FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing

Abstract

Fashion AI systems routinely encode the aesthetic logic of specific houses, editors, and historical moments without disclosing it. We present FASH-iCNN, a multimodal system trained on 87,547 Vogue runway images from 15 fashion houses spanning 1991 to 2024, that makes this cultural logic inspectable. Given a photograph of a garment, the system recovers which house produced it, which era it belongs to, and which color tradition it reflects. A clothing-only model identifies the fashion house at 78.2% top-1 accuracy across 14 houses, the decade at 88.6% top-1, and the specific year at 58.3% top-1 across 34 years, with a mean error of just 2.2 years. Probing which visual channels carry this signal reveals a sharp dissociation: removing color costs only 10.6 percentage points (pp) of house-identification accuracy, while removing texture costs 37.6 pp, establishing texture and luminance as the primary carriers of editorial identity. FASH-iCNN treats editorial culture as the signal rather than background noise: it identifies which houses, eras, and color traditions shaped each output, so that users can see not just what the system predicts but which houses, editors, and historical moments are encoded in that prediction.
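
The channel-probing comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not say how color or texture is removed, so the two ablations below (luminance-only conversion with Rec. 601 weights, and block averaging) are assumptions chosen for illustration, as are all function names and the `predict` callable standing in for a trained classifier. The sketch only shows how an accuracy drop in percentage points might be measured for each ablation.

```python
import numpy as np

def remove_color(img):
    """Assumed color ablation: keep luminance only, replicated over 3 channels."""
    weights = np.array([0.299, 0.587, 0.114])  # Rec. 601 luma weights
    luma = img @ weights                       # (H, W, 3) @ (3,) -> (H, W)
    return np.repeat(luma[..., None], 3, axis=2)

def remove_texture(img, k=8):
    """Assumed texture ablation: average each k x k block, keeping coarse color."""
    h, w, c = img.shape
    h2, w2 = h - h % k, w - w % k              # crop so the image tiles evenly
    blocks = img[:h2, :w2].reshape(h2 // k, k, w2 // k, k, c).mean(axis=(1, 3))
    return np.repeat(np.repeat(blocks, k, axis=0), k, axis=1)

def top1_accuracy(predict, images, labels):
    """Fraction of images whose predicted label equals the true label."""
    preds = np.array([predict(x) for x in images])
    return float((preds == np.asarray(labels)).mean())

def drop_pp(predict, images, labels, ablate):
    """Accuracy lost under an ablation, in percentage points."""
    base = top1_accuracy(predict, images, labels)
    ablated = top1_accuracy(predict, [ablate(x) for x in images], labels)
    return 100.0 * (base - ablated)
```

Under these assumptions, calling `drop_pp(predict, images, labels, remove_color)` and `drop_pp(predict, images, labels, remove_texture)` on the same held-out set gives two numbers that can be compared directly, in the spirit of the 10.6 pp versus 37.6 pp contrast reported above.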
