arXiv cs.LG (Machine Learning) INT ai 2026-05-08 13:00

A Fragility Hidden in the Multiplicative Interaction: Findings on Multimodal Contrastive Learning

Original title: Hidden in the Multiplicative Interaction: Uncovering Fragility in Multimodal Contrastive Learning

Analysis results

Category: Education
Importance: 59
Trend score: 18
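Summary: Contrastive learning has become a standard approach for unsupervised learning from paired data, as CLIP demonstrates for image-text matching. Many domains, however, involve more than two modalities, and Symile extends CLIP to that setting with a multilinear inner product over modality embeddings. This work shows that the multiplicative interaction hides a fragility: a single weakly informative, misaligned, or missing modality can distort cross-modal retrieval scores. It proposes Gated Symile, a gating mechanism that suppresses unreliable modalities.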
Keywords: contrastive learning, Symile, modality, fragility, more than two modalities

arXiv:2604.05834v2 Announce Type: replace Abstract: Contrastive learning has become a standard approach for unsupervised learning from paired data, as demonstrated by CLIP for image-text matching. However, many domains involve more than two modalities and require objectives that capture higher-order dependencies beyond pairwise alignment. Symile extends CLIP to this setting by replacing the dot product with the multilinear inner product (MIP) over modality embeddings. In this work, we show that a fragility is hidden in the multiplicative interaction: a single weakly informative, misaligned, or missing modality can propagate through the objective and distort cross-modal retrieval scores. We propose Gated Symile, a contrastive gating mechanism that adapts modality contributions on an attention-based, per-candidate basis. The gate suppresses unreliable inputs by interpolating embeddings toward learnable neutral directions, with an explicit NULL option when reliable cross-modal alignment is unlikely. Across a controlled synthetic benchmark that uncovers this fragility and three real-world trimodal datasets, Gated Symile achieves higher top-1 retrieval accuracy than well-tuned state-of-the-art (sota) baselines. More broadly, our results highlight gating as a step toward robust multimodal contrastive learning beyond two modalities in the presence of noise, misalignment, or missing inputs.
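The multiplicative fragility described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the MIP is the sum over coordinates of the elementwise product of the embeddings, that a missing modality is encoded as a zero vector, and that the neutral direction is a fixed all-ones vector with a hand-set scalar gate `g`. In Gated Symile the neutral directions are learnable and the gate is attention-based and per-candidate; the helper names `mip` and `gate` are hypothetical.

```python
import numpy as np


def mip(*embeddings):
    """Multilinear inner product: sum over coordinates of the elementwise product."""
    prod = np.ones_like(embeddings[0])
    for e in embeddings:
        prod = prod * e  # broadcasts (d,) against (n, d) candidate matrices
    return prod.sum(axis=-1)


def gate(embedding, neutral, g):
    """Interpolate toward a neutral direction: g=1 keeps the input, g=0 replaces it.

    Assumption: a scalar gate set by hand; the paper describes an
    attention-based, per-candidate gate instead.
    """
    return g * embedding + (1.0 - g) * neutral


rng = np.random.default_rng(0)
d, n = 8, 5
query = rng.normal(size=d)            # embedding of modality A
candidates = rng.normal(size=(n, d))  # n candidate embeddings of modality B
third = rng.normal(size=d)            # embedding of modality C

# With two inputs the MIP is the ordinary dot product used by CLIP.
pairwise_scores = candidates @ query
two_way_scores = mip(query, candidates)

# A missing third modality (assumed to be all zeros) wipes out every score,
# so the candidate ranking carries no information.
missing_scores = mip(query, candidates, np.zeros(d))

# Gating the third modality fully toward an all-ones neutral direction
# (hypothetical choice) recovers the pairwise scores.
neutral = np.ones(d)
gated_scores = mip(query, candidates, gate(third, neutral, g=0.0))
```

Under these assumptions, `missing_scores` is zero for every candidate, while `gated_scores` matches `pairwise_scores`. An all-ones vector is a natural neutral element for an elementwise product, which is why the sketch uses it; whether the learned neutral directions in the paper behave this way is not stated in the abstract.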