PARASITE: 条件付きシステムプロンプトの毒性付与によるLLMのハイジャック
原題: PARASITE: Conditional System Prompt Poisoning to Hijack LLMs
分析結果
- カテゴリ
- AI
- 重要度
- 81
- トレンドスコア
- 40
- 要約
- 大規模言語モデル(LLM)が公共のマーケットプレイスからダウンロードされたサードパーティのシステムプロンプトを介して展開されることが増えています。本研究では、供給チェーンの重要な脆弱性を特定します。
- キーワード
- 長期重要性
- 数年で重要
- ビジネス可能性
- 高い - セキュリティ対策やプロンプト管理の需要が増加する可能性がある。
- 日本波及可能性
- 高 - 日本でもAI技術の導入が進んでおり、セキュリティ対策が求められるため。
arXiv:2505.16888v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a ``sleeper agent'' into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, PARASITE, optimizes system prompts to trigger LLMs to output targeted, compromised responses only for specific queries (e.g., ``Who should I vote for the US President?'') while maintaining high utility on benign inputs. Operating in a strict black-box setting without model weight access, PARASITE utilizes a two-stage optimization including a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), PARASITE achieves up to 70\% F1 reduction on targeted queries with minimal degradation to general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo-correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/PARASITE. WARNING: Our paper contains examples that might be sensitive to the readers! arXiv:2505.16888v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly deployed via third-party system prompts downloaded from public marketplaces. We identify a critical supply-chain vulnerability: conditional system prompt poisoning, where an adversary injects a ``sleeper agent'' into a benign-looking prompt. Unlike traditional jailbreaks that aim for broad refusal-breaking, our proposed framework, PARASITE, optimizes system prompts to trigger LLMs to output targeted, compromised responses only for specific queries (e.g., ``Who should I vote for the US President?'') while maintaining high utility on benign inputs. Operating in a strict black-box setting without model weight access, PARASITE utilizes a two-stage optimization including a global semantic search followed by a greedy lexical refinement. Tested on open-source models and commercial APIs (GPT-4o-mini, GPT-3.5), PARASITE achieves up to 70\% F1 reduction on targeted queries with minimal degradation to general capabilities. We further demonstrate that these poisoned prompts evade standard defenses, including perplexity filters and typo-correction, by exploiting the natural noise found in real-world system prompts. Our code and data are available at https://github.com/vietph34/PARASITE. WARNING: Our paper contains examples that might be sensitive to the readers!