原則に基づくLLM安全性テストに向けて:脱獄オラクル問題の解決
原題: Toward Principled LLM Safety Testing: Solving the Jailbreak Oracle Problem
分析結果
- カテゴリ
- AI
- 重要度
- 85
- トレンドスコア
- 34
- 要約
- 大規模言語モデル(LLM)が安全性が重要なアプリケーションでの利用が増える中、脱獄に対する脆弱性を評価する体系的な方法が不足しています。
- キーワード
- 長期重要性
- 数年で重要
- ビジネス可能性
- 高い - セキュリティ評価ツールやサービスの開発が期待される。
- 日本波及可能性
- 高 - 日本でもAIの安全性が重要視されており、関連技術の導入が進む可能性がある。
arXiv:2506.17299v2 Announce Type: replace-cross Abstract: As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges, as the search space grows exponentially with response length. We present Boa, the first system designed for efficiently solving the jailbreak oracle problem. Boa employs a two-phase search strategy: (1) breadth-first sampling to identify easily accessible jailbreaks, followed by (2) depth-first priority search guided by fine-grained safety scores to systematically explore promising yet low-probability paths. Boa enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions. Code is available at https://github.com/shuyilinn/BOA/tree/mlsys2026ae arXiv:2506.17299v2 Announce Type: replace-cross Abstract: As large language models (LLMs) become increasingly deployed in safety-critical applications, the lack of systematic methods to assess their vulnerability to jailbreak attacks presents a critical security gap. We introduce the jailbreak oracle problem: given a model, prompt, and decoding strategy, determine whether a jailbreak response can be generated with likelihood exceeding a specified threshold. This formalization enables a principled study of jailbreak vulnerabilities. Answering the jailbreak oracle problem poses significant computational challenges, as the search space grows exponentially with response length. We present Boa, the first system designed for efficiently solving the jailbreak oracle problem. Boa employs a two-phase search strategy: (1) breadth-first sampling to identify easily accessible jailbreaks, followed by (2) depth-first priority search guided by fine-grained safety scores to systematically explore promising yet low-probability paths. Boa enables rigorous security assessments including systematic defense evaluation, standardized comparison of red team attacks, and model certification under extreme adversarial conditions. Code is available at https://github.com/shuyilinn/BOA/tree/mlsys2026ae