Web: www.ndss-symposium.org US web_search 2026-05-06 08:51

大規模言語モデルに対する安全性の不整合 - NDSSシンポジウム

原題: Safety Misalignment Against Large Language Models - NDSS Symposium

分析結果

カテゴリ: 教育
重要度: 56
トレンドスコア: 20
要約: 本記事では、大規模言語モデル（LLM）の安全性に関する不整合について議論されています。著者たちは、LLMが意図しない結果を引き起こす可能性や、ユーザーの期待と実際の出力との間に生じるギャップについて分析しています。特に、LLMの設計や運用におけるリスクを軽減するための提案がなされており、今後の研究や実装における重要な課題が浮き彫りにされています。
キーワード: university safety misalignment large tsinghua alignment against language

Safety Misalignment Against Large Language Models - NDSS Symposium Safety Misalignment Against Large Language Models Yichen Gong (Tsinghua University), Delong Ran (Tsinghua University), Xinlei He (Hong Kong University of Science and Technology (Guangzhou)), Tianshuo Cong (Tsinghua University), Anyu Wang (Tsinghua University), Xiaoyun Wang (Tsinghua University) The safety alignment of Large Language Models (LLMs) is crucial to prevent unsafe content that violates human values. To ensure this, it is essential to evaluate the robustness of their alignment against diverse malicious attacks. However, the lack of a large-scale, unified measurement framework hinders a comprehensive understanding of potential vulnerabilities. To fill this gap, this paper presents the first comprehensive evaluation of existing and newly proposed safety misalignment methods for LLMs. Specifically, we investigate four research questions: (1) evaluating the robustness of LLMs with different alignment strategies, (2) identifying the most effective misalignment method, (3) determining key factors that influence misalignment effectiveness, and (4) exploring various defenses. The safety misalignment attacks in our paper include system-prompt modification, model fine-tuning, and model editing. Our findings show that Supervised Fine-Tuning is the most potent attack but requires harmful model responses. In contrast, our novel Self-Supervised Representation Attack (SSRA) achieves significant misalignment without harmful responses. We also examine defensive mechanisms such as safety data filter, model detoxification, and our proposed Self-Supervised Representation Defense (SSRD), demonstrating that SSRD can effectively re-align the model. In conclusion, our unified safety alignment evaluation framework empirically highlights the fragility of the safety alignment of LLMs. Paper Slides Video View More Papers “Do We Call Them That? Absolutely Not.”: Juxtaposing the... Alexandra Klymenko (Technical University of Munich), Stephen Meisenbacher (Technical University of Munich), Luca Favaro (Technical University of Munich), and Florian Matthes (Technical University of Munich) Read More Poster: FORESIGHT, A Unified Framework for Threat Modeling and... ChaeYoung Kim (Seoul Women's University), Kyounggon Kim (Naif Arab University for Security Sciences) Read More Spatial-Domain Wireless Jamming with Reconfigurable Intelligent Surfaces Philipp Mackensen (Ruhr University Bochum), Paul Staat (Max Planck Institute for Security and Privacy), Stefan Roth (Ruhr University Bochum), Aydin Sezgin (Ruhr University Bochum), Christof Paar (Max Planck Institute for Security and Privacy), Veelasha Moonsamy (Ruhr University Bochum) Read More

大規模言語モデルに対する安全性の不整合 - NDSSシンポジウム

分析結果

類似記事（ベクトル近傍）