Global Trend Radar
arXiv cs.AI INT ai 2026-04-28 13:00

Patching LLM Like Software: A Lightweight Method for Improving Safety Policy in Large Language Models

Analysis Results

Category
AI
Importance
69
Trend Score
28
Summary
This paper proposes patching LLMs like software versions, a lightweight and modular approach to addressing safety vulnerabilities in large language models.
Abstract
arXiv:2511.08484v2 (Announce Type: replace). We propose patching for large language models (LLMs) like software versions, a lightweight and modular approach for addressing safety vulnerabilities. While vendors release improved LLM versions, major releases are costly, infrequent, and difficult to tailor to customer needs, leaving released models with known safety gaps. Unlike full-model fine-tuning or major version updates, our method enables rapid remediation by prepending a compact, learnable prefix to an existing model. This "patch" introduces only 0.003% additional parameters, yet reliably steers model behavior toward that of a safer reference model. Across three critical domains (toxicity mitigation, bias reduction, and harmfulness refusal), policy patches achieve safety improvements comparable to next-generation safety-aligned models while preserving fluency. Our results demonstrate that LLMs can be "patched" much like software, offering vendors and practitioners a practical mechanism for distributing scalable, efficient, and composable safety updates between major model releases.
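
The abstract describes the patch as a compact, learnable prefix that steers the base model toward a safer reference model. The sketch below illustrates one plausible reading of that idea, assuming a prefix-tuning-style setup trained with a KL objective against the reference model; the model names ("gpt2", "gpt2-medium"), prefix length, prompt, and loss are illustrative assumptions, not the paper's actual recipe.

```python
# Minimal sketch of prefix-style "policy patching" in PyTorch with Hugging
# Face Transformers. All concrete choices here are illustrative assumptions:
# "gpt2" stands in for the deployed model, "gpt2-medium" for the safer
# reference model; prefix_len and the KL objective are plausible guesses.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")        # frozen deployed model
ref = AutoModelForCausalLM.from_pretrained("gpt2-medium")  # frozen "safer" reference
base.requires_grad_(False)
ref.requires_grad_(False)

# The "patch": a handful of learnable prefix embeddings. For GPT-2 this is
# prefix_len * 768 parameters, a vanishingly small fraction of the full
# model, in the spirit of the abstract's 0.003% figure.
prefix_len = 8
embed_dim = base.get_input_embeddings().embedding_dim
prefix = torch.nn.Parameter(0.02 * torch.randn(prefix_len, embed_dim))

def patched_logits(input_ids):
    """Run the frozen base model with the learnable prefix prepended."""
    tok_emb = base.get_input_embeddings()(input_ids)             # (B, T, D)
    pre = prefix.unsqueeze(0).expand(input_ids.size(0), -1, -1)  # (B, P, D)
    out = base(inputs_embeds=torch.cat([pre, tok_emb], dim=1))
    return out.logits[:, prefix_len:, :]                         # align with input_ids

opt = torch.optim.AdamW([prefix], lr=1e-3)
batch = tok(["Tell me something harmful."], return_tensors="pt")  # toy prompt

for step in range(100):
    with torch.no_grad():
        ref_logp = F.log_softmax(ref(batch.input_ids).logits, dim=-1)
    pat_logp = F.log_softmax(patched_logits(batch.input_ids), dim=-1)
    # KL(reference || patched): pull the patched next-token distribution
    # toward the safer reference model's behavior.
    loss = F.kl_div(pat_logp, ref_logp, log_target=True, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```

At deployment, such a trained prefix would simply be prepended to every request while the base weights stay untouched, which is what makes distribution cheap. The "composable" updates mentioned in the abstract could plausibly be realized by concatenating several independently trained prefixes, though that reading is an assumption here.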

Similar Articles (vector nearest neighbors)