arXiv cs.LG (Machine Learning) INT ai 2026-06-26 13:00

Learning Robust Penetration Testing Policies under Partial Observability: A Systematic Evaluation

Original title: Learning Robust Penetration Testing Policies under Partial Observability: A systematic evaluation

Analysis Results

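Category: Education
Importance: 59
Trend score: 18
Summary: This study presents a systematic evaluation of learning penetration testing policies under partial observability. Penetration testing is an important technique for assessing system vulnerabilities, but limited observations make it difficult to learn effective policies. The paper compares a range of methods and proposes the design and evaluation of robust policies that account for the effects of partial observability.
Keywords: policies real world penetration testing partial observability presents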

arXiv:2509.20008v2 Announce Type: replace Abstract: Penetration testing, the simulation of cyberattacks to identify security vulnerabilities, presents a sequential decision-making problem well-suited for reinforcement learning (RL) automation. Like many applications of RL to real-world problems, partial observability presents a major challenge, as it invalidates the Markov property present in Markov Decision Processes (MDPs). Partially Observable MDPs require history aggregation or belief state estimation to learn successful policies. We investigate stochastic, partially observable penetration testing scenarios over host networks of varying size, aiming to better reflect real-world complexity through more challenging and representative benchmarks. This approach leads to the development of more robust and transferable policies, which are crucial for ensuring reliable performance across diverse and unpredictable real-world environments. Using vanilla Proximal Policy Optimization (PPO) as a baseline, we compare a selection of PPO-based variants designed to mitigate partial observability, including frame-stacking, augmenting observations with historical information, and employing LSTM or TrXL architectures. We conduct a systematic empirical analysis of these algorithms across different host network sizes. We find that this task greatly benefits from history aggregation, which converges up to four times faster than the other approaches. Manual inspection of the policies learned by the algorithms reveals clear distinctions and provides insights that go beyond quantitative results.
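
The abstract names frame-stacking as one way to mitigate partial observability. As a rough illustration of the general idea only, and not the paper's implementation, the sketch below concatenates the k most recent observations into one fixed-size vector that a policy could consume. The class name `FrameStacker`, its parameters `k` and `obs_dim`, and the zero-padding at episode start are all assumptions made for this example.

```python
from collections import deque
from typing import Deque, List, Sequence


class FrameStacker:
    """Hypothetical helper: stack the k most recent observations."""

    def __init__(self, k: int, obs_dim: int) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.obs_dim = obs_dim
        # A bounded deque drops the oldest frame automatically.
        self.frames: Deque[List[float]] = deque(maxlen=k)
        self.reset()

    def reset(self) -> None:
        # Assumption: pad with zero vectors so the stacked size is
        # constant from the first step of an episode.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append([0.0] * self.obs_dim)

    def push(self, obs: Sequence[float]) -> List[float]:
        if len(obs) != self.obs_dim:
            raise ValueError("unexpected observation size")
        self.frames.append(list(obs))
        # Flatten from oldest to newest into a vector of length k * obs_dim.
        return [x for frame in self.frames for x in frame]


if __name__ == "__main__":
    stacker = FrameStacker(k=3, obs_dim=2)
    print(stacker.push([1.0, 0.0]))
    print(stacker.push([0.0, 1.0]))
```

Calling `reset()` at each episode boundary keeps observations from one episode from leaking into the next. The stacked vector grows linearly with k, so it only exposes a short, fixed window of history, unlike recurrent or transformer-based memory.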