Global Trend Radar
arXiv cs.LG (Machine Learning) INT ai 2026-06-26 13:00

二人プレイヤーゼロサム確率ゲームにおける分散型ベストレスポンス学習:有限サンプル分析

原題: Decentralized Best-Response-Based Learning in Two-Player Zero-Sum Stochastic Games: A Finite-Sample Analysis

元記事を開く →

分析結果

カテゴリ
教育
重要度
53
トレンドスコア
12
要約
本研究では、二人プレイヤーのゼロサム確率ゲームにおける分散型ベストレスポンス学習の有限サンプル分析を行います。プレイヤーは相手の戦略に基づいて最適な応答を学習し、ゲームの結果に影響を与える方法を探ります。特に、有限のサンプルサイズにおける学習の収束性や効率性について考察し、理論的な結果を実証的に検証します。
キーワード
arXiv:2409.01447v3 Announce Type: replace Abstract: We present a finite-sample analysis of decentralized learning in two-player zero-sum matrix games and stochastic games, with a focus on best-response-based learning algorithms. In matrix games, the learning algorithm is payoff-based and symmetric: each player updates its policy using only its own payoff observations, incrementally moving toward an estimated smoothed best response to the opponent's latest policy. For stochastic games, we build on this matrix-game primitive to develop a learning algorithm called value iteration with smoothed best response (VI-SBR), which combines smoothed-best-response learning in induced matrix games with a decentralized, model-free approximation of minimax value iteration. We establish finite-sample guarantees in both settings. For matrix games, our results imply a sample complexity of $\mathcal{O}(\epsilon^{-1})$ for finding an $\epsilon$-Nash distribution and, with explicit exploration, $\tilde{\mathcal{O}}(\epsilon^{-8})$ for finding an $\epsilon$-Nash equilibrium. For stochastic games, we prove that the exploration-enhanced VI-SBR algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-8})$ for finding an $\epsilon$-Nash equilibrium. Technically, our analysis develops a coupled Lyapunov-drift framework. This framework simultaneously handles stochastic iterative algorithms with multiple interacting stochastic iterates, the non-zero-sum auxiliary games generated by independently updated value functions, and the time-inhomogeneous Markovian noise induced by time-varying policies. The resulting tools may be useful more broadly for analyzing learning algorithms with coupled stochastic iterates and nonstationary sampling processes. arXiv:2409.01447v3 Announce Type: replace Abstract: We present a finite-sample analysis of decentralized learning in two-player zero-sum matrix games and stochastic games, with a focus on best-response-based learning algorithms. In matrix games, the learning algorithm is payoff-based and symmetric: each player updates its policy using only its own payoff observations, incrementally moving toward an estimated smoothed best response to the opponent's latest policy. For stochastic games, we build on this matrix-game primitive to develop a learning algorithm called value iteration with smoothed best response (VI-SBR), which combines smoothed-best-response learning in induced matrix games with a decentralized, model-free approximation of minimax value iteration. We establish finite-sample guarantees in both settings. For matrix games, our results imply a sample complexity of $\mathcal{O}(\epsilon^{-1})$ for finding an $\epsilon$-Nash distribution and, with explicit exploration, $\tilde{\mathcal{O}}(\epsilon^{-8})$ for finding an $\epsilon$-Nash equilibrium. For stochastic games, we prove that the exploration-enhanced VI-SBR algorithm achieves a sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-8})$ for finding an $\epsilon$-Nash equilibrium. Technically, our analysis develops a coupled Lyapunov-drift framework. This framework simultaneously handles stochastic iterative algorithms with multiple interacting stochastic iterates, the non-zero-sum auxiliary games generated by independently updated value functions, and the time-inhomogeneous Markovian noise induced by time-varying policies. The resulting tools may be useful more broadly for analyzing learning algorithms with coupled stochastic iterates and nonstationary sampling processes.