Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning
Analysis
- Category: Law / Institutions
- Importance: 61
- Trend score: 20
- Summary: Scaling critic capacity is a promising direction in off-policy reinforcement learning. However, recent work has shown that larger critics are prone to overfitting. To address this problem, the paper proposes using Low-Rank Adaptation to improve critic performance.
- Keywords
arXiv:2604.18978v2 Abstract: Scaling critic capacity is a promising direction for improving off-policy reinforcement learning (RL). However, recent work shows that larger critics are prone to overfitting and instability in replay-based bootstrapped training. In this paper, we propose using Low-Rank Adaptation (LoRA) as a structural regularizer for critic learning. Our approach freezes randomly initialized base matrices and optimizes only the corresponding low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. We evaluate our method across different off-policy RL algorithms, including SAC and FastTD3 based on different network architectures. Empirically, LoRA efficiently reduces critic loss during training and improves overall policy performance, achieving the best or competitive results on most tasks. Extensive experiments demonstrate that our low-rank updates provide a simple and effective form of structural regularization for critic learning in off-policy RL.
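As a rough illustration of the idea in the abstract, the sketch below (PyTorch assumed) builds a critic whose hidden layers keep a frozen, randomly initialized base weight and train only a low-rank adapter, so that updates are confined to a low-dimensional subspace. The layer names, rank, scaling factor, and network sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Linear layer: frozen random base weight plus a trainable low-rank update."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Randomly initialized base matrix, kept frozen during training.
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # Trainable low-rank factors: W_eff = W_base + (alpha / rank) * B @ A.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


class Critic(nn.Module):
    """Q(s, a) critic whose hidden layers update only through low-rank adapters."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256, rank: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            LoRALinear(obs_dim + act_dim, hidden, rank), nn.ReLU(),
            LoRALinear(hidden, hidden, rank), nn.ReLU(),
            nn.Linear(hidden, 1),  # output head left fully trainable in this sketch
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))
```

In use, only the trainable parameters would be handed to the optimizer, e.g. `torch.optim.Adam((p for p in critic.parameters() if p.requires_grad), lr=3e-4)`; how this critic is then plugged into SAC or FastTD3 follows those algorithms' standard bootstrapped update and is not shown here.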