Global Trend Radar
arXiv cs.LG (Machine Learning) INT ai 2026-05-08 13:00

MARVL: Multi-Stage Guidance for Robotic Manipulation via Vision-Language Models

Analysis

Category
Education
Importance
59
Trend Score
18
Summary
Designing dense reward functions is key to efficient robotic reinforcement learning (RL). However, most dense rewards are manually engineered, which is a fundamental limitation. This work proposes multi-stage guidance via vision-language models and shows how it improves the efficiency of robot manipulation.
Abstract
arXiv:2602.15872v3 Announce Type: replace-cross Abstract: Designing dense reward functions is pivotal for efficient robotic Reinforcement Learning (RL). However, most dense rewards rely on manual engineering, which fundamentally limits the scalability and automation of reinforcement learning. While Vision-Language Models (VLMs) offer a promising path to reward design, naive VLM rewards often misalign with task progress, struggle with spatial grounding, and show limited understanding of task semantics. To address these issues, we propose MARVL-Multi-stAge guidance for Robotic manipulation via Vision-Language models. MARVL fine-tunes a VLM for spatial and semantic consistency and decomposes tasks into multi-stage subtasks with task direction projection for trajectory sensitivity. Empirically, MARVL significantly outperforms existing VLM-reward methods on the Meta-World benchmark, demonstrating superior sample efficiency and robustness on sparse-reward manipulation tasks.
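The multi-stage decomposition described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: in MARVL the per-stage progress signal would come from a fine-tuned VLM, which is replaced here by hand-written toy functions, and all names (`Stage`, `staged_reward`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Stage:
    """One subtask of a decomposed manipulation task."""
    name: str
    # Maps an observation to progress in [0, 1]; in MARVL this signal
    # would be produced by a fine-tuned VLM rather than a hand-coded rule.
    progress: Callable[[Dict[str, float]], float]

def staged_reward(stages: List[Stage], obs: Dict[str, float]) -> float:
    """Dense reward in [0, 1]: completed stages count fully, plus
    fractional progress in the first incomplete stage."""
    total = 0.0
    for stage in stages:
        p = max(0.0, min(1.0, stage.progress(obs)))
        total += p
        if p < 1.0:
            break  # later stages contribute nothing until this one completes
    return total / len(stages)

# Toy reach-then-grasp task with scalar observations.
stages = [
    Stage("reach", lambda o: 1.0 - o["dist_to_object"]),
    Stage("grasp", lambda o: o["grip_closure"]),
]
r = staged_reward(stages, {"dist_to_object": 0.0, "grip_closure": 0.5})
# reach complete (1.0) + grasp halfway (0.5), normalized over 2 stages -> 0.75
```

Gating later stages on earlier ones keeps the reward aligned with task order, one of the misalignment issues with naive single-shot VLM rewards that the abstract points to.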