arXiv cs.AI INT ai 2026-05-26 13:00

UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents

Analysis Results

Category: AI
Importance: 87
Trend score: 46
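Summary: UniToolCall is a unified framework that standardizes tool-use representation, data, and evaluation for LLM agents. It curates a pool of 22k+ tools and a training corpus of 390k+ instances that covers multi-turn interaction. A Qwen3-8B model fine-tuned on this corpus reaches 93.0% single-turn Strict Precision in the Hybrid-20 setting, outperforming commercial models including GPT, Gemini, and Claude.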
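Keywords: UniToolCall, LLM agents, tool use, unified framework, multi-turn reasoning, dataset, performance improvement
Long-term importance: As AI technology advances, this work may grow in importance over the next six months to several years.
Business potential: Standardized tool use and improved performance may create new business opportunities.
Potential impact on Japan: High: For Japan's AI research community and industry, standardized tool use and improved performance may help strengthen competitiveness.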

arXiv:2604.11557v2 Announce Type: replace

Abstract: Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query-Action-Observation-Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.
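
To make the QAOA representation and the three evaluation levels more concrete, here is a minimal Python sketch. It is an illustration only: the abstract does not give the actual schema or metric definitions, so the `Call` and `Turn` layouts, the helper names, and the exact-match scoring rules below are all assumptions, and the scores are not the paper's Strict Precision. The sketch treats calls within one step as parallel (order ignored) and steps within a turn as serial (order enforced).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Call:
    """One function call: a tool name plus its arguments (hypothetical layout)."""
    name: str
    arguments: tuple  # sorted (key, value) pairs, so keyword order is irrelevant


def make_call(name, **arguments):
    return Call(name, tuple(sorted(arguments.items())))


@dataclass
class Turn:
    """One Query-Action-Observation-Answer turn (assumed field layout)."""
    query: str
    steps: list = field(default_factory=list)         # Action: serial steps, each a list of parallel Calls
    observations: list = field(default_factory=list)  # Observation: tool outputs (not scored here)
    answer: str = ""                                  # Answer: final reply for the turn


def _bag(step):
    # Multiset of calls in one step, so parallel calls compare order-insensitively.
    return Counter(repr(call) for call in step)


def call_precision(pred, gold):
    """Function-call level: share of predicted calls that exactly match a gold call in the same step."""
    matched = total = 0
    for i, step in enumerate(pred.steps):
        gold_bag = _bag(gold.steps[i]) if i < len(gold.steps) else Counter()
        matched += sum((_bag(step) & gold_bag).values())
        total += len(step)
    if total == 0:
        return 1.0 if not any(gold.steps) else 0.0
    return matched / total


def turn_correct(pred, gold):
    """Turn level: every serial step matches exactly."""
    return [_bag(s) for s in pred.steps] == [_bag(s) for s in gold.steps]


def conversation_correct(pred_turns, gold_turns):
    """Conversation level: every turn is correct."""
    return len(pred_turns) == len(gold_turns) and all(
        turn_correct(p, g) for p, g in zip(pred_turns, gold_turns)
    )


# Made-up two-turn example: turn 1 uses parallel calls, turn 2 uses serial calls.
gold = [
    Turn("Weather in Paris and Rome?",
         steps=[[make_call("get_weather", city="Paris"), make_call("get_weather", city="Rome")]]),
    Turn("Book me a flight to the warmer one.",
         steps=[[make_call("search_flights", dest="Rome")], [make_call("book_flight", flight_id="F1")]]),
]
pred = [
    Turn(gold[0].query,  # same calls in a different order: still correct
         steps=[[make_call("get_weather", city="Rome"), make_call("get_weather", city="Paris")]]),
    Turn(gold[1].query,  # wrong argument in the second serial step
         steps=[[make_call("search_flights", dest="Rome")], [make_call("book_flight", flight_id="F2")]]),
]

print([call_precision(p, g) for p, g in zip(pred, gold)])  # [1.0, 0.5]
print([turn_correct(p, g) for p, g in zip(pred, gold)])    # [True, False]
print(conversation_correct(pred, gold))                    # False
```

In this toy example a single wrong argument halves the call-level score for the second turn and fails both that turn and the whole conversation, which illustrates why reporting all three levels gives a finer-grained picture than a single aggregate number.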