Searching the Internet for Challenging Benchmarks at Scale
Original title: Searching the Internet for Challenging Benchmarks at Scale
Analysis results
- Category
- Space
- Importance
- 53
- Trend score
- 12
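- Summary
- This article argues for searching the Internet at scale to find challenging benchmarks. It focuses on measuring the performance of machine learning models across a range of datasets and evaluation metrics, which gives researchers and developers information they can use to develop and improve algorithms.
- Keywords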
arXiv:2509.26619v2 Announce Type: replace-cross Abstract: Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation. We validate on machine translation and knowledge question answering, confirming that discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.
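
The abstract frames topic discovery as a multi-armed bandit solved with an epsilon-greedy strategy, but gives no implementation details. The following is a minimal sketch of a generic epsilon-greedy loop over topics, not the authors' code. The function name `epsilon_greedy_search`, the `evaluate` callback (a stand-in for the expensive sample-and-evaluate query), the difficulty scale of 0 to 1, and all parameter values are assumptions made for illustration.

```python
import random


def epsilon_greedy_search(topics, evaluate, budget, epsilon=0.1, seed=0):
    """Spend `budget` evaluations looking for the hardest topics.

    `evaluate(topic)` is a hypothetical stand-in for one expensive
    sample-and-evaluate query. It is assumed to return a difficulty
    score where higher means harder for the model under test.
    """
    rng = random.Random(seed)
    pulls = {}  # topic -> number of evaluations so far
    means = {}  # topic -> running mean of observed difficulty

    for _ in range(budget):
        if not means or rng.random() < epsilon:
            # Explore: pick a topic uniformly at random.
            topic = rng.choice(topics)
        else:
            # Exploit: re-sample the hardest topic observed so far.
            topic = max(means, key=means.get)

        score = evaluate(topic)
        n = pulls.get(topic, 0) + 1
        old_mean = means.get(topic, 0.0)
        pulls[topic] = n
        means[topic] = old_mean + (score - old_mean) / n  # incremental mean

    # Topics that were evaluated at least once, hardest first.
    ranked = sorted(means, key=means.get, reverse=True)
    return ranked, pulls


if __name__ == "__main__":
    # Toy environment (assumed): 200 topics, a handful of them hard.
    env_rng = random.Random(1)
    topics = [f"topic_{i}" for i in range(200)]
    true_difficulty = {t: env_rng.uniform(0.0, 0.3) for t in topics}
    for hard in ("topic_7", "topic_42", "topic_150"):
        true_difficulty[hard] = 0.9

    def noisy_evaluate(topic):
        # Noisy observation of the hidden difficulty.
        return true_difficulty[topic] + env_rng.uniform(-0.05, 0.05)

    ranked, pulls = epsilon_greedy_search(
        topics, noisy_evaluate, budget=300, epsilon=0.3
    )
    print("topics touched:", len(pulls), "of", len(topics))
    print("hardest topics found:", ranked[:3])
```

The sketch only ranks topics it has actually sampled, so the share of the space it touches is governed by `budget` and `epsilon`. How the paper samples documents within a topic, scores difficulty, or sets its exploration rate is not stated in the abstract and is not modeled here.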