Massive Multitask Language Understanding
Analysis Results
- Category: AI
- Importance: 60
- Trend score: 24
- Summary: Massive Multitask Language Understanding (MMLU) is a benchmark dataset for evaluating language-understanding ability. It is designed to measure model performance across a wide variety of tasks and plays an important role in language-processing research.
- Keywords
Massive Multitask Language Understanding (MMLU) is a benchmark dataset designed to evaluate the multitask accuracy and general knowledge of large language models through a diverse set of multiple-choice questions covering 57 subjects across academic and professional domains, including elementary mathematics, U.S. history, computer science, law, and clinical knowledge. [1] Introduced in 2020 by Dan Hendrycks and colleagues from the University of California, Berkeley, and Stanford University, MMLU consists of 15,908 questions compiled from various sources to test reasoning and factual recall without relying on memorization of training data. [1] [2] The benchmark is structured into development, validation, and test sets, with the development set used for few-shot prompting to simulate real-world model adaptation. [2]

Since its publication in the paper "Measuring Massive Multitask Language Understanding" at the International Conference on Learning Representations (ICLR) 2021, MMLU has emerged as a foundational metric for assessing the capabilities of advanced AI models, highlighting gaps in their understanding of specialized knowledge and promoting improvements in multitask performance. [3] Key features include its broad subject coverage, spanning the humanities, social sciences, STEM fields, and more, which ensures comprehensive evaluation beyond narrow tasks, and its emphasis on zero-shot or few-shot learning to measure genuine comprehension rather than rote learning. [1] Researchers have noted that MMLU's questions are sourced from real-world exams and textbooks, making the benchmark a robust proxy for professional-level expertise, though it has faced critiques for potential cultural biases in non-STEM subjects. [4] The benchmark's influence extends to leaderboards and evaluations across the AI community, where top-performing models such as those from OpenAI and Google have been benchmarked against human expert baselines, with scores exceeding 90% accuracy as of December 2025, depending on model size and training. [4] Ongoing developments include extensions such as MMLU-Pro, which introduces harder questions to better differentiate high-performing models, underscoring MMLU's role in driving advancements in scalable oversight and general intelligence for language models. [5]

Background

Introduction and Purpose

The Massive Multitask Language Understanding (MMLU) benchmark is a comprehensive evaluation framework designed to test the multitask accuracy and general intelligence of language models through a collection of 57 tasks comprising 15,908 multiple-choice questions. These questions assess knowledge and reasoning abilities across a wide range of difficulty levels, from elementary to advanced professional expertise. [1] The primary purpose of MMLU is to measure a model's capability to perform zero-shot or few-shot learning on diverse tasks without requiring task-specific fine-tuning, thereby emphasizing broad, emergent intelligence rather than narrow, specialized performance. This approach highlights the model's ability to generalize across unrelated domains, providing a robust indicator of its overall language understanding in multitask settings. [1]
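To make the zero-shot and few-shot protocols concrete, the sketch below shows one common way a k-shot MMLU prompt is assembled: up to k solved development-set examples are prepended to the unanswered test question, and the model is expected to complete the final "Answer:" line with a single letter. The header string and the dict field names here are illustrative assumptions, not the paper's verbatim template.

```python
# Minimal sketch of k-shot prompt construction for an MMLU-style task.
# The layout mirrors common MMLU harnesses; exact strings are illustrative.

CHOICES = ["A", "B", "C", "D"]

def format_example(question: str, options: list[str], answer: str | None = None) -> str:
    """Render one question block; include the gold letter only for solved demos."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    # For the test question we stop at "Answer:" so the model completes it.
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_prompt(subject: str, dev_examples: list[dict], test_example: dict, k: int = 5) -> str:
    """Prepend up to k solved dev-set examples before the unanswered test question."""
    header = ("The following are multiple choice questions "
              f"(with answers) about {subject}.\n\n")
    demos = "\n\n".join(
        format_example(ex["question"], ex["options"], ex["answer"])
        for ex in dev_examples[:k]
    )
    test_block = format_example(test_example["question"], test_example["options"])
    return header + (demos + "\n\n" if demos else "") + test_block
```

Setting k=0 yields the zero-shot protocol, while k=5 reproduces the standard 5-shot setting, since the development set supplies exactly five solved examples per task.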
Key distinguishing features of MMLU include its coverage of high school, college, and professional-level subjects, its standardization of evaluation through a consistent four-option multiple-choice format, and a scale that surpasses prior benchmarks such as GLUE by expanding to a massive multitask scope for more holistic assessment. Introduced in the 2020 paper "Measuring Massive Multitask Language Understanding" by Dan Hendrycks and colleagues, MMLU emerged during the rapid advancement of large language models like GPT-3, aiming to offer a more comprehensive test of "massive" multitask capabilities beyond existing narrow evaluations. [1] The benchmark spans diverse subject areas including the humanities, social sciences, STEM, and other fields, serving as a foundational metric for AI progress. [1]

Development History

The Massive Multitask Language Understanding (MMLU) benchmark was developed by a team of researchers led by Dan Hendrycks from the University of California, Berkeley, along with co-authors Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. [2] [1] This collaborative effort aimed to create a comprehensive evaluation framework for assessing the multitask capabilities of large language models, particularly those scaling beyond 100 billion parameters, for which existing benchmarks like GLUE and SuperGLUE were becoming saturated and less informative. [1] [3] The foundational paper, "Measuring Massive Multitask Language Understanding," was posted as an arXiv preprint on September 7, 2020. [1] The work was refined through peer review and presented at the International Conference on Learning Representations (ICLR) in 2021, marking the benchmark's formal release and its initial implementation as an open-source resource. [3] [2]

Motivated by the rapid growth in model sizes and the need for benchmarks that could reliably test general knowledge and reasoning across diverse domains without ceiling effects, the team curated over 15,000 multiple-choice questions from existing academic exams, textbooks, and professional certification materials spanning 57 subjects. [1] This curation process involved manual verification by domain experts to ensure question quality, factual accuracy, and diversity of coverage, and the dataset was open-sourced under a permissive license to facilitate widespread adoption in AI research. [1] [2] Initial challenges during development included balancing question difficulty to avoid skewing evaluations toward easier tasks, and mitigating the risk of data contamination, whereby models might have encountered training data scraped from sources overlapping with the benchmark questions. [1] [3] The researchers addressed these by selecting questions from high-school, college, and professional-level sources that were less likely to appear in common web corpora, and by emphasizing tasks that require genuine reasoning over rote memorization. [1] These efforts established MMLU as a robust, scalable metric that has since become a cornerstone for evaluating language model progress. [2]
Dataset Composition

Overall Structure

The Massive Multitask Language Understanding (MMLU) dataset is structured as a comprehensive benchmark comprising 57 distinct tasks and a total of 15,908 multiple-choice questions, divided into a few-shot development set (5 questions from each of the 57 tasks), a validation set (1,540 questions), and a test set (14,079 questions). [6] [7] Each task typically contains between 300 and 1,000 questions overall, ensuring a robust sample size for evaluation while allowing variability across tasks. [8] Organizationally, the tasks are hierarchically grouped into four primary categories: humanities, social sciences, STEM (science, technology, engineering, and mathematics), and other professional fields, facilitating broad multitask assessment of language models. [6] This structure supports evaluation in both zero-shot settings, where models receive no task-specific examples, and 5-shot settings, where five example questions per task are provided to gauge in-context learning capabilities. [4]

The dataset is designed to promote balanced and diverse multitask evaluation, requiring models to address all tasks without any fine-tuning or adaptation, thereby testing general knowledge and reasoning across domains. [6] It incorporates questions of varying difficulty, ranging from high school to professional or PhD-equivalent expertise, to comprehensively probe model performance on both foundational and advanced topics. [9] Technically, the MMLU dataset is released in JSON format, allowing easy parsing and integration into evaluation pipelines, with questions anonymized to mitigate data-leakage risks and ensure fair testing. [10]

Question Format and Sourcing

The questions in the Massive Multitask Language Understanding (MMLU) benchmark follow a standardized multiple-choice format designed to facilitate automated evaluation of language models. Each question consists of a stem presenting the problem or query, followed by four answer choices labeled A through D, with exactly one correct answer. Models are prompted to select the correct choice by generating text output that matches the corresponding letter (e.g., "A"), which simplifies parsing and scoring while minimizing errors in interpretation. [1] [9]

The questions are sourced from a variety of established materials to ensure broad coverage of academic and professional knowledge, including textbooks, standardized college-level exams such as AP tests and GRE subject tests, and professional certification resources. This approach draws on real-world educational and testing contexts to promote reproducibility and relevance. To maintain accuracy, the dataset underwent manual verification and curation by the authors, with Amazon Mechanical Turk used to establish a non-expert human baseline and help confirm the difficulty level. [1] Quality controls were rigorously applied during dataset construction, including the removal of ambiguous, outdated, or erroneous questions identified through manual review and error analysis. Efforts were also made to avoid direct copies from common language model training data, such as widely available web texts, to better assess true generalization and reasoning rather than memorization. Although most questions emphasize factual recall, some incorporate variations such as abstract reasoning or scenario-based prompts.
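Because each question has exactly one gold letter among A through D, scoring reduces to extracting a letter from the model's completion and averaging accuracy, conventionally reported per task and as an unweighted mean over the 57 tasks. The sketch below assumes a caller-supplied `generate(prompt) -> str` model call (hypothetical here); the first-standalone-letter parsing rule is a common harness convention, not something the benchmark itself mandates.

```python
import re
from collections import defaultdict

# Sketch of MMLU-style scoring. `generate` is a caller-supplied model
# call (hypothetical) that returns the raw text completion for a prompt.

LETTER_RE = re.compile(r"\b([ABCD])\b")

def extract_choice(completion: str) -> str | None:
    """Take the first standalone A-D token in the completion as the prediction."""
    match = LETTER_RE.search(completion)
    return match.group(1) if match else None

def score(examples, generate) -> tuple[dict[str, float], float]:
    """examples: iterable of dicts with 'task', 'prompt', and a gold 'answer' letter."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for ex in examples:
        pred = extract_choice(generate(ex["prompt"]))
        total[ex["task"]] += 1
        correct[ex["task"]] += int(pred == ex["answer"])
    per_task = {task: correct[task] / total[task] for task in total}
    # Report the unweighted mean over tasks, the convention commonly used for MMLU.
    macro_avg = sum(per_task.values()) / len(per_task)
    return per_task, macro_avg
```

An alternative convention used by some evaluation harnesses compares the model's likelihood of each of the four answer letters instead of parsing free-form text, which sidesteps extraction failures entirely.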