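Standardized test
Original title: Standardized test
Analysis results
- Category: AI
- Importance: 60
- Trend score: 24
- Summary: A standardized test is an assessment method that is administered, scored, and interpreted under uniform conditions so that the evaluation is reliable.
- Keywords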
Standardized test (Grokipedia). Fact-checked by Grok 3 months ago.

A standardized test is an assessment administered, scored, and interpreted under uniform conditions to permit reliable comparisons of performance across test-takers, typically involving fixed content, time limits, and scoring rubrics derived from empirical norming or criterion-referencing. [1] These tests emerged in the early 20th-century United States as tools for efficiently sorting students by ability amid expanding public education systems, evolving from rudimentary civil service exams to widespread use in K-12 accountability, college admissions, and professional licensing. [2] Empirically, standardized tests demonstrate strong predictive validity for academic and occupational outcomes, often outperforming alternatives like high school grades in forecasting college GPA and graduation rates due to their resistance to grade inflation and subjective bias. [3] [4] Despite controversies alleging cultural or socioeconomic bias (claims frequently amplified in academic discourse but undermined by longitudinal data showing consistent validity across demographic groups), they enable merit-based selection by quantifying cognitive skills causally linked to complex task performance, though critics argue they incentivize narrow curriculum focus at the expense of broader learning. [1] [4]

Definition and Core Principles

Definition and Purpose

A standardized test is an assessment that requires all test-takers to answer the same questions, or a selection from a common question bank, under uniform administration and scoring procedures to enable consistent comparison of performance across individuals or groups. [5] This standardization ensures that variations in results reflect differences in abilities rather than discrepancies in testing conditions, with reliability established through empirical validation on large representative samples.
[6] Such tests are typically objective, often featuring formats like multiple-choice items that minimize subjective scoring, though they may include constructed-response elements scored via rubrics. [5]

The core purpose of standardized testing is to measure specific knowledge, skills, or aptitudes against established norms or criteria, facilitating objective evaluations for decision-making in education, employment, and certification. [7] Norm-referenced tests compare individuals to a peer group, yielding percentile ranks or standard scores derived from a normal distribution, while criterion-referenced tests assess mastery of predefined standards independent of others' performance. [5] These instruments support high-stakes applications, such as college admissions via exams like the SAT, where over 1.9 million U.S. students participated in 2023 to demonstrate readiness, or accountability measures under policies like No Child Left Behind, which mandated annual testing in reading and mathematics for grades 3-8 from 2002 onward to track proficiency rates. [8] By providing quantifiable data, standardized tests inform resource allocation, curriculum adjustments, and identification of achievement gaps, though their validity depends on alignment with intended constructs and avoidance of cultural biases confirmed through psychometric analysis. [9] [6]

In professional contexts, standardized tests serve selection and licensure functions, such as the Graduate Record Examination (GRE) used by over 300 graduate programs annually to predict academic success, or civil service exams that screened applicants for U.S. federal positions since the Pendleton Act of 1883, reducing patronage by prioritizing merit-based scoring. [8] Overall, their design promotes fairness by mitigating evaluator bias, enabling large-scale assessments that individual judgments cannot match in scalability or comparability.
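The distinction between norm-referenced and criterion-referenced interpretation can be made concrete with a small calculation. The following is a minimal sketch, not the scoring method of any actual test: the norm-group mean of 500, the standard deviation of 100, and the function names are all hypothetical, and the percentile rank assumes scores in the norm group follow a normal distribution. The stanine bands use the conventional half-standard-deviation cut points.

```python
from bisect import bisect_right
from statistics import NormalDist

# Hypothetical norm-group parameters (illustrative values, not from a real test).
NORM_MEAN = 500.0
NORM_SD = 100.0

# Conventional stanine boundaries in z-score units (bands half an SD wide).
STANINE_CUTS = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]


def z_score(raw, mean=NORM_MEAN, sd=NORM_SD):
    """Standard score: distance from the norm-group mean in SD units."""
    return (raw - mean) / sd


def percentile_rank(raw, mean=NORM_MEAN, sd=NORM_SD):
    """Percent of the norm group expected to score at or below `raw`,
    assuming normally distributed scores."""
    return 100.0 * NormalDist(mean, sd).cdf(raw)


def stanine(raw, mean=NORM_MEAN, sd=NORM_SD):
    """Map a raw score to a stanine from 1 to 9 using the z-score cut points."""
    return bisect_right(STANINE_CUTS, z_score(raw, mean, sd)) + 1


def meets_criterion(raw, cut_score):
    """Criterion-referenced decision: compare against a fixed cut score only,
    ignoring how other test-takers performed."""
    return raw >= cut_score


if __name__ == "__main__":
    for raw in (300, 500, 600, 700):
        print(raw, z_score(raw), round(percentile_rank(raw), 1), stanine(raw))
```

Under these assumed parameters, a raw score of 600 sits one standard deviation above the mean, which corresponds to roughly the 84th percentile and a stanine of 7. The criterion-referenced check, by contrast, returns the same answer for a given score no matter how the norm group performed.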
[10]

Key Characteristics of Standardization

Standardization in testing refers to the establishment of uniform procedures for test administration, scoring, and interpretation to ensure comparability of results across test-takers. This process mandates that all examinees encounter identical or statistically equivalent test items, receive the same instructions, adhere to consistent time limits, and complete the assessment under comparable environmental conditions, such as quiet settings and supervised proctoring. [5] [11] Such uniformity minimizes extraneous variables that could influence performance, enabling scores to reflect inherent abilities or knowledge rather than situational differences. [12]

A core feature is objective scoring, where responses are evaluated using predetermined criteria that reduce or eliminate subjective judgment, often through machine-readable formats like multiple-choice items or automated essay scoring algorithms calibrated against human benchmarks. This objectivity contrasts with teacher-made assessments, where variability in grading can introduce bias; standardized tests achieve high inter-rater reliability, typically exceeding 0.90 in psychometric evaluations, by employing fixed answer keys or rubrics validated through empirical trials. [13] Equivalent forms (alternate versions of the test with parallel difficulty and content) are developed and equated statistically to prevent advantages from prior exposure, ensuring fairness in repeated administrations such as annual proficiency exams. [14]

Norming constitutes another essential characteristic, involving the administration of the test to a large, representative sample of the target population (often thousands of people, stratified by age, gender, socioeconomic status, and geography) to derive percentile ranks, standard scores, or stanines that contextualize individual performance. For instance, norms for aptitude tests like the SAT are updated periodically using samples exceeding 1 million U.S.
high school students to reflect demographic shifts and maintain relevance. [15] This process relies on psychometric techniques, including item response theory, to calibrate difficulty and discriminate among ability levels, yielding reliable metrics where test-retest correlations often surpass 0.80 over short intervals. [16] Without rigorous norming, scores lack interpretive validity, as evidenced by historical revisions to IQ tests that adjusted for the Flynn effect, a documented 3-point-per-decade rise in scores due to environmental factors. [17]

Finally, standardization incorporates safeguards for accessibility and equity, such as accommodations for disabilities (e.g., extended time verified through empirical validation studies) while preserving test integrity, and ongoing validation against external criteria like academic outcomes to confirm predictive utility. These elements collectively underpin the test's reliability (consistency of scores under repeated conditions) and validity (alignment with intended constructs), the hallmarks of psychometric soundness. [18] [19]

Historical Development

Ancient and Early Modern Origins

The earliest known system of standardized testing emerged in ancient China during the Han dynasty (206 BCE–220 CE), where initial forms of merit-based selection for government officials involved recommendations and rudimentary assessments of scholarly knowledge, primarily drawn from Confucian texts. [20] This evolved into a more formalized examination process by the Sui dynasty (581–618 CE), with Emperor Wen establishing the first imperial examinations in 605 CE to recruit civil servants based on uniform evaluations of candidates' mastery of classical literature, ethics, and administrative skills.
[21] These tests were administered nationwide at provincial, metropolitan, and palace levels, featuring standardized formats such as essay writing on prescribed topics from the Five Classics and policy memoranda, with anonymous grading to minimize favoritism and corruption. [22] By the Tang dynasty (618–907 CE), the system had standardized further, emphasizing rote memorization, poetic composition, and interpretive analysis under timed conditions, serving as a meritocratic tool for social mobility that bypassed hereditary privilege in favor of demonstrated competence. [23] Success rates were low, with only about 1–5% of candidates passing the highest levels across dynasties, reflecting rigorous norming against elite scholarly standards. The Song dynasty (960–1279 CE) refined the process with printed question papers and multiple-choice elements in some sections, increasing scale to thousands of examinees per cycle and institutionalizing it as a cornerstone of bureaucratic selection. [23]

In contrast, ancient Western traditions, such as those in Greece and Rome, relied on non-standardized oral examinations and rhetorical displays rather than uniform written tests. Greek education in city-states like Athens involved assessments through debates and recitations evaluated subjectively by teachers, prioritizing dialectical skills over quantifiable metrics. [24] Roman systems similarly featured public orations and legal disputations for entry into professions, lacking the centralized, anonymous scoring of Chinese exams. [24]

During the early modern period in China (Ming and Qing dynasties, 1368–1912 CE), the keju system persisted with enhancements like stricter content uniformity and anti-cheating measures, such as secluded testing halls, testing up to 10,000 candidates per session and maintaining predictive validity for administrative roles through empirical correlations with performance in office.
In Europe, early modern assessments remained predominantly oral or essay-based in universities, with no widespread adoption of standardized formats until the 19th century, when British administrators drew indirect inspiration from Chinese models for colonial civil services. [25]

19th and Early 20th Century Innovations

In the mid-19th century, educational reformers in the United States began transitioning from oral examinations to standardized written assessme