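Statistics
Original title: Statistic
Analysis results
- Category
- AI
- Importance
- 54
- Trend score
- 18
- Summary
- A statistic is a numerical quantity calculated from a sample of data that describes or summarizes a characteristic of that sample.
- Keywords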
Statistic - Grokipedia
Fact-checked by Grok 3 months ago

A statistic is a numerical quantity calculated from a sample of data that describes or summarizes a characteristic of that sample, such as its central tendency, variability, or distribution. [1] In contrast, a parameter is a corresponding numerical value that describes a characteristic of the entire population from which the sample is drawn, though parameters are typically unknown and must be estimated using statistics. [1] This distinction is fundamental in statistical inference, where sample statistics serve as estimators for population parameters to make generalizations about larger groups based on limited data. [2]

Common examples of statistics include the sample mean (the average value in the sample), the sample proportion (the fraction of the sample exhibiting a particular trait), the sample median (the middle value when data are ordered), and the sample standard deviation (a measure of data spread around the mean). [1] These are computed directly from observable sample data and are essential tools in descriptive statistics for summarizing datasets. [3] In inferential statistics, such measures enable hypothesis testing, confidence interval construction, and prediction, allowing researchers to draw reliable conclusions about populations despite sampling variability. [4]

The use of statistics is pivotal in data analysis across disciplines, as they transform raw data into interpretable insights, quantify uncertainty, and support evidence-based decision-making in fields like science, economics, medicine, and social sciences. [5] For instance, in clinical trials, sample statistics such as means and proportions help evaluate treatment efficacy by estimating population-level effects. [6] Advances in computational tools have further amplified their role, facilitating the analysis of massive datasets while maintaining rigorous statistical principles to ensure validity and reproducibility. [7]

Fundamentals

Definition

A statistic is any measurable function of the observations in a random sample drawn from a population, typically denoted as $T(\mathbf{X})$, where $\mathbf{X} = (X_1, X_2, \dots, X_n)$ represents the sample of $n$ independent and identically distributed random variables from the underlying probability distribution. [8] This function transforms the raw sample data into a summary value that captures essential features of the sample without invoking knowledge of the population's unknown characteristics. [9]

Statistics serve as the foundational tools in statistical analysis for describing and summarizing sample data, enabling inferences about the broader population while remaining agnostic to specific distributional assumptions beyond the sample's randomness. [10] For instance, they provide quantifiable measures of central tendency, variability, or other properties directly from the observed values, facilitating data reduction and pattern recognition in empirical studies.

The term "statistic," in its modern sense as a sample-derived quantity distinct from a population parameter, was introduced by R. A. Fisher in his 1922 paper "On the Mathematical Foundations of Theoretical Statistics," where he described such functions as "statistical derivates... which are designed to estimate the values of the parameters of the hypothetical population." [11] This distinction emphasized the role of statistics in estimation, separating observable sample summaries from fixed but unknown population traits.
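The definition of a statistic as a function $T(\mathbf{X})$ of the sample can be illustrated with a minimal Python sketch. The type alias `Statistic`, the helper `summarize`, and the five-value sample are hypothetical names and data chosen for illustration; they do not come from the article.

```python
from typing import Callable, Dict, Sequence

# Hypothetical type alias: a statistic maps an observed sample to one number.
Statistic = Callable[[Sequence[float]], float]


def sample_mean(x: Sequence[float]) -> float:
    """Arithmetic average of the observations."""
    return sum(x) / len(x)


def sample_range(x: Sequence[float]) -> float:
    """Largest observation minus the smallest."""
    return max(x) - min(x)


def summarize(x: Sequence[float], stats: Dict[str, Statistic]) -> Dict[str, float]:
    """Apply each statistic T to the same sample and collect the values T(x)."""
    return {name: t(x) for name, t in stats.items()}


# Made-up sample of n = 5 observations (illustrative data only).
sample = [2.0, 4.0, 4.0, 5.0, 10.0]
summary = summarize(sample, {"mean": sample_mean, "range": sample_range, "max": max})
print(summary)
```

Each function takes only the observed values as input, which reflects the point that a statistic is computed without knowledge of the population's unknown characteristics.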
The concept of a statistic presupposes familiarity with random variables (stochastic entities that model uncertain outcomes) and probability distributions, which specify the likelihood of those outcomes and form the basis for sampling from populations. [12] These building blocks ensure that statistics inherit probabilistic properties from the sample, allowing for rigorous analysis of their behavior under repeated sampling. [13]

Relation to Parameters

In statistics, a key distinction exists between a parameter and a statistic: a parameter is a fixed but typically unknown numerical characteristic that describes an entire population, such as the population mean, whereas a statistic is a calculable value derived from a sample of data that summarizes the sample's properties. [1] [2] This separation underscores the inferential role of statistics, as they provide observable approximations to otherwise inaccessible population parameters based on limited data. [14]

Statistics play a central role in point estimation, where they function as plug-in estimators to approximate unknown parameters; for instance, the sample mean serves as an estimator for the population mean $\mu$. [15] [16] In general, an estimator $\hat{\theta}$ for a parameter $\theta$ is expressed as a function of the sample data $X$, denoted mathematically as $\hat{\theta} = T(X)$, where $T$ is the estimating function that transforms the observed sample into a point estimate. [17] [18] This formulation allows statisticians to use sample-based computations as direct substitutes for population values in practical analysis.
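As a rough illustration of $\hat{\theta} = T(X)$, the sketch below computes two plug-in point estimates in Python: the sample mean for a population mean and the sample proportion for a population proportion. The function names and both data sets are assumptions made up for this example.

```python
from typing import Sequence


def estimate_mean(x: Sequence[float]) -> float:
    """Plug-in estimate of the population mean mu: the sample mean."""
    return sum(x) / len(x)


def estimate_proportion(x: Sequence[bool]) -> float:
    """Plug-in estimate of a population proportion: the fraction of True values."""
    return sum(1 for flag in x if flag) / len(x)


# Hypothetical observed sample: eight measurements and eight yes/no responses.
measurements = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 5.0]
responses = [True, False, True, True, False, True, False, True]

mu_hat = estimate_mean(measurements)    # point estimate of the unknown mu
p_hat = estimate_proportion(responses)  # point estimate of the unknown proportion
print(mu_hat, p_hat)
```

In both cases the estimate depends only on the observed sample, while the parameter it targets remains unknown.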
Within the framework of frequentist inference, statistics enable the approximation of parameters through the concept of repeated sampling: under this approach, parameters are treated as fixed unknowns, and the behavior of a statistic across hypothetical repeated samples from the population provides a basis for assessing how well the statistic approximates the true parameter value. [19] [20] This repeated-sampling perspective grounds the reliability of estimation by evaluating long-run performance, thereby linking observable sample statistics to inferences about the broader population. [15]

Examples

Univariate Statistics

Univariate statistics encompass descriptive measures applied to a single variable in a sample, providing summaries of central tendency and dispersion to facilitate data interpretation and analysis. These statistics are computed directly from the observed data points, offering practical insights into the distribution without assuming an underlying population model beyond basic ordering or arithmetic. Common univariate statistics include measures of location such as the sample mean, median, and mode, alongside dispersion metrics like the sample variance, range, and interquartile range.

The sample mean, denoted $\bar{X}$, serves as a primary measure of central tendency, representing the arithmetic average of the sample values. It is calculated using the formula

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i,$$

where $n$ is the sample size and $X_i$ are the individual observations. This statistic is particularly useful for symmetric distributions and when estimating the population mean $\mu$, as it weights each data point equally. For example, in a sample of test scores {70, 80, 90}, the sample mean is $\bar{X} = (70 + 80 + 90)/3 = 80$, indicating the average performance.
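The sample-mean formula and the repeated-sampling view described earlier in this section can be sketched together in Python. The Gaussian population with $\mu = 80$ and $\sigma = 10$, the sample size, the number of repetitions, and the seed are all assumptions picked for illustration, not values from the article.

```python
import random
from typing import List, Sequence


def sample_mean(x: Sequence[float]) -> float:
    """X-bar = (1/n) * sum of the observations."""
    return sum(x) / len(x)


# Worked example from the text: test scores {70, 80, 90}.
scores = [70, 80, 90]
mean_score = sample_mean(scores)  # (70 + 80 + 90) / 3


def repeated_sample_means(
    mu: float, sigma: float, n: int, repetitions: int, seed: int
) -> List[float]:
    """Draw `repetitions` samples of size n from an assumed Gaussian population
    and return the sample mean of each one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        sample_mean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(repetitions)
    ]


# Hypothetical population parameters, chosen only for this demonstration.
means = repeated_sample_means(mu=80.0, sigma=10.0, n=25, repetitions=2000, seed=0)
average_of_means = sample_mean(means)
print(mean_score, round(average_of_means, 2))
```

Individual sample means vary from draw to draw, but their average over many repetitions should sit close to the assumed $\mu$, which is the long-run behavior the frequentist view appeals to.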
The sample mean is sensitive to outliers, which can skew it away from the typical value. [21] [22]

Non-parametric measures of central tendency, such as the median and mode, are robust to outliers and do not rely on arithmetic means, making them suitable for skewed or ordinal data. The median is the middle value in an ordered sample, providing a measure of location that divides the data into equal halves. For an odd sample size $n = 2m + 1$, the median is the $(m+1)$th ordered value; for an even sample size $n = 2m$, it is the average of the $m$th and $(m+1)$th ordered values. Consider the ordered sample {1, 3, 5, 7, 9} (odd size): the median is 5. For {1, 3, 5, 7} (even size), the median is the average of the two central values, $(3 + 5)/2 = 4$. The mode is the value that occurs most frequently in the sample, useful for identifying peaks in categorical or discrete data; a sample is multimodal when several values share the highest frequency. In the sample {2, 2, 3, 4, 4}, there are two modes: 2 and 4. Unlike the mean, these measures prioritize ordering and frequency over summation, enhancing their applicability in non-normal distributions. [23] [24]

Measures of dispersion quantify the spread of the sample values around the center, essential for understanding data variability. The sample variance, denoted $s^2$, assesses the average squared deviation from the sample mean and is given by the unbiased estimator

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$

This formula uses $n-1$ in the denominator, known as Bessel's correction, to account for the degrees of freedom lost when estimating the mean from the sample itself, ensuring the statistic is unbiased for the population variance $\sigma^2$.
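The median, mode, and variance rules above can be written out as a short Python sketch. The function names are hypothetical and the implementations are hand-written to mirror the formulas; the inputs are the small samples used in the text.

```python
from collections import Counter
from typing import List, Sequence


def sample_median(x: Sequence[float]) -> float:
    """Middle ordered value (odd n) or average of the two central values (even n)."""
    ordered = sorted(x)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2


def sample_modes(x: Sequence[float]) -> List[float]:
    """All values that share the highest frequency, in ascending order."""
    counts = Counter(x)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


def sample_variance(x: Sequence[float]) -> float:
    """Unbiased sample variance: squared deviations divided by n - 1."""
    n = len(x)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(x) / n
    return sum((xi - mean) ** 2 for xi in x) / (n - 1)


print(sample_median([1, 3, 5, 7, 9]))   # odd-size sample from the text
print(sample_median([1, 3, 5, 7]))      # even-size sample from the text
print(sample_modes([2, 2, 3, 4, 4]))    # bimodal sample from the text
print(sample_variance([70, 80, 90]))    # test-score sample from the text
```

Python's standard `statistics` module provides `median`, `multimode`, and `variance`, which could be used in place of these hand-written versions.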
Bessel's correction, introduced by Friedrich Bessel in the context of least squares estimation for astronomical data in the early 19th century, corrects the underestimation bias inherent in dividing by $n$, which arises because the sample mean introduces dependency among the deviations. For the earlier test score sample {70, 80, 90}, $s^2 = \frac{1}{2}\left[(70-80)^2 + (80-80)^2 + (90-80)^2\right] = 100$, indicating moderate spread. [25] [26]

Simpler dispersion measures include the range and interquartile range (IQR), which avoid squaring and are computationally straightforward. The range is the difference between the maximum and minimum values in the sample, R = max ( X i ) − min (