
GLOSSARY OF STATISTICAL TERMS

Introduction

This glossary defines all of the statistical tests and terms, together with other economic and arithmetical terms and concepts, which are employed in the workshops and computer classes. In addition, you should consult the background papers which contain definitions and the underlying calculations for national income, labour market and balance of payments statistics.

Where appropriate the Excel functions are included in the following glossary entries. You should consult Judge (1990) and the Student's Excel for their use. The term q.v. indicates that reference is being made to a further term included elsewhere in this glossary.

Arithmetic mean
A measure of central tendency (q.v.). Defined as the sum (Σ) of a set of numbers (X) divided by the number of cases in the set (N) (q.v.):

\bar{X} = \frac{\sum X}{N}

The arithmetic mean is what most people have in mind when they talk of the average. Since there are various types of averages - median (q.v.), mode (q.v.), weighted and unweighted averages (q.v.) - it is best not to use the term average. Calculated by =average.

Average
A misleading term, best not used. See arithmetic mean; mode.

Central tendency, measures of
A set of statistical tests concerned with measuring the central point of data. See also dispersion, measures of.

Chi-square
Pronounced 'ky square', the χ² is an inferential statistic which measures the degree of independence (or dependence) of one variable on another. It is computed by comparing the joint frequency distribution observed in a contingency table with the expected joint frequency distribution that would be found if the two variables were not related, i.e. were independent (the null hypothesis (q.v.) being that the two variables are statistically unrelated). It is defined as:

\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

where R is the number of rows, C is the number of columns, i is the row subscript, j the column subscript, and O_{ij} and E_{ij} are respectively the observed and the expected values for each cell.

NB: special rules apply for 2 by 2 contingency tables (i.e. those with two rows and two columns) which, of course, provide only 1 degree of freedom. Special rules also apply when N<40, but as a general rule one should not attempt this sort of quantitative research with such a small sample.

The magnitude of the χ² depends on: the number of cases (N); the size of the contingency table; and the strength of the relationship between the variables. See Cramér's V2 measure of association.
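
By way of illustration, the calculation can be sketched in Python rather than Excel (a minimal sketch, not part of the original workshop materials; the 2 by 2 contingency table is invented):

observed = [[30, 10],
            [20, 40]]        # invented 2 by 2 contingency table

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # expected value for the cell under the null hypothesis of independence
        e = row_totals[i] * col_totals[j] / n
        chi_square += (o - e) ** 2 / e

print(round(chi_square, 2))  # 16.67 for these invented figures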

Coefficient of determination
In regression analysis (q.v.) this is a measure of causal association between Y and X, where the higher the number (maximum +1; minimum 0) the greater the association. More formally, it is a measure of the proportion of the total variation in Y that is associated with variations in X in the regression equation, Y = a + b.X. Thus a coefficient of determination of say 0.85 means that 85% of the variance (q.v.) in Y is attributable to the regression of Y on X and 15% is unexplained (so that the regression equation should be augmented to incorporate further likely causal factors). This is the R2 (q.v.) result produced by Excel; and, mathematically, it is the square of the correlation coefficient (q.v.).

Coefficient of variation
This is used to compare two or more samples of a population where the means (q.v.) differ significantly and thus where the standard deviation (q.v.) is an inadequate measure of dispersion. Defined as:

standard deviation
_______________* 100
mean
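
For example (with invented figures), a sample with a mean of 50 and a standard deviation of 5 has a coefficient of variation of (5/50)*100 = 10, which can be compared directly with a second sample having a mean of 200 and a standard deviation of 16 (a coefficient of 8), even though the latter's dispersion is greater in absolute terms.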

Compound growth rate
See geometric mean.

Confidence interval
One of the properties of the normal distribution is that 95 per cent of all values of a continuous, normally-distributed variable in a population fall within 1.96 standard deviations (q.v.) of the mean.

Correlation analysis
A procedure for examining the association between two or more variables. Correlation may be positive, negative or not present.

Correlation coefficient
More properly, the product-moment correlation coefficient. This can be defined in a number of ways, but is here represented in terms of standard deviations (q.v.), since the calculation entails the deviations of Y and X from their respective mean values. Defined as:

r = \frac{\sum XY - \frac{1}{N}\sum X \sum Y}{\sqrt{\left(\sum X^2 - \frac{1}{N}\left(\sum X\right)^2\right)\left(\sum Y^2 - \frac{1}{N}\left(\sum Y\right)^2\right)}}

This is a measure of the degree of linear association present between Y and X (maximum + 1; minimum -1).
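
A minimal Python sketch of this formula, with two invented series:

import math

x = [1, 2, 3, 4, 5]          # invented series
y = [2, 4, 5, 4, 5]
n = len(x)

numerator = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
denominator = math.sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
                        (sum(b * b for b in y) - sum(y) ** 2 / n))
r = numerator / denominator
print(round(r, 3))           # 0.775: a fairly strong positive association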

Cramér's V2 measure of association
Cramér's V2 measure of association divides the chi-square statistic (q.v.) χ² by the size of the table (L-1) and the number of cases (N) to isolate the strength of the relationship, where L is defined as the lesser of the number of rows or columns. It is defined as:

V^2 = \frac{\chi^2}{(L-1)N}

and its values fall within the range 0 to 1, with higher values indicating a stronger relationship.
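
A minimal Python sketch, reusing the invented chi-square example given above (a χ² of 16.67 from a 2 by 2 table of 100 cases):

chi_square = 16.67           # from the invented 2 by 2 example above
n = 100                      # number of cases
l = 2                        # the lesser of the number of rows and columns
v_squared = chi_square / ((l - 1) * n)
print(round(v_squared, 3))   # 0.167: a fairly modest relationship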

Cross-sectional data
Data recorded over a number of units (for example, households, firms, economies) within a particular period or at a single point in time.

Degrees of freedom
Defined as the difference between the total number of mathematical variables for a statistical model and the number of independent restrictions placed upon them. The smaller the sample the lower the degrees of freedom and the greater the possibility that the null hypothesis (q.v.) cannot be rejected at the usual significance levels.

Denominator
In a fraction this is the value or argument which is used to divide the numerator (q.v.). Thus in the following:

=sum(c11..c15)
______________
     (B$12)

=sum(c11..c15) is the numerator and (B$12) the denominator.

Dependency ratio
A term used to describe the proportion of the population who are economically inactive relative to the proportion of the population who are available to support them economically.

Dependent variable
In regression analysis (q.v.) this is the variable (Y) which is said to be dependent upon X, the independent variable (q.v.). For example, were we to explore the relationship between the number of cigarettes smoked and the incidence of lung cancer we would construct a regression in which the former was X and the latter Y since it would be nonsensical to postulate that lung cancer causes smoking. In practice it is unlikely that the incidence of lung cancer can be fully explained by cigarette consumption and thus additional independent variables would be included in the regression. This is known as multiple regression (q.v.).

Dispersion, measures of
A set of statistical tests concerned with measuring how closely data are clustered around the central point. See also central tendency, measures of.

Disturbance term
In regression analysis (q.v.) the disturbance term (u) (or residual, error or stochastic term) is incorporated in the regression equation since, in practice, it is unlikely that all XY points fall exactly on the linear relationship. This is assumed to be normally distributed, having a zero expected value or mean; a constant variance; and successive values which are uncorrelated or unrelated to each other.

The disturbance term cannot be directly observed, but û (u hat, where a 'hat' or ^ over a variable signifies the fitted value of a variable) can be computed, being the difference between the actual values of the dependent variable (Y) and the fitted values (Y hat) using the regression equation. Any marked pattern in the residuals may invalidate the assumptions underlying the regression model (q.v.). See Durbin-Watson statistic.

Durbin-Watson statistic
In regression analysis (q.v.) this is a test for autocorrelation, this being present when the disturbance term u (q.v.) in any period t is not independent of disturbances in other periods. Defined as:
 

DW = \frac{\sum_{t=2}^{T} (\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=1}^{T} \hat{u}_t^2}

where T is the number of observations.

The DW statistic lies in the range 0 to 4 and under the null hypothesis (q.v.) of no autocorrelation is expected to be equal to 2. As with the t-test (q.v.) there are critical values, again available from statistics textbooks, which are looked up once the available degrees of freedom (q.v.) have been derived.
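
A minimal Python sketch of the formula, with invented residuals:

residuals = [0.5, -0.3, 0.2, -0.4, 0.1, 0.3, -0.2]   # invented fitted residuals

numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                for t in range(1, len(residuals)))
denominator = sum(u ** 2 for u in residuals)
print(round(numerator / denominator, 2))   # 2.63: above 2, pointing towards negative autocorrelation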

Elasticity
The responsiveness of one variable to changes in another variable. Thus, for example, if the demand for cars rises by 2% in response to a 1% fall in the price of cars we would derive an (own price) elasticity of 2 whereas if demand had only risen by 0.5 per cent we would derive an elasticity of 0.5. In the former case we would describe demand as elastic and in the latter relatively inelastic.

Geometric mean
This is best illustrated by a numerical example. Consider the following for the British economy over a six year period:

Year    Nominal GDP    Indices       % growth rate on
        (£ million)    (1979=100)    previous year

1979    196,706        100.0
1980    230,602        117.2         17.2
1981    254,103        129.2         10.2
1982    276,409        140.5          8.8
1983    300,973        153.0          8.9
1984    320,120        162.7          6.4

Source: CSO (1987) United Kingdom national accounts, 1987 edition, table 1.2.

If asked to calculate the average growth rate for this period you might be tempted to take the arithmetic mean of the year-on-year percentage growth rates recorded in the table. This would yield the result 10.29%. However, this slightly overstates the growth of the economy, a bias which would widen the longer the period over which the calculation was being made. To prevent such bias, economists calculate the geometric growth rate which can be represented as follows:
 

Year    Calculation            Result

1979                           196,706
1980    196,706*1.102298       216,829
1981    216,829*1.102298       239,010
1982    239,010*1.102298       263,460
1983    263,460*1.102298       290,411
1984    290,411*1.102298       320,120

Here GDP rises in each successive year by the ratio (q.v.) 1.102298 (equivalent to a growth rate of 10.23%), which is derived as the Nth root of the product of the year-on-year growth ratios (here the fifth root, since there are five year-on-year changes). The procedure might be likened to that of calculating compound interest on a deposit account.
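
A minimal Python sketch of the calculation, using the GDP figures from the table above:

first, last = 196706, 320120   # nominal GDP, 1979 and 1984 (£ million)
intervals = 5                  # five year-on-year changes, 1979-84

ratio = (last / first) ** (1 / intervals)
print(round(ratio, 6))         # 1.102298, i.e. compound growth of 10.23% a year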

Index numbers
The object of an index number is to represent in a single series (the index) movements in a whole set of underlying series. They are a useful scaling device for comparisons over time and between variables. Consider the following prices for three agricultural products which are presented in the primary source in shillings and pence per imperial quarter weight (cols. 1-3):
 

 

        Wheat    Barley   Oats     Wheat    Barley   Oats     Wheat        Barley       Oats
        (s d)    (s d)    (s d)    (£)      (£)      (£)      (1771=100)   (1771=100)   (1771=100)
        (1)      (2)      (3)      (4)      (5)      (6)      (7)          (8)          (9)

1771    48  7    26  5    17  2    4.03     2.19     1.43     100.0        100.0        100.0
1772    52  3    26  1    16  8    4.35     2.17     1.37     107.9         99.1         95.8
1773    52  7    29  2    17  8    4.36     2.42     1.45     108.2        110.5        101.4
1774    54  3    29  4    18  4    4.51     2.43     1.52     111.9        111.0        106.3
1775    49 10    26  9    17  0    4.13     2.20     1.42     102.5        100.5         99.3

Source: B.R. Mitchell (1988) British historical statistics, p. 756.

From the raw data one can discern that the unit price of wheat and barley rose while that of oats fell between 1771-5, and that wheat prices were somewhat less than double those of barley and nearly treble those of oats. Transforming the prices into £s provides further insights, but not many. However, by a further transformation of the raw data into index numbers (1771 as base year) the relative and absolute movement in prices becomes much clearer. Thus from the table one can read off that wheat prices rose by 2.5% as against only 0.5% for barley, while the price of oats fell by 0.7%. The different paths of prices between 1771-5 for the three products are also much clearer.

The index numbers for wheat are calculated as follows:
 

Year    Price    Calculation of     Calculation of      Result: index
        (£)      index numbers      index numbers       (1771=100)
                                    in Excel
        (1)      (2)                (3)                 (4)

1771    4.03     (4.03/4.03)*100    =(B2/B$2)*100       100.0
1772    4.35     (4.35/4.03)*100    =(B3/B$2)*100       107.9
1773    4.36     (4.36/4.03)*100    =(B4/B$2)*100       108.2
1774    4.51     (4.51/4.03)*100    =(B5/B$2)*100       111.9
1775    4.13     (4.13/4.03)*100    =(B6/B$2)*100       102.5

In col. (3), which illustrates the derivation of index numbers in a spreadsheet, it is assumed that the 1771 price in £s is in cell B2.
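
A minimal Python sketch of the same transformation, using the wheat prices from the table:

prices = [4.03, 4.35, 4.36, 4.51, 4.13]            # wheat prices (£), 1771-5
base = prices[0]                                   # 1771, the base year
index = [round(p / base * 100, 1) for p in prices]
print(index)   # [100.0, 107.9, 108.2, 111.9, 102.5]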

See also weighted averages.

Independent variable
In regression analysis (q.v.) this is the variable (X) which is said to be independent of Y, the dependent variable (q.v.). For example, were we to explore the relationship between the number of cigarettes smoked and the incidence of lung cancer we would construct a regression in which the former was X and the latter Y since it would be nonsensical to postulate that lung cancer causes smoking. In practice it is unlikely that the incidence of lung cancer can be fully explained by cigarette consumption and thus additional independent variables would be included in the regression. This is known as multiple regression.

Logarithmic scales
In economic history our interest is often not so much in absolute but in relative changes between periods or variables. A standard X-Y or line graph, where the axes have equal intervals, can obscure such changes when working with series which have very different starting points. Consider the following graph where in 1900 public expenditure starts at roughly one-tenth of GDP.

[insert graph]

In such a representation it is very difficult to draw any conclusions from visual inspection about the relative growth rates of public expenditure and GDP before the onset of the First World War. Transforming the graph into semi-log form rectifies this problem.

[insert graph]

In this graph the intervals on the Y axis are represented in such a way that a rise in a variable from say £100 million to £200 million is exactly proportional to a rise in another variable from say £600 million to £1.2 billion. This is a ratio scale, which derives from the same principles underlying logarithms to the base 10, namely that the logarithm of 10 is 1.0000, of 100 is 2.0000, of 1000 is 3.0000 and so on. Unlike graphs with axes of equal interval scales, in log or ratio graphs you can infer growth rates from the slopes of the variables.

Maximum
In a range (q.v.) the maximum value. Calculated by =max.

Mean
See arithmetic mean.

Median
A measure of central tendency, defined as that value of a series which splits an ordered list of cases into two halves (i.e. the middle value, in the sense that an equal number of cases lie above this value as below it). With an ordered series of cases the position of the median is:

(N+1)/2   where there are an odd number of cases

With an even number of cases there is no single middle value, and the median is conventionally taken as the arithmetic mean of the two middle values (those in positions N/2 and N/2 + 1).

Can be calculated by =median but visual inspection of the result can be important.

Minimum
In a range (q.v.) the minimum value. Calculated by =min.

Mode
A measure of central tendency, defined as the value that occurs most often. It is often used in addition to the arithmetic mean (q.v.) when an economic historian wants to refer to a 'typical' value of a variable, such as the age at marriage. It has the disadvantage that no measure of dispersion (q.v.) is associated with it. Can be calculated by =mode but visual inspection of the result can be important.

Moving average
Time series (q.v.) consist of three elements: the trend, the cycle (or seasonal) and a random error term. Moving averages are frequently deployed by economic historians to cleanse a time series of the cyclical (or seasonal) element so that they can concentrate on the underlying trend. The following example shows the calculation of a three-period moving average based on monthly unemployment data for 1929 produced by the national insurance system in Britain (with cell B2 assumed to contain the January 1929 unemployment rate).
 

Month        Unemployment    Calculation of        Calculation of        Result: moving
             rate (%)        moving average        moving average        average
                                                   in Excel

January      12.2
February     12.1            (12.2+12.1+10.0)/3    =average(b2..b4)      11.4
March        10.0            (12.1+10.0+9.9)/3     =average(b3..b5)      10.7
April         9.9            (10.0+9.9+9.7)/3      =average(b4..b6)       9.9
May           9.7            (9.9+9.7+9.6)/3       =average(b5..b7)       9.7
June          9.6            (9.7+9.6+9.7)/3       =average(b6..b8)       9.7
July          9.7            (9.6+9.7+9.9)/3       =average(b7..b9)       9.7
August        9.9            (9.7+9.9+9.9)/3       =average(b8..b10)      9.8
September     9.9            (9.9+9.9+10.3)/3      =average(b9..b11)     10.0
October      10.3            (9.9+10.3+10.9)/3     =average(b10..b12)    10.4
November     10.9            (10.3+10.9+11.0)/3    =average(b11..b13)    10.7
December     11.0

Source: Department of Employment and Productivity (1971) British labour statistics,
             historical abstract, 1886-1968, table 160.

The effect of smoothing the raw data in this manner is shown in the following graph:

[insert graph]

It should be observed that in this example it is not possible to calculate the moving average for either January or December 1929. To do so for the former requires a December 1928 estimate and, for the latter, a January 1930 estimate.
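
A minimal Python sketch of the three-period moving average, using the unemployment rates from the table:

rates = [12.2, 12.1, 10.0, 9.9, 9.7, 9.6,
         9.7, 9.9, 9.9, 10.3, 10.9, 11.0]        # January-December 1929

moving_average = [round(sum(rates[i - 1:i + 2]) / 3, 1)
                  for i in range(1, len(rates) - 1)]
print(moving_average)   # [11.4, 10.7, 9.9, 9.7, 9.7, 9.7, 9.8, 10.0, 10.4, 10.7]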

Multiple regression
This builds upon the single two-variable linear model (see regression analysis and regression model) to incorporate additional explanatory (independent) variables.

N
The number of observations or cases in a series. Calculated by =count.

Nominal values
A variable denominated in current prices. See also real values.

Null hypothesis
Hypothesis testing is, in effect, a form of gambling on the basis of probabilities (see statistical significance). For any test of the statistical significance of a relationship, the null hypothesis is that the estimated coefficients are not significantly different from zero, i.e. that there is no relationship.

Numerator
In a fraction this is the value or argument which is divided by the denominator (q.v.). Thus in the following:

=sum(c11..c15)
______________
     (B$12)

=sum(c11..c15) is the numerator and (B$12) the denominator.

Pearson's modified coefficient of skewness
A measure of dispersion, defined as:

3*(Mean-Median)
________________
Standard deviation

Whereas the standard deviation (q.v.) or variance (q.v.) provides a measure of the variability of values around the mean, it gives no information about the shape of the distribution of those values. It is perfectly possible to have two series with identical means and standard deviations but very dissimilar frequency distributions. Pearson's modified coefficient of skewness is one method of mathematically describing the shape of a frequency distribution in terms of the relative position of the mean and median (q.v.). Thus a perfectly symmetrical distribution has a coefficient of 0 (the mean and median are identical); a positive coefficient indicates a right-skewed distribution (the mean to the right of the median); and a negative coefficient a left-skewed distribution (the mean to the left of the median). Two examples of right-skewed distributions are shown below:

[insert graph]

[insert graph]

Source: HMSO (1983) Monopolies and Mergers Commission, National Coal Board, Cmnd. 8920, vol. II, tables 3.5(a)-(l).
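
A minimal Python sketch of the coefficient, with invented summary statistics:

mean, median, standard_deviation = 12.0, 10.0, 4.0   # invented figures
skewness = 3 * (mean - median) / standard_deviation
print(skewness)   # 1.5: positive, so the distribution is right skewed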

R2
See coefficient of determination.

Range
A measure of dispersion, the difference between the two extreme values in a series. See also maximum and minimum.

Ratio
A procedure much used in manipulating spreadsheets whereby a fraction or percentage is transformed into a number which shows the relation of one number to another. For example, the fraction ¼ is equivalent to the ratio 0.25, while to increase a variable by 25% we would multiply it by the ratio 1.25.

Real values
A variable denominated in constant prices and obtained by adjusting a nominal value (q.v.) (that is one denominated in current prices) by the change in prices since the base year, typically by the RPI but, for national income, by the GDP deflator. The following fictitious example illustrates how to create a constant price series for the value of a product using a spreadsheet (with the 1920 data on row 4 and the years in col. A):
 

 

Year    Average      Index of      RPI           Calculation of          Average price
        price in     average       (1913=100)    constant price          at 1920
        current      price                       series                  constant
        prices (£)   (1920=100)                                          prices

1920    405.10       100.0         244           (c4)                    100.0
1921    600.00       148.1         222           (e4)*(c5/c4)/(d5/d4)    162.8
1922    485.75       119.9         179           (e5)*(c6/c5)/(d6/d5)    163.5
1923    460.95       113.8         171           (e6)*(c7/c6)/(d7/d6)    162.4
1924    450.00       111.1         172           (e7)*(c8/c7)/(d8/d7)    157.6
1925    439.95       108.6         173           (e8)*(c9/c8)/(d9/d8)    153.2

The calculation of the constant price series has three parts. Using that for 1921 to illustrate the point they are:

cell e4                       the constant price series value for the preceding year

the ratio of cells c5/c4      the change in the nominal price of the product since the preceding year

the ratio of cells d5/d4      the change in retail prices since the preceding year

Whereas the nominal value indicates that the price of the fictional product rose by 8.6 per cent between 1920-5, the RPI fell by 29.1 per cent over the same period, so that the real increase - i.e. adjusted for changes in the purchasing power of money - was very much greater. Indeed, from these calculations the real price rose by 53.2%.
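
A minimal Python sketch of the same calculation, using the fictitious prices and RPI values from the table:

prices = [405.10, 600.00, 485.75, 460.95, 450.00, 439.95]   # 1920-5, current prices
rpi = [244, 222, 179, 171, 172, 173]                        # RPI, 1913=100

constant = [100.0]                               # 1920 base
for t in range(1, len(prices)):
    nominal_change = prices[t] / prices[t - 1]   # change in the product's price
    retail_change = rpi[t] / rpi[t - 1]          # change in retail prices
    constant.append(constant[-1] * nominal_change / retail_change)

print([round(v, 1) for v in constant])   # [100.0, 162.8, 163.5, 162.4, 157.6, 153.2]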

See also index numbers; nominal values.

Regression analysis
An application of correlation analysis (q.v.) to explore the association between two or more time-series (q.v.) variables. This is probably the most important form of quantitative analysis used by economic historians, having many applications: the calculation of trend lines (q.v.), explaining past variations in variables, estimating key parameters (such as elasticities (q.v.) and marginal propensities) and in forecasting the future movement of time series.

Excel provides the means to calculate linear regressions of one or more independent variables of the form:

Y = a+b.X

where   Y   is the dependent variable
        a   is the intercept on the Y axis, or constant
        b   is the x-coefficient, the slope of the regression equation
        X   is the independent variable

Excel produces a regression block of the form:


Regression Statistics
Multiple R             0.433
R Square               0.187
Adjusted R Square      0.136
Standard Error         0.595
Observations              18

ANOVA
____________________________________________________________________
                  df         SS         MS          F   Significance F
Regression         1      1.309      1.309      3.698            0.072
Residual          16      5.667      0.354
Total             17      6.977
____________________________________________________________________

____________________________________________________________________________________________________
              Coefficients  Standard Error   t Stat     P-value  Lower 95%  Upper 95%  Lower   Upper
                                                                                       95.0%   95.0%
____________________________________________________________________________________________________
Intercept            1.670           0.269    6.204   1.262E-05      1.099      2.241  1.099   2.241
X Variable 1         0.051           0.027    1.923       0.072     -0.005      0.109 -0.005   0.109
____________________________________________________________________________________________________
which translates into the equation:

Y = 1.67 + 0.05X             R2 = 0.19
       (0.27) (0.03)

where the figures in parentheses are the standard errors of the intercept and the X coefficient respectively. In the example above the regression equation has yielded a low R2 value, i.e. only 19% of the variance (q.v.) in Y is attributable to the regression of Y on X. In addition, statisticians usually require that the standard errors be less than half the value of the parameter estimates (see t-statistic), a condition not satisfied above for the X coefficient.
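
By way of illustration, the least-squares arithmetic for the two-variable case can be sketched in Python (a minimal sketch with invented data, not the Excel routine itself):

x = [1, 2, 3, 4, 5]                  # invented independent variable
y = [2.1, 2.9, 3.8, 5.1, 5.9]        # invented dependent variable
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# slope: the covariation of X and Y divided by the variation of X
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x              # intercept

print(round(a, 2), round(b, 2))      # 1.02 0.98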

Regression model
Regression analysis (q.v.) uses a model which assumes, amongst other things, that the relationship between Y and X is linear and that the disturbance term (q.v.) is normally distributed, with a zero mean, a constant variance and successive values which are uncorrelated with each other.
 

Standard deviation
A measure of the dispersion of a series around the mean value, with the larger the value the greater the dispersion. It is defined as the square root of the sum of the squared deviations of the observations from the mean divided by the total number of observations, namely:

s = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}

where \bar{X} is the mean.

The standard deviation is the square root of the variance (q.v.). Calculated by =stdev or =stdevp, depending on whether you are using a sample or a population.
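
A minimal Python sketch of the population formula, with an invented series:

import math

x = [2, 4, 4, 4, 5, 5, 7, 9]                             # invented series
mean = sum(x) / len(x)
variance = sum((xi - mean) ** 2 for xi in x) / len(x)
print(math.sqrt(variance))                               # 2.0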

Standard error of the intercept
In regression analysis (q.v.) this is the estimated standard deviation of the sampling distribution of the estimated intercept (a). The smaller the value of the standard error the more confident one can be that an individual estimate is close to the true value of the intercept. The standard error is used for conducting t-tests (q.v.).

Standard error of the X-coefficient
In regression analysis (q.v.) this is the estimated standard deviation of the estimated X-coefficient (b), with the same interpretation as for the standard error of the intercept (q.v.).

Standard error of the Y estimate
In regression analysis (q.v.) a measure of the spread of the residuals (q.v.) around the regression line, and also an estimator of the standard deviation of the disturbance term (u) (q.v.).

t-statistic
Although having a generalised use in probability and sampling, the t-statistic is here the ratio of an estimated coefficient to its standard error (q.v.), used to test whether the coefficient is significantly different from zero (see null hypothesis). Its interpretation requires that we know the degrees of freedom (q.v.) before looking up the critical values of the t-distribution in a statistics textbook.

Time series
A series of data in which the same variable or variables are measured at several points in time, with the time periods typically spaced at equal intervals (days, months, years, decades etc.).

Total
The sum of a set of numbers. Calculated by =sum.

Time trend
A time series beginning at 0 used as the independent variable (q.v.) in regression analysis (q.v.) when obtaining a trend line (q.v.).

Trend line
A line obtained by regression analysis (q.v.) which is the 'best fit' possible.

Variance
A measure of the dispersion of a series around the mean value, with the larger the value the greater the dispersion. It is defined as the sum of the squared deviations of the observations from the mean divided by the total number of observations, namely:

v = \frac{\sum (X - \bar{X})^2}{N}

where \bar{X} is the mean.

The variance is the square of the standard deviation (q.v.). Calculated by =var or =varp, depending on whether you are using a sample or a population.

Weighted average
When calculating the arithmetic mean (q.v.) it is assumed that each and every value has an equal weighting (implicitly, that every case carries the same weight). Thus (where V and W denote values and weights respectively) an unweighted average can be written:

(V1 + V2 + V3 + V4 + ... + Vn)
______________________________
              N

However many averages, particularly those produced from index numbers (q.v.), are actually composites of other series (for example, the RPI) in which the individual components are not equally weighted. Thus, for example, in calculating the monthly RPI statisticians at the CSO do not give the same significance to movements in the price of potatoes as they do to changes in the mortgage interest rate. A weighted average thus allows each variable to have a different weight attached, with the result divided by the sum of the weights rather than by N:

(V1*W1) + (V2*W2) + (V3*W3) + ... + (Vn*Wn)
___________________________________________
          W1 + W2 + W3 + ... + Wn
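
A minimal Python sketch, with invented values and weights:

values = [10.0, 20.0, 30.0]                      # invented values
weights = [0.5, 0.3, 0.2]                        # invented weights

weighted_mean = (sum(v * w for v, w in zip(values, weights))
                 / sum(weights))
print(round(weighted_mean, 1))   # 17.0, as against an unweighted mean of 20.0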

See also arithmetic mean.



These pages are maintained and owned by Dr Roger Middleton

© R. Middleton 1997. Last modified 30 June 1998.