IT1 Logo

WEEK 7+ ADDITIONAL EXCEL EXERCISE I: 
ANALYSING CROSS-SECTIONAL DATA 
 
 
 

 
 
 
 
 

[p015aa]



Introduction
This first additional EXCEL exercise, an extension of one contained in Judge (1990, ch. 5), uses data from the Monopolies and Mergers Commission report into National Coal Board (now British Coal) deep mines for 1981/2 (HMSO 1983) to show how cross-sectional data can be analysed using a spreadsheet.

The objectives of these exercises are:

Some summary statistics which detail the long-run contraction of the British coal industry, together with a near-doubling in labour productivity, are given in Table 1 below:

NCB operations, selected years 1947 and 1981/2
 

 

 

1947

1981/2

1.

No. of NCB producing collieries at

958

200

 

year end

 

 

2.

Output (m tonnes):

 

 

 

NCB deep mined

187.5

108.9

 

Opencast

10.4

14.3

 

Licensed

2.1

1.1

 

TOTAL

200.0

124.3

2.

Average colliery labour force

703.9

218.5

 

('000s)

 

 

3.

Output per man-year (tonnes)

267.0

497.0

SOURCE: Ashworth (1986, table A1)

Exercise 1: calculating measures of central tendency and dispersion for two areas
The resource file for this and subsequent exercises is EH088.XLS. This contains the deep mines' operating results for 1981/2 for the twelve NCB regions. To begin with we will work with the North East and South Wales, respectively parts B and L of the resource file (the colliery number, name and output data). Accordingly, copy the labels (cells A12:C14) and the data for each of the two regions required (respectively cells A34:C55 and A226:C258) to a new worksheet, which should be saved to your floppy disk. You will find it much easier to follow these class notes if, when you copy the resource data, your top left-hand cells are A2 for the labels, A 6 for B. North East and A31 for L. South Wales.

Casual inspection of the data for the two regions suggests that there was great variation in the saleable output of each colliery and that the average output of collieries in the North East was rather larger than that of South Wales. Our first exercise, therefore, is to calculate some summary statistics which will reveal the exact differences between the regions. These statistics form the core of what are known as measures of central tendency and dispersion.

Since we are going to sort our data, we need to make a copy first of the colliery numbers and the saleable output. Assuming that you have copied the resource data as detailed above, copy cells A6:A8 to I6:I8 and cells C6:C8 to J6:J8. Repeat the same exercise for the South Wales data. Then, enter the following into cells L6:L15 and L31:L40:

Summary statistics:
___________________

Minimum
Maximum
Mean
Median
Mode
Standard deviation
N
Coefficient of variation
___________________

You will need to increase the column width, by dragging the column right border with the mouse, to accommodate the full width of these labels. Next, we are going to use a shortcut facility for entering formulae, that of naming a block of cells and using the name in the formular rather than the cell addresses. Begin by highlighting the saleable output figures for the North East. Next, click in the name box on the formula bar (the section on the left which usually contains the cell address). Type NE in the box and then press Enter. Do the same for South Wales, naming it bSW. With this stage completed save your file.

We are now ready to enter the formulae that relate to our labels. Instead of pointing for each of the statistical measures we are about to use, you just place the range name in brackets, i.e. =min(NE) for the minimum value of the North East data range.

Ignoring the median and coefficient of variation for the moment, use the following formulae in col. (J) in relation to your labels: minimum =min; maximum =max; average =average; mode =mode; standard deviation =stdev; and N =count.

NB There are two functions that calculate the standard deviation: stdev and stdevp. The former is used to calculate the standard deviation of a sample drawn from a population, the latter for the whole population. In the case of NE and SW you are using a sample, but will later use the national output figures.

The standard deviation is a measure of the dispersion of a series around the mean value, and the larger its value the greater the dispersion (defined as the square root of the sum of the squared deviations of the observations from the mean divided by the total number of observations; see Floud 1979, p. 74). However, when the means differ so greatly between two samples, as they will in this case, the standard deviation is of limited use. Instead compute the coefficient of variation using cell references:

Standard deviation
________________*100
Mean

Since this is unit free you can use it to compare the dispersion of samples which have very different means. Finally, we compute the median - the middle value, another measure of central tendency. For this we have to sort the data into ascending order. This could be done manually, though it would be very tedious. Instead, EXCEL can automate the process. First, select the figures to be sorted and then select Data/Sort from the main menu and ascending from the dialogue box. Repeat the exercise for South Wales. The median for the North East, where there are 22 observations, will be midway between the eleventh and twelfth observations; that for South Wales, where there are 33 observations, the seventeenth. If all has gone well you should have the following results:
 
 

 

North East

South Wales

Minimum

76.0

62.0

Maximum

1,763.0

589.0

Mean

604.3

224.6

Median

376.5

189.0

Mode

-

189.0

Standard deviation

480.6

130.1

N

22.0

33.0

Coefficient of

79.5

57.9

variation

 

 

 

Both the standard deviation and the coefficient of variation provide a measure of the variability of observations around the mean. However, they give no information about the distribution of observations around the average, i.e. skewness. Whilst the difference between the mean and the median gives some sense of how skewed is the distribution, a more suitable measure is Pearson's modified coefficient of skewness, defined as:

3*(Mean-Median)
_______________
Standard deviation

(See Yeomans 1968, I, pp. 117-18 for a discussion of alternative measures and why Pearson’s modified coefficient is to be preferred).

Calculate this coefficient for our two regions. You should obtain results of 1.42 and 0.82 for the North East and South Wales respectively. A symmetrical distribution would yield a result of 0 (i.e. mean and median, and mode, were identical); a positive result for a right-skewed distribution and a negative result for a left skewed result. The higher positive result for the North East than South Wales demonstrates that it has a longer tail of lower values to the right. All of this will become clearer as we chart the series as frequency distributions.

First, we need to sort our data once more, this time into class intervals (as in the table below). With your cursor in col. (K) insert three columns. This will push your Summary statistics block three to the right. In cell K8, the first row of data for the North East, start to input the class intervals, beginning with <100 and working through to 1700-99 in intervals of 100 (as in the first column of the table below).

In col (L), and matching the data in col. (K), insert the upper class intervals (from 99 to 1,799) using the Edit/Fill/Series facility. This is called the bin range and it should accommodate all the observations you have for the North East (lowest 76, highest 1,763). To produce the frequency distribution, first ensure that you have two columns empty to the right of your data. Select Tools/Data Analysis, then Histogram. (NB If Tools/Data Analysis is not showing in your menu, first select Tools/Add in and then click on Analysis ToolPak). Next, insert the appropriate Input range (NE or SW), bin range and output range (the top left hand cell of the empty columns). A block of numbers will now appear in cells M8:M26. These tell you the number of observations that fall within each class, the final number - which can be deleted - telling you that there are no observations above the top interval specified in the bin range.

Complete the same exercise for the South Wales data. You should obtain the following results recorded in the table below. This frequency distribution shows how different the dispersion is, as well as the central tendency, of the two regions. This will become even clearer as we produce histograms, i.e. column graphs of these distributions.

Frequency distribution of saleable output ('000 tonnes): North East and South Wales regions
 

 

North

South Wales

 

East

 

0-99

1

4

100-99

3

14

200-99

5

9

300-99

3

1

400-99

0

3

500-99

3

2

600-99

0

0

700-99

0

0

800-99

1

0

900-99

0

0

1,000-99

1

0

1,100-99

1

0

1,200-99

2

0

1,300-99

0

0

1,400-99

1

0

1,500-99

0

0

1,600-99

0

0

1,700-99

1

0

 

Accordingly, the final part of this exercise is to produce the following two column graphs. The X-axes are the class intervals and the series the frequencies. Your two graphs will show why the Pearson modified coefficients of skewness are so different and thus the difficulties about generalising from the experience of any one pit without knowledge of their central tendency and dispersion.

 

Exercise 2: from sample to population - summary statistics of national coal production
The purpose of this exercise is to take the notion of significance one step further. So far you have summary statistics on two regions, or two samples. But to know their significance you need summary statistics of the whole population. Accordingly, input the whole of the data on saleable output in the resource file into your worksheet, just to the right of your existing data. To make this easier it has been provided as a continuous series in cells I15:J212 of the resource file. Proceed as before to calculate the summary statistics (including the mode) and to produce a histogram (as below) of the frequency distribution of total NCB coal output.

 

Think about the representativeness of the two sample regions, the North East and South Wales. For the record, it appears as if Donisthorpe, in the South Midlands, is a typical pit in that its output was equal to the national mean. (This could be done by manual inspection, but a much faster method is to use Edit/Find to find whether there is an entry for 539).

Exercise 3: working with grouped data
Having worked with ungrouped data, and calculated the standard measures of central tendency and dispersion, we now progress to the same calculations but with grouped data. Historians are frequently presented with data that has already undergone some processing, by for example being placed in class intervals, and the procedure for calculating the mean and standard deviation is rather different. (A problem emerges where the initial and terminal class intervals are open-ended. For example, if instead of the four class intervals after 100 there had only been one (say 100) a decision would have had to be made; see Judge 1990, pp. 84-5 for one solution to this problem).

Your resource file is EH095.XLS which contains data on the top 50 British companies by market valuation for 1904/5 and 1934/5. We will work with the former year, so copy the class intervals and frequencies to a new sheet in your file.

Replace the existing labels in cols (A) and (B) with Market value &pound;m and No. of firms respectively. Our problem is that whilst we know the number of firms within each class interval we do not know the value of the individual observations. Statisticians finesse this problem by taking the mid-point of each class interval, making the implicit assumption that for any class interval the actual observations are uniformly distributed around this mid-point.

Label column (C) x (the mid-points) and add f (the frequency) to the end of the label of column (B). The mid-points for each class interval now need to be inserted in col. (C). The first class interval is, in effect, for any value greater than 0 but smaller than 10, conventionally from 0.01 to 9.99. Taking the lower from the upper limit and dividing by 2 yields a mid-point of 4.99. Increment this by 10 for each successive class interval to produce a series that runs from 4.99 through to 134.99.

The formulae for the mean and standard deviation are in turn:

X = fx/ f
___________
s = (( fx2/ f)-X2)

where X is the mean; is the symbol representing the sum of the argument; f the frequency; x the mid-point; s the standard deviation; _ represents the square root of the argument; and a number is squared as follows:

X2 = X^2

Thus col. (D) and (E) become - and should be labelled as such - f.x and f.x^2 respectively. Use the appropriate formulae to derive these estimates, the =SUM function to derive fx and f.x^2  and the =SQRT for the square root. It is then very straightforward to produce the mean and standard deviation: respectively 27.6 and 30.5 (had we done the calculation from the ungrouped data in EH095.XLS the following results are obtained: 28.4 for the mean and 29.9 for the standard deviation, i.e. grouping does affect the results). Finally, graph your grouped data as a histogram.
 

References

Ashworth, W. (with assistance of M. Pegg) (1986) The history of the British coal industry. Vol. 5: 1946-1982: the nationalized industry.
Floud, R. (1979) An introduction to quantitative methods for historians, 2nd edn.
Haskins, L. and Jeffreys, K. (1990) Understanding quantitative history.
HMSO (1983) Monopolies and Mergers Commission, National Coal Board: A report on the efficiency and costs in the development, production and supply of coal by the NCB. 2 vols. Cmnd. 8920.
Judge, G. (1990) Quantitative analysis for economics and business using LOTUS 1-2-3.
Yeomans, K.A. (1968) Statistics for the social scientist. Vol. I: Introducing statistics.

To next additional exercise.

 
To IT-MA home page
To Department of Historical Studies home page.


These pages are maintained and owned by Dr Roger Middleton

(c)R. Middleton 1997. Last modified 30 June 1998.