An introduction to sphericity
The article is written for a general audience of postgraduate and graduate researchers. The technical material goes slightly beyond what is covered in most textbooks, although there is still some simplification (which is usually indicated in the text). The aim is to give advice about best practice for checking and dealing with sphericity in repeated measures ANOVA. Some of the content is personal opinion (which I have tried to indicate in the text). I include a short bibliography of my sources at the end for readers who want to explore the topic in more detail.
Some background
Sphericity is a mathematical assumption in repeated measures ANOVA designs. Let's start by considering a simpler ANOVA design (e.g., one-way independent measures ANOVA).
In independent measures ANOVA one of the mathematical assumptions is that the variances of the populations that the groups are sampled from are equal. This homogeneity of variance assumption more or less follows from the null hypothesis being tested in ANOVA - if the treatment has no effect on the thing being measured (the DV) then we can consider all the groups to be sampled from the same population.^{1} However, because we're taking samples we'd be very lucky to observe exactly equal variances (even if the assumption were perfectly met). Real data is rarely that neat! What we'd expect to get (most of the time) is groups with similar variances.^{2}
The sphericity assumption can be thought of as an extension of the homogeneity of variance assumption in independent measures ANOVA. Why does the assumption need to be extended? To understand this we need to introduce the ANOVA covariance matrix.
The covariance matrix
What is a covariance matrix? In a nutshell it is a matrix that contains the covariances between levels of a factor in an ANOVA design.^{3} A covariance is the shared or overlapping variance between two things (sometimes called variance in common). Let's look at an example of the layout for a one-factor ANOVA design with four levels and therefore four samples (called A_{1}, A_{2}, A_{3} and A_{4}):
Samples:   A_{1}        A_{2}        A_{3}        A_{4}
A_{1}      s_{1}^{2}    s_{12}       s_{13}       s_{14}
A_{2}      s_{21}       s_{2}^{2}    s_{23}       s_{24}
A_{3}      s_{31}       s_{32}       s_{3}^{2}    s_{34}
A_{4}      s_{41}       s_{42}       s_{43}       s_{4}^{2}
The first thing to notice is that the main diagonal cells in the matrix (running top left to bottom right) contain the variances of the four levels (e.g., s_{1}^{2} is the variance of A_{1}).^{4} The second thing to notice is that the covariances are therefore in the cells off the main diagonal (called the off-diagonal cells). The third thing to notice is that the covariances are mirrored above and below the main diagonal. (The term s_{14} is the covariance between samples A_{1} and A_{4}, while s_{41} is the covariance between samples A_{4} and A_{1}. As this is the variance they have in common, s_{14} = s_{41}.)
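To make this concrete, here is a short sketch in Python with NumPy (the scores are made up purely for illustration, and the code is my own addition rather than part of the statistical argument) showing that a covariance matrix computed from sample data has exactly this structure: variances on the main diagonal and mirrored covariances off it.

```python
import numpy as np

# Hypothetical scores: rows are participants, columns are the four levels A1..A4
data = np.array([
    [8.0,  9.0, 12.0, 4.0],
    [6.0, 11.0, 16.0, 3.0],
    [9.0,  8.0, 12.0, 5.0],
    [7.0, 10.0, 14.0, 6.0],
])

# rowvar=False treats each column (level) as a variable
S = np.cov(data, rowvar=False)

# The matrix is symmetric: s_14 equals s_41, and so on
assert np.allclose(S, S.T)

# The main diagonal holds the variances of each level
assert np.allclose(np.diag(S), data.var(axis=0, ddof=1))

print(S)
```

The same check works on any repeated measures data set laid out with one column per level.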
What does the covariance matrix look like for independent measures ANOVA? Here is an example for a one-way independent measures ANOVA design with four levels (and hence four groups):
Samples:   A_{1}        A_{2}        A_{3}        A_{4}
A_{1}      s_{1}^{2}    0            0            0
A_{2}      0            s_{2}^{2}    0            0
A_{3}      0            0            s_{3}^{2}    0
A_{4}      0            0            0            s_{4}^{2}
The most striking observation is that all the covariances are zero. Why? The answer is fairly straightforward. In an independent measures design the observations should be independent and therefore uncorrelated with each other.^{5} Two samples that are uncorrelated will share no variance (and the covariance will be zero). So in this relatively simple case we only have to worry about homogeneity of variance - which would lead us to expect that the observed variances on the main diagonal should be similar.
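If you want to see this in action, here is a short simulation sketch in Python with NumPy (my own illustration; the group means and the sample size are arbitrary): scores for four independent groups are generated separately, and the off-diagonal entries of the resulting covariance matrix come out close to zero.

```python
import numpy as np

rng = np.random.default_rng(42)

# Four independent groups: nobody appears in more than one column,
# so the columns are generated with no connection to one another
n = 100_000
groups = rng.normal(loc=[10.0, 12.0, 14.0, 11.0], scale=1.0, size=(n, 4))

S = np.cov(groups, rowvar=False)

# Off-diagonal covariances hover near zero (sampling error only)
off_diag = S[~np.eye(4, dtype=bool)]
assert np.all(np.abs(off_diag) < 0.05)

print(np.round(S, 3))
```

With a real (small-n) data set the off-diagonal entries will wander further from zero, but in the population they are exactly zero when independence holds.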
Reminder: Assumptions such as homogeneity of variance, sphericity and so forth are assumptions about the populations we are sampling from. I'll try and indicate this as I go through, but it sometimes gets clumsy to keep repeating "in the population being sampled" all the time! We expect samples to have similar characteristics to the populations being sampled, but only in rare cases will the samples show exactly the same pattern of variance (or whatever) as the population. It is also worth adding that large samples are more similar to the populations they are sampled from than small samples.
Finally, please note that the statistical term "population" is an abstract one. We are referring to a population of data points that we might potentially be sampling, not a fixed entity such as the population of a country. (In other contexts, such as market research, people sometimes deal with such fixed populations, but this requires slightly different methods from those used in most sciences.)
What is the sphericity assumption?
Compound symmetry
The sphericity assumption is an assumption about the structure of the covariance matrix in a repeated measures design. Before we describe it in detail let's consider a simpler (but stricter) condition. This one is called compound symmetry. Compound symmetry is met if all the covariances (the off-diagonal elements of the covariance matrix) are equal and all the variances are equal in the populations being sampled. (Note that the variances don't have to equal the covariances.) Just as with the homogeneity of variance assumption we'd only rarely expect a real data set to meet compound symmetry exactly, but provided the observed covariances are roughly equal in our samples (and the variances are OK too) we can be pretty confident that compound symmetry is not violated.
The good news about compound symmetry
If compound symmetry is met then sphericity is also met. So if you take a look at the covariance matrix and the covariances are similar and the variances are similar then we know that sphericity is not going to be a problem.^{6}
The bad news about compound symmetry
As compound symmetry is a stricter requirement than sphericity we still need to check sphericity if compound symmetry isn't met. This is where it gets technical (well, even more technical).
The sphericity assumption
Let's take a look at the raw data. Imagine that the first few observations of A_{1}, A_{2}, A_{3} and A_{4} are as follows:

                 A_{1}   A_{2}   A_{3}   A_{4}
Participant 1      8       9      12       4
Participant 2      6      11      16       3
Participant 3      9       8      12       5
etc.              ...     ...     ...     ...
For each possible pair of levels of factor A (e.g., A_{1} and A_{2} or A_{2} and A_{3}) we can calculate the difference between the observations. For example:

                 A_{1}-A_{2}   A_{1}-A_{3}   A_{1}-A_{4}   etc.
Participant 1        -1            -4            +4
Participant 2        -5           -10            +3
Participant 3        +1            -3            +4
etc.                 ...           ...           ...
We could then calculate variances for each of these differences (e.g., s_{12}^{2} or s_{24}^{2}).
The sphericity assumption is that all the variances of the differences are equal (in the population sampled). In practice, we'd expect the observed sample variances of the differences to be similar if the sphericity assumption is met.
Using the covariance matrix to check the sphericity assumption
We can check the sphericity assumption using the covariance matrix, but it turns out to be fairly laborious. (Later on I'll discuss some simpler ways to check sphericity using output from SPSS and similar statistics packages.) The variance of a difference can be computed using a version of the variance sum law:
s_{xy}^{2} = s_{x}^{2} + s_{y}^{2} - 2(s_{xy})
In other words, the variance of a difference is the sum of the two variances minus twice their covariance. (Here s_{xy}^{2} denotes the variance of the difference between levels x and y, while s_{xy} denotes their covariance.) A quick check shows that this works out as zero when the two levels have equal variances and share all their variance (i.e., s_{x}^{2} = s_{y}^{2} = s_{xy}).
(Note that we could also calculate the variances of the differences directly from the raw data. We'd simply calculate the differences between all the possible pairs of levels of a factor. For example, using Excel or SPSS we could define a new column value as one level minus another level and then calculate the variance of each column using the built-in descriptive statistics of the program. This would get very laborious if we had lots of levels, so I'd recommend the above method if you really want the variances of the differences and you already have a covariance matrix. Fortunately this isn't necessary in most cases - as I'll discuss later.)
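For readers who like to check such things numerically, here is a sketch in Python with NumPy (the data are simulated, purely for illustration) confirming that the two routes - direct calculation from the raw differences and the variance sum law - give exactly the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two correlated repeated measures (hypothetical scores for 50 participants)
x = rng.normal(10, 2, size=50)
y = 0.6 * x + rng.normal(0, 1, size=50)

# Variance of the differences, computed directly from the raw data
direct = np.var(x - y, ddof=1)

# The same quantity via the variance sum law:
# s^2 of (x - y) = s^2_x + s^2_y - 2 * s_xy
s_xy = np.cov(x, y)[0, 1]
via_law = np.var(x, ddof=1) + np.var(y, ddof=1) - 2 * s_xy

assert np.isclose(direct, via_law)
print(direct, via_law)
```

The agreement is exact (up to floating point) because the variance sum law is an algebraic identity, not an approximation.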
An example
This example is adapted from Kirk (1995). Imagine the observed covariance matrix for our design above is this:
Samples:   A_{1}   A_{2}   A_{3}   A_{4}
A_{1}       10       5      10      15
A_{2}        5      20      15      20
A_{3}       10      15      30      25
A_{4}       15      20      25      40
s_{xy}^{2} = s_{x}^{2} + s_{y}^{2} - 2(s_{xy})
s_{12}^{2} = 10 + 20 - 2(5) = 20
s_{13}^{2} = 10 + 30 - 2(10) = 20
s_{14}^{2} = 10 + 40 - 2(15) = 20
s_{23}^{2} = 20 + 30 - 2(15) = 20
s_{24}^{2} = 20 + 40 - 2(20) = 20
s_{34}^{2} = 30 + 40 - 2(25) = 20
This example has been contrived so that the variances of the differences are exactly equal (which would be unusual in real data), but it does demonstrate that lack of compound symmetry does not necessarily mean that sphericity is violated. (Compound symmetry is a sufficient, but not necessary, condition for sphericity to be met.)^{7} In this example, compound symmetry is clearly not met (the largest variances and covariances are four or five times bigger than the smallest), but sphericity holds.
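If you'd rather not grind through the six calculations by hand, the whole check can be automated. Here is a sketch in Python with NumPy (again my own illustration) that applies the variance sum law to every pair of levels of the matrix above.

```python
import numpy as np
from itertools import combinations

# Covariance matrix adapted from Kirk (1995), as given in the text
S = np.array([
    [10.0,  5.0, 10.0, 15.0],
    [ 5.0, 20.0, 15.0, 20.0],
    [10.0, 15.0, 30.0, 25.0],
    [15.0, 20.0, 25.0, 40.0],
])

# Variance of each pairwise difference via the variance sum law
var_diffs = []
for i, j in combinations(range(4), 2):
    var_diffs.append(S[i, i] + S[j, j] - 2 * S[i, j])
    print(f"variance of (A{i+1} - A{j+1}) = {var_diffs[-1]:.0f}")

# Every pair works out to 20, so sphericity holds exactly
```

The loop reproduces the six hand calculations in the text and makes it easy to repeat the check for matrices with more levels.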
What to do if sphericity is violated in repeated measures ANOVA
There are two broad approaches to dealing with violations of sphericity. The first is to use a correction to the standard ANOVA tests. The second is to use a different test (i.e., one that doesn't assume sphericity).
In the following subsections I give general advice on what to do if sphericity is violated. This advice tends to hold well in most cases for factorial repeated measures designs but may be problematic for mixed ANOVA designs (discussed later under Complications).
Correcting for violations of sphericity
The best known corrections are those developed by Greenhouse and Geisser (the Greenhouse-Geisser correction) and Huynh and Feldt (the Huynh-Feldt correction).^{8} Each of these corrections works in roughly the same way. They all attempt to adjust the degrees of freedom in the ANOVA test in order to produce a more accurate significance (p) value. If sphericity is violated the p values need to be adjusted upwards (and this can be accomplished by adjusting the degrees of freedom downwards).
The first step in each test is to estimate something called epsilon.^{9} For our purposes we can consider epsilon to be a descriptive statistic indicating the degree to which sphericity has been violated. If sphericity is met perfectly then epsilon will be exactly 1. If epsilon is below 1 then sphericity is violated. The further epsilon gets from 1 the worse the violation.
How bad can epsilon get? Well, it depends on the number of levels (k) on the repeated measure factor.
Lower bound of epsilon = 1/(k - 1)
So for 3 levels epsilon can go as low as 0.5, for 6 levels it can go as low as 0.2 and so forth. The more levels on the repeated measures factor the worse the potential for violations of sphericity.^{10}
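The lower bound is trivial to compute; a tiny Python function (my own illustration) makes the pattern obvious.

```python
def epsilon_lower_bound(k: int) -> float:
    """Worst-case (lowest possible) epsilon for a repeated measures factor with k levels."""
    return 1.0 / (k - 1)

# More levels means more room for sphericity to be violated
for k in (2, 3, 4, 6):
    print(f"k = {k}: lower bound of epsilon = {epsilon_lower_bound(k):.2f}")
```

Note the k = 2 case returns 1, which anticipates the special case discussed later: with only two levels sphericity cannot be violated.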
The three common corrections fall into a range from most to least strict. First consider the most strict: we could use the lower-bound value of epsilon and correct for the worst possible case. Fortunately, there is a much better option. The Greenhouse-Geisser correction is a conservative correction (it tends to underestimate epsilon when epsilon is close to 1 and therefore tends to overcorrect). Huynh and Feldt produced a modified version for use when the true value of epsilon is thought to be near or above 0.75.
The Huynh-Feldt correction tends to overestimate sphericity, so some statisticians have suggested using the average of the Greenhouse-Geisser and Huynh-Feldt corrections. My advice would be to consider the aims of the research and the relative costs of Type I and Type II errors. If Type I errors are considered more costly (especially if the estimates of epsilon fall below 0.75) then stick to the more conservative Greenhouse-Geisser correction.
Using the correction is fairly simple. Replace the treatment and error d.f. by (epsilon*d.f.). So an epsilon of 0.6 would turn an F_{3,30} test into an F_{1.8,18} test.^{11}
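Here is a sketch of the correction in Python (the F value, degrees of freedom and epsilon are all hypothetical, and I'm assuming SciPy is available for the F distribution). It also shows that exact p values are no problem even with fractional d.f.

```python
from scipy import stats

F = 4.0            # hypothetical observed F ratio
df1, df2 = 3, 30   # uncorrected treatment and error d.f. (k = 4 levels, n = 11)
eps = 0.6          # hypothetical epsilon estimate

# Survival function gives the upper-tail p value for the F distribution
p_uncorrected = stats.f.sf(F, df1, df2)
p_corrected = stats.f.sf(F, eps * df1, eps * df2)  # d.f. become 1.8 and 18

# Shrinking the d.f. pushes the p value upwards (a more cautious test)
assert p_corrected > p_uncorrected
print(f"p (uncorrected) = {p_uncorrected:.4f}, p (corrected) = {p_corrected:.4f}")
```

This is exactly what packages such as SPSS do behind the scenes when they report Greenhouse-Geisser or Huynh-Feldt p values.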
Using these corrections seems to work well for relatively modest departures of epsilon from 1, or when sample sizes are small.
Using MANOVA
An alternative approach is to use a test that doesn't assume sphericity. In the case of repeated measures ANOVA this usually means switching to multivariate ANOVA (MANOVA for short). Some computer programs print out MANOVA automatically alongside repeated measures ANOVA. While this can be confusing, it does make it easy to compare results for different tests and corrections. If sphericity is met (i.e., epsilon = 1) all the p values for a given test should be identical. The degree to which they differ can be informative. If there is a wide discrepancy between different tests or corrections then this suggests that the sphericity assumption may be severely violated and that one of the more conservative tests should be reported (e.g., Greenhouse-Geisser or MANOVA).
In general MANOVA is less powerful than repeated measures ANOVA and therefore should probably be avoided. However, when sample sizes are reasonably large (n > k + 10) and epsilon is low (< 0.7) MANOVA may be more powerful and should probably be preferred. Other factors (such as the correlations between samples) can influence the relative power of MANOVA and ANOVA, but these are beyond the scope of this summary.
How to check sphericity
In this section I will focus on information readily available in SPSS (and most good statistics packages).
Factorial repeated measures ANOVA
If there is more than one repeated measures factor consider each factor separately. (See also Special cases: factors with 2 levels.)
A warning about Mauchly's sphericity test
Many textbooks recommend using significance tests such as Mauchly's to test sphericity. In general this is a very bad idea.
Why? First, tests of statistical assumptions - and Mauchly's is no exception - tend to lack statistical power (they tend to be bad at spotting violations of assumptions when n is small). Second, tests of statistical assumptions - and, again, Mauchly's is no exception - tend not to be very robust (unlike ANOVA and MANOVA they are poor at coping with violations of assumptions such as normality). Third, significance tests don't reveal the degree of violation (e.g., with large n even a poor test like Mauchly's will show significance if there are very minor violations of sphericity; with low n the poor power means that even severe violations may not be detected). Fourth, significance tests of assumptions tend to be used as substitutes for looking at the data - if you followed the advice of many popular texts you'd never look at the descriptive statistics at all (e.g., the variances, the covariance matrix, estimates of epsilon and so forth). Fifth, I don't like them.^{12}
Using Mauchly's sphericity test
The principle of the test is fairly simple. The null hypothesis is that sphericity holds (I like to think of it as a test that the true value of epsilon = 1). A significant result indicates evidence that sphericity is violated (i.e., evidence that the true value of epsilon is below 1).
Epsilon
I would recommend using estimates of epsilon to decide whether sphericity is violated. If epsilon is close to 1 then it is likely that sphericity is intact (or that any violation is very minor). If epsilon is close to the lower bound (see above) then a correction or alternative procedure such as MANOVA is likely to be necessary. Exactly where to draw the line is a matter of personal judgement, but it is often instructive to compare p values for the corrected and uncorrected tests. If they are fairly similar then there is little indication that sphericity is violated. If the discrepancy is large then one of the corrections (or MANOVA) should probably be used.
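If your software doesn't report epsilon, the Greenhouse-Geisser (Box) estimate can be computed directly from a covariance matrix. Here is a sketch in Python with NumPy using the standard double-centring formula (my own illustration); applied to the Kirk (1995) matrix from the earlier example it returns 1, as it should when sphericity holds exactly.

```python
import numpy as np

def gg_epsilon(S: np.ndarray) -> float:
    """Greenhouse-Geisser (Box) estimate of epsilon from a k x k covariance matrix."""
    k = S.shape[0]
    # Double-centre the matrix: subtract row and column means, add back the grand mean
    row = S.mean(axis=1, keepdims=True)
    col = S.mean(axis=0, keepdims=True)
    S_c = S - row - col + S.mean()
    # Epsilon = trace(S_c)^2 / ((k - 1) * sum of squared elements of S_c)
    return float(np.trace(S_c) ** 2 / ((k - 1) * np.sum(S_c ** 2)))

# Kirk's (1995) example matrix from the text: sphericity holds, so epsilon = 1
S = np.array([
    [10.0,  5.0, 10.0, 15.0],
    [ 5.0, 20.0, 15.0, 20.0],
    [10.0, 15.0, 30.0, 25.0],
    [15.0, 20.0, 25.0, 40.0],
])
print(gg_epsilon(S))  # 1.0 (up to floating point)
```

With a matrix that violates sphericity the function will return a value between the lower bound 1/(k - 1) and 1.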
The covariance matrix
If estimates of epsilon are not readily available then lower-bound procedures can be used (see above) or the covariance matrix can be consulted. If compound symmetry holds then it is safe to proceed with repeated measures ANOVA. If compound symmetry does not hold it is relatively simple (if time-consuming) to calculate the variances of the differences for each factor from the covariance matrix.
Complications
Special cases: factors with 2 levels (and the paired t test)
If k = 2 (a repeated measures factor with only two levels) then the sphericity assumption is always met. Using the lower-bound formula one can see that when k = 2 epsilon can't be lower than 1/(k - 1) = 1/(2 - 1) = 1. This is also true for the paired t test (in effect a one-way repeated measures ANOVA where k = 2).
Why isn't sphericity a problem when there are only two levels? Well, think about the covariance matrix:
Samples:   A_{1}       A_{2}
A_{1}      s_{1}^{2}   s_{12}
A_{2}      s_{21}      s_{2}^{2}
There are two covariances, s_{21} and s_{12}. The covariances above and below the main diagonal are constrained to be equal (because the shared variance between level 1 and level 2 is the same thing as the shared variance between level 2 and level 1). In effect there is only one covariance. Similarly, if we calculated the variance of the difference s_{12}^{2} we would realize there is only one such variance. Sphericity is met if all the variances of the differences are equal. As there is only one, it can't fail to be equal to itself. For information, Mauchly's sphericity test can't be computed if d.f. = 1 (i.e., if k = 2) and some computer programs give confusing messages or printouts if you try.
Note that sphericity subsumes the standard homogeneity of variance assumption. In effect, we are only interested in the variances of the differences. When k = 2 there is only one variance of the difference between levels and we can ignore differences in the 'raw' level variances themselves.
Multiple comparisons
In general Bonferroni t tests are recommended for repeated measures ANOVA (whether or not sphericity is violated). The Bonferroni correction relies on a general probability inequality and therefore isn't dependent on specific ANOVA assumptions. As Bonferroni corrections tend to be conservative, a number of modified Bonferroni procedures have been proposed. Some are specific to certain patterns of hypothesis testing, but others such as Holm's test (or the similar Larzelere and Mulaik test) are more powerful than standard Bonferroni corrections and should be used more widely (and not just for ANOVA).
Most statisticians seem to recommend specific (rather than pooled) error terms for repeated measures factors (i.e., calculate the SE for t using only the conditions being compared, rather than using the square root of the MSE term from the ANOVA table). This advice also extends to contrasts, which can be easily calculated by performing paired t tests on weighted averages of the appropriate means. Using a specific error term should avoid problems with sphericity (e.g., see Judd et al., 1995) for more or less the same reason that sphericity is not a problem for factors with only 2 levels.
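Here is a sketch of that approach in Python (the data and contrast weights are hypothetical, and I'm assuming SciPy is available): each participant gets a single contrast score, which is then tested against zero with a one-sample t test, so the error term comes only from the conditions involved in the contrast.

```python
import numpy as np
from scipy import stats

# Hypothetical repeated measures data: rows are participants, columns A1..A4
data = np.array([
    [8.0,  9.0, 12.0, 4.0],
    [6.0, 11.0, 16.0, 3.0],
    [9.0,  8.0, 12.0, 5.0],
    [7.0, 12.0, 13.0, 6.0],
    [5.0, 10.0, 15.0, 4.0],
])

# Contrast: A1 versus the average of A2 and A3 (weights sum to zero)
weights = np.array([1.0, -0.5, -0.5, 0.0])

# One contrast score per participant - the specific error term comes only
# from the conditions being compared, so sphericity across all four levels
# is irrelevant to this test
scores = data @ weights

t, p = stats.ttest_1samp(scores, popmean=0.0)
print(f"t({len(scores) - 1}) = {t:.2f}, p = {p:.4f}")
```

This is numerically equivalent to a paired t test on the appropriately weighted means, which is the recommendation in the paragraph above.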
Mixed designs
Mixed designs (combining independent and repeated measures factors) muddy the waters somewhat. Mixed measures ANOVA requires that multi-sample sphericity holds. This more or less means that the covariance matrices should be similar between groups (i.e., across the levels of the independent measures factors). Provided group sizes are equal (or at least roughly equal) the Greenhouse-Geisser and Huynh-Feldt corrections perform well when multi-sample sphericity doesn't hold and can therefore still be used. If these corrections are inappropriate, or if group sizes are markedly unequal, then more sophisticated methods are required (Keselman, Algina & Kowalchuk, 2001). A description of these methods is beyond the scope of this summary (possible solutions include multilevel methods found in SAS PROC MIXED, MLwiN and HLM, though Keselman et al. also discuss a number of other options). If at all possible researchers should keep group sizes in mixed ANOVA equal or as close to equal as possible.
Bibliography
Field, A. (1998). A bluffer's guide to ... sphericity. The British Psychological Society: Mathematical, Statistical & Computing Section Newsletter, 6, 13-22.
Howell, D. C. (2002). Statistical methods for psychology (5th ed.). Belmont, CA: Duxbury Press.
Judd, C. M., McClelland, G. H., & Culhane, S. E. (1995). Data analysis: continuing issues in everyday analysis of psychological data. Annual Review of Psychology, 46, 433-465.
Keselman, H. J., Algina, J., & Kowalchuk, R. K. (2001). The analysis of repeated measures designs: a review. British Journal of Mathematical and Statistical Psychology, 54, 1-20.
Kirk, R. E. (1995). Experimental design: procedures for the behavioral sciences. (3rd ed.). Pacific Grove: Brooks/Cole.
Footnotes:
^{1} As long as the treatment only has the effect of adding to or subtracting from the group means (and doesn't influence their variances) the homogeneity of variance assumption isn't a problem. This special case is known as unit-treatment additivity. Unfortunately life isn't always that simple: there are good reasons why treatments might be expected to influence both means and variances. For this reason it is always sensible to check the group variances in independent measures designs.
^{2} As a rule of thumb the largest group variance should be no more than 3 or 4 times as large as the smallest group variance.
^{3} Covariance matrices also crop up in all sorts of other statistics, but we can forget about that for now.
^{4} The diagonals contain the variances because samples share all of their variance with themselves. I've used s rather than the Greek sigma symbol because it turns out better when browsers with different fonts are used. Strictly, s is normally used for samples and sigma for populations, but I'm using s interchangeably in this example. You can generate Greek letters by using the "symbol" font in many word processors (e.g., typing 's' gives sigma, 'm' gives mu, and so forth).
^{5} The covariance between groups will rarely be exactly zero in the samples. However, provided people are randomly assigned to groups, and each person contributes only one data point, then we can be pretty certain that the covariances in the populations being sampled are zero (and therefore the independence assumption is met). Even if random assignment to groups doesn't occur the independence assumption is often reasonable. Any time we know or believe that the measures will be correlated (e.g., in matched designs) a repeated measures analysis should be used.
^{6} I probably should have mentioned this earlier, but covariances (like correlations) can be both negative and positive (unlike variances, which are always positive). Positive covariances occur when two samples are positively correlated. Negative covariances occur when two samples are negatively correlated. The idea of a negative covariance is often tricky to grasp - but it just means that as one group tends to vary upwards in value the other tends to vary downwards. So when checking covariances to see if they are similar bear in mind the sign of the covariance as well as its magnitude (e.g., -124.3 is very different from +124.3).
^{7} By now you can probably appreciate why many textbooks focus on compound symmetry and don't cover sphericity in detail.
^{8} One of the nice things about this topic is that the tests have nice, proper statistical-sounding names.
^{9} The Greek letter epsilon is usually used. Greenhouse-Geisser estimates of epsilon have a little hat on top (^). Huynh-Feldt estimates have a little squiggle on top (~). You can generate Greek letters by using the "symbol" font in many word processors (e.g., 'e' for epsilon).
^{10} Later on we discuss the special case of k = 2 and the analogous case of paired t tests. Feel free to jump there now if you wish.
^{11} You won't find tables for fractional d.f., but exact p values can be calculated if d.f. are fractional (most good computer packages do this automatically these days).
^{12} Why don't I like them? Apart from all the above reasons, I don't like the idea of using a significance test to test the assumptions of a significance test. If that were a good idea, why don't we use significance tests to test the assumptions of Mauchly's sphericity test or Levene's test of homogeneity of variances? At some point you've got to look at the data (using graphical methods, descriptive statistics and so forth) and make a considered judgement about what procedures to use.