Answers and listener comments are here. Listeners are automatically made anonymous until I verify that it's okay to release their name or nickname.
Formal Analysis:
Friedman Non-Parametric Analysis (reference: Sensory Evaluation Techniques, 3rd Ed., Meilgaard, Civille, and Carr)
Goal: Determine if the null hypothesis (the mean ratings for all of the codecs are equal) is untenable. The alternative hypothesis is that the mean ratings of at least two of the codecs are different. If the value of the Friedman type statistic (T') exceeds a critical value, then the null hypothesis is rejected in favor of the alternative hypothesis. If the T'-statistic is significant, then multiple comparison procedures are applied to determine which codecs have significantly different average ratings.
Assumptions: The Friedman test does not assume normality of the distributions for the sample populations, as does a one-way blocked (or correlated) ANOVA, which would be the appropriate method otherwise. However, the Friedman test does assume that the populations have the same distribution, except for a possible difference in the population medians. Thus the Friedman test does not address the problem of inequality of variances. Also, as with the one-way blocked ANOVA (which assumes normal distribution for the sample populations), it is assumed that the measurement errors are identically distributed and independent of each other, and that there is no interaction between blocks (listeners) and treatments (codecs).
Raw data

You can perform the analysis described below using a web-based utility I have written. Copy and paste the following raw data into the data entry box (I eliminated listener 11's data, which was incomplete):
% Dogies raw data
MPC AAC WMA OGG LAME XING
5.0 2.5 1.5 1.5 1.5 1.0
3.5 2.7 1.8 2.5 2.2 1.0
3.2 3.0 2.7 2.6 2.0 1.0
3.0 4.1 0.9 1.5 3.0 2.0
4.0 4.5 1.0 2.0 3.0 1.0
2.5 4.0 3.0 3.0 2.0 1.0
4.0 3.0 3.0 3.0 2.0 2.0
4.0 5.0 2.0 4.0 3.0 1.0
5.0 5.0 4.0 3.5 3.0 1.5
4.7 3.7 4.5 3.8 4.0 3.5
5.0 4.8 5.0 4.2 4.0 4.0
5.0 5.0 4.0 4.5 5.0 3.8
5.0 5.0 5.0 4.5 5.0 3.0
5.0 5.0 5.0 5.0 4.0 5.0
5.0 5.0 5.0 5.0 5.0 5.0
The first step in a Friedman analysis is to convert the ratings into rankings. In the current test, rankings from 1 (lowest rating) to 6 (highest rating) are assigned within each listener's row. In the case of ties, the average of the tied ranks is assigned to each of the samples that could not be differentiated. For instance, in this six-sample test, if the middle two samples (normally of ranks 3 and 4) could not be differentiated, then both samples are assigned the average rank of 3.5. I eliminated listener 11 since it was missing the data for WMA, then converted the ratings into rankings as follows:
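The ranking step can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet or web utility mentioned here: the names `CODECS`, `RATINGS`, and `rank_row` are made up for the example, and it assumes that the lowest rating in a row receives rank 1.

```python
CODECS = ["MPC", "AAC", "WMA", "OGG", "LAME", "XING"]

# Ratings copied from the raw data: one row per listener, one column per codec.
RATINGS = [
    [5.0, 2.5, 1.5, 1.5, 1.5, 1.0],
    [3.5, 2.7, 1.8, 2.5, 2.2, 1.0],
    [3.2, 3.0, 2.7, 2.6, 2.0, 1.0],
    [3.0, 4.1, 0.9, 1.5, 3.0, 2.0],
    [4.0, 4.5, 1.0, 2.0, 3.0, 1.0],
    [2.5, 4.0, 3.0, 3.0, 2.0, 1.0],
    [4.0, 3.0, 3.0, 3.0, 2.0, 2.0],
    [4.0, 5.0, 2.0, 4.0, 3.0, 1.0],
    [5.0, 5.0, 4.0, 3.5, 3.0, 1.5],
    [4.7, 3.7, 4.5, 3.8, 4.0, 3.5],
    [5.0, 4.8, 5.0, 4.2, 4.0, 4.0],
    [5.0, 5.0, 4.0, 4.5, 5.0, 3.8],
    [5.0, 5.0, 5.0, 4.5, 5.0, 3.0],
    [5.0, 5.0, 5.0, 5.0, 4.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]


def rank_row(row):
    """Rank ratings within one row: 1 = lowest rating; ties share the average rank."""
    ranks = []
    for value in row:
        lower = sum(other < value for other in row)  # samples rated strictly lower
        tied = sum(other == value for other in row)  # samples with this rating (incl. itself)
        # The tied samples occupy ranks lower+1 .. lower+tied; assign their average.
        ranks.append(lower + (tied + 1) / 2)
    return ranks


RANKS = [rank_row(row) for row in RATINGS]
RANK_SUMS = [sum(column) for column in zip(*RANKS)]

for row_number, ranks in enumerate(RANKS, start=1):
    print(row_number, ranks)
print(dict(zip(CODECS, RANK_SUMS)))
```

Under that assumption the column sums of the ranks agree with the rank sums reported in the multiple comparison step (75.0 for MPC down to 24.0 for Xing), which suggests the direction of ranking matches the one used in the analysis.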

b = number of blocks (listeners) = 15
t = number of treatments (codecs) = 6
Each entry in the matrix above is labelled $x_{ij}$, where $i$ denotes rows (listeners), and $j$ denotes columns (codecs). Thus $x_{14}$, for example, is listener 1's ranking of Ogg.
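The test statistic, T', is given below (Hollander and Wolfe, 1973):

$$T' = \frac{12 \sum_{j=1}^{t} \left( x_{\cdot j} - G/t \right)^2}{b\,t\,(t+1) - \frac{1}{t-1} \sum_{i=1}^{b} \left[ \left( \sum_{j=1}^{g_i} t_{ij}^3 \right) - t \right]}$$

where $G = bt(t+1)/2$, $g_i$ is the number of tied groups in block $i$, and $t_{ij}$ is the number of samples in the $j$th tied group in block $i$. (Nontied samples are each counted as a separate group of size $t_{ij} = 1$.)
The "dot" in $x_{\cdot j}$ indicates that summing has been done over the index replaced by the dot, that is:

$$x_{\cdot j} = \sum_{i=1}^{b} x_{ij}$$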
I have performed these calculations in a zipped Microsoft Excel spreadsheet. I have also written a web-based utility to perform this type of analysis (and others). The test procedure is to reject the null hypothesis of no sample differences at the alpha level of significance (I used alpha = 0.05) if the value of T' exceeds chi-square, and not to reject it otherwise, where chi-square is the upper-alpha percentile of the chi-square distribution with t-1 degrees of freedom. The procedure assumes that a relatively large number of listeners participate in the study. It is reasonably accurate for studies involving 12 or more listeners.
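As a cross-check on the procedure just described, here is a minimal Python sketch of the tie-corrected statistic. It is not the spreadsheet or the web utility; the function names are invented for the example, ranks are assumed to run from 1 (lowest rating) upward, and the critical value 11.07 is the tabulated upper 5% point of chi-square with 5 degrees of freedom, hard-coded rather than computed.

```python
from collections import Counter

# Ratings copied from the raw data: rows are listeners, columns are codecs.
RATINGS = [
    [5.0, 2.5, 1.5, 1.5, 1.5, 1.0], [3.5, 2.7, 1.8, 2.5, 2.2, 1.0],
    [3.2, 3.0, 2.7, 2.6, 2.0, 1.0], [3.0, 4.1, 0.9, 1.5, 3.0, 2.0],
    [4.0, 4.5, 1.0, 2.0, 3.0, 1.0], [2.5, 4.0, 3.0, 3.0, 2.0, 1.0],
    [4.0, 3.0, 3.0, 3.0, 2.0, 2.0], [4.0, 5.0, 2.0, 4.0, 3.0, 1.0],
    [5.0, 5.0, 4.0, 3.5, 3.0, 1.5], [4.7, 3.7, 4.5, 3.8, 4.0, 3.5],
    [5.0, 4.8, 5.0, 4.2, 4.0, 4.0], [5.0, 5.0, 4.0, 4.5, 5.0, 3.8],
    [5.0, 5.0, 5.0, 4.5, 5.0, 3.0], [5.0, 5.0, 5.0, 5.0, 4.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]


def rank_row(row):
    """Average ranks within one row: (# rated lower) + (# tied + 1) / 2."""
    return [sum(v < x for v in row) + (sum(v == x for v in row) + 1) / 2 for x in row]


def friedman_t_prime(ranks):
    """Tie-corrected Friedman statistic T' for a b x t matrix of within-row ranks."""
    b, t = len(ranks), len(ranks[0])
    rank_sums = [sum(column) for column in zip(*ranks)]
    expected = b * (t + 1) / 2  # G/t, the rank sum expected under the null hypothesis
    numerator = 12 * sum((r - expected) ** 2 for r in rank_sums)
    # Tied samples share one rank value, so counting equal ranks gives the tie-group sizes.
    tie_term = sum(sum(n ** 3 for n in Counter(row).values()) - t for row in ranks)
    denominator = b * t * (t + 1) - tie_term / (t - 1)
    return numerator / denominator


CHI2_CRITICAL = 11.07  # upper 5% point, 5 degrees of freedom (tabulated value)
T_PRIME = friedman_t_prime([rank_row(row) for row in RATINGS])
print(round(T_PRIME, 1), "reject" if T_PRIME > CHI2_CRITICAL else "do not reject")
```

With the raw data given earlier, this evaluates to 21078 / 528, or about 39.9.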
Results: T' = 39.9, critical chi-square = 11.1 (alpha = 0.05, t-1 = 5 degrees of freedom), level of significance << 0.001 (highly significant)
In other words, the null hypothesis is rejected: the mean rankings of at least two of the codecs differ from each other with a confidence greater than 95%.
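To put a number on the level of significance, the upper-tail probability of a chi-square variable can be evaluated directly. The sketch below uses the closed form that holds for 5 degrees of freedom specifically (it is not a general chi-square routine), and `chi2_sf_5df` is a name made up for this example.

```python
import math


def chi2_sf_5df(x):
    """Upper-tail probability P(X > x) for chi-square with exactly 5 degrees of freedom."""
    root = math.sqrt(x)
    normal_part = math.erfc(root / math.sqrt(2))  # two-sided normal tail beyond sqrt(x)
    series_part = math.sqrt(2 / math.pi) * math.exp(-x / 2) * (root + root ** 3 / 3)
    return normal_part + series_part


# The tabulated 5% critical value should give a tail probability close to 0.05.
print(chi2_sf_5df(11.0705))
# Tail probability for the observed statistic, T' = 39.9.
print(chi2_sf_5df(39.9))
```

The second value is far below 0.001, in line with the reported result.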
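Since the T' statistic is significant, a multiple comparison procedure is performed to determine which of the samples differ significantly. The nonparametric analog to Fisher's Least Significant Difference (LSD) for rank sums from a randomized (complete) block design is:

$$\mathrm{LSD}_{\mathrm{rank}} = t_{\alpha/2,\infty} \sqrt{\frac{b\,t\,(t+1)}{6}}$$

where $t_{\alpha/2,\infty}$ is the upper-$\alpha/2$ critical value of Student's t distribution with infinite degrees of freedom (1.96 for alpha = 0.05), not to be confused with the number of treatments $t$.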
Two samples are declared to be significantly different at the alpha level if their rank sums differ by more than the value of $\mathrm{LSD}_{\mathrm{rank}}$, which for this test turned out to be 20.08 (alpha = 0.05, degrees of freedom = infinity). The rank sums of the different codecs are shown below. Codecs marked with the same group letter (for example mpc and aac) are not significantly different from each other.
| mpc = 75.0 (A) | aac = 71.5 (A) | wma8 = 50.5 (B) | ogg = 49.5 (B) | lame = 44.5 (B) | xing = 24.0 (C) |
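The comparison can be reproduced with a short Python sketch, starting from the rank sums in the table. The variable names are illustrative only, and the critical value for infinite degrees of freedom is taken from the standard normal distribution via the standard library's `statistics.NormalDist`.

```python
import math
from itertools import combinations
from statistics import NormalDist

# Rank sums as reported in the table.
RANK_SUMS = {"mpc": 75.0, "aac": 71.5, "wma8": 50.5, "ogg": 49.5, "lame": 44.5, "xing": 24.0}
B, T, ALPHA = 15, 6, 0.05  # listeners, codecs, significance level

# Student's t with infinite degrees of freedom is the standard normal distribution.
critical = NormalDist().inv_cdf(1 - ALPHA / 2)
LSD_RANK = critical * math.sqrt(B * T * (T + 1) / 6)

# Pairs whose rank sums differ by no more than the LSD are not significantly different.
NOT_DIFFERENT = [
    (a, b) for a, b in combinations(RANK_SUMS, 2)
    if abs(RANK_SUMS[a] - RANK_SUMS[b]) <= LSD_RANK
]

print(round(LSD_RANK, 2))
print(NOT_DIFFERENT)
```

Only four pairs fall within the LSD: mpc/aac, and the three pairings among wma8, ogg, and lame. Xing differs significantly from every other codec, although its gap to lame (20.5) clears the threshold only narrowly.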
Informal Analysis
The following graphical analysis is a qualitative look at the data, with the purpose of observing how the ratings are affected by listener sensitivity.
Listener ranking
In order to get a better idea of how subjective quality ratings varied with listener sensitivity, I sorted the listeners by the average of the ratings they assigned. In general, a sensitive listener would be expected to have a lower score than a less sensitive listener. This method of ranking listeners is not perfect, though. For example, each listener must define for himself what is "annoying." Two listeners of equal sensitivity might rate the same artifact quite differently depending on their personal definition of "annoying" or "very annoying." In the test design, I attempted to mitigate this problem by providing an "anchor" encoder which most listeners of at least medium sensitivity could agree is quite bad (Xing). This encoder could be expected to often earn a rating of 1 (very annoying), and could be used to put the other encoder defects into common perspective.
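The sorting step amounts to ordering the rows of the raw data by their mean rating. A minimal Python sketch follows; the row numbers refer to positions in the raw data listing, not to the original listener numbers, and `ORDER` is a name made up for the example.

```python
# Ratings copied from the raw data: rows are listeners, columns are codecs.
RATINGS = [
    [5.0, 2.5, 1.5, 1.5, 1.5, 1.0], [3.5, 2.7, 1.8, 2.5, 2.2, 1.0],
    [3.2, 3.0, 2.7, 2.6, 2.0, 1.0], [3.0, 4.1, 0.9, 1.5, 3.0, 2.0],
    [4.0, 4.5, 1.0, 2.0, 3.0, 1.0], [2.5, 4.0, 3.0, 3.0, 2.0, 1.0],
    [4.0, 3.0, 3.0, 3.0, 2.0, 2.0], [4.0, 5.0, 2.0, 4.0, 3.0, 1.0],
    [5.0, 5.0, 4.0, 3.5, 3.0, 1.5], [4.7, 3.7, 4.5, 3.8, 4.0, 3.5],
    [5.0, 4.8, 5.0, 4.2, 4.0, 4.0], [5.0, 5.0, 4.0, 4.5, 5.0, 3.8],
    [5.0, 5.0, 5.0, 4.5, 5.0, 3.0], [5.0, 5.0, 5.0, 5.0, 4.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]

# Mean rating per row; a lower mean is taken to indicate a more sensitive listener.
mean_ratings = [sum(row) / len(row) for row in RATINGS]

# 1-based row numbers ordered from most sensitive to least sensitive.
ORDER = sorted(range(1, len(RATINGS) + 1), key=lambda r: mean_ratings[r - 1])

for row_number in ORDER:
    print(row_number, round(mean_ratings[row_number - 1], 2))
```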
Note that the rating method here is quite different from the one that the MPEG group uses when it tests codecs, even though the actual scale is the same. In the MPEG tests, the listeners are all trained and keen of hearing, and the ratings scale is used in an "absolute" sense. That is, certain encoded samples are made available to train the listeners and to "calibrate" everybody's perceptions.
Fitting to error function
Next, I plotted each codec's rating versus listener (ordered by sensitivity). Ideally, I would expect that for a very large sample of listeners, the distribution of rating versus listener would take the shape of an error function. That is, insensitive listeners on one end of the spectrum would be expected to rate a codec near 5 and very sensitive listeners would be expected to rate a codec near 1. Such a curve has a characteristic flattened "s" shape to it. Using Microsoft Excel to change the shape of the error function, I fitted a curve to each codec's data by eyeballing it. Some data fitted much more nicely than others. For example, compare the fit of Xing to the fit of MPC. The procedure I'm using is a very crude "latent trait analysis." It should be taken for what it's worth though. For one thing, the small sample size limits its usefulness. Also, such an analysis assumes that both "listener sensitivity" and "codec quality" are things that vary only in quantity, not in character. But the extent to which this is true is not explored here. Suffice it to say that different listeners are sensitive to different things and their quality ratings will vary depending on individual preferences. But flawed though the method may be, I believe it still has some things to say.
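For readers who want to experiment with this kind of fit, here is a rough Python sketch. It is not the procedure used here: the curves in this analysis were fitted by eye in Excel, whereas the sketch substitutes a coarse least-squares grid search. The parametrization (a `midpoint` and a `width` on the listener axis) and all names are assumptions made for the example.

```python
import math

# Ratings copied from the raw data: rows are listeners, columns are codecs.
RATINGS = [
    [5.0, 2.5, 1.5, 1.5, 1.5, 1.0], [3.5, 2.7, 1.8, 2.5, 2.2, 1.0],
    [3.2, 3.0, 2.7, 2.6, 2.0, 1.0], [3.0, 4.1, 0.9, 1.5, 3.0, 2.0],
    [4.0, 4.5, 1.0, 2.0, 3.0, 1.0], [2.5, 4.0, 3.0, 3.0, 2.0, 1.0],
    [4.0, 3.0, 3.0, 3.0, 2.0, 2.0], [4.0, 5.0, 2.0, 4.0, 3.0, 1.0],
    [5.0, 5.0, 4.0, 3.5, 3.0, 1.5], [4.7, 3.7, 4.5, 3.8, 4.0, 3.5],
    [5.0, 4.8, 5.0, 4.2, 4.0, 4.0], [5.0, 5.0, 4.0, 4.5, 5.0, 3.8],
    [5.0, 5.0, 5.0, 4.5, 5.0, 3.0], [5.0, 5.0, 5.0, 5.0, 4.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]
XING_COLUMN = 5


def erf_curve(position, midpoint, width):
    """Flattened "s" curve rising from 1 to 5; equals 3 at position == midpoint."""
    return 1 + 2 * (1 + math.erf((position - midpoint) / (width * math.sqrt(2))))


def sum_squared_error(ratings, midpoint, width):
    """Squared misfit between the curve and ratings at listener positions 1, 2, 3, ..."""
    return sum((erf_curve(i, midpoint, width) - r) ** 2
               for i, r in enumerate(ratings, start=1))


# Order listeners from most to least sensitive (lowest to highest mean rating).
ordered = sorted(RATINGS, key=lambda row: sum(row) / len(row))
xing_ratings = [row[XING_COLUMN] for row in ordered]

# Coarse grid: midpoints 1.0 .. 15.0 and widths 0.5 .. 8.0, both in steps of 0.25.
candidates = [(1 + 0.25 * m, 0.5 + 0.25 * w) for m in range(57) for w in range(31)]
BEST_MIDPOINT, BEST_WIDTH = min(
    candidates, key=lambda c: sum_squared_error(xing_ratings, *c))
BEST_SSE = sum_squared_error(xing_ratings, BEST_MIDPOINT, BEST_WIDTH)

print(BEST_MIDPOINT, BEST_WIDTH, round(BEST_SSE, 2))
```

A smaller residual for one codec than another is one way to quantify the remark that some data fitted much more nicely than others, though with 15 listeners the fitted parameters should be read as loosely as the eyeballed ones.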
One thing to note is that all listener data is useful in this type of analysis, from the least sensitive to the most. Also, perhaps surprisingly, the relative ranking of codec quality does not necessarily have to remain the same as listener sensitivity varies.
Comparisons
After fitting each codec's data to an error function, I overlaid these curves to show how each codec might ideally be expected to perform versus listener sensitivity. Xing brings up the rear, as expected. MPC and AAC are far above this, also as expected. In the middle, bunched closely together, are WMA8, Lame and Ogg. It's hard to say with certainty if there is a clear winner in this group, although there's a suggestion that WMA8 is the least preferred of this bunch. As for AAC vs. MPC, there's an interesting intersection of the two curves and the suggestion that the most sensitive listeners prefer MPC over AAC, but that listeners who are slightly less sensitive prefer AAC. However, MPC's curve fit was not very good, so there is a good deal of uncertainty about this conclusion.
Other notes
Listener 11 did not rate the WMA sample.
One thing which stands out in the listener comments about Ogg RC2 is the added hiss. Assuming this is fixed in RC3 and that the other sound qualities remain the same or better, Ogg has the potential to break away from the middle pack. Further listening tests will test this assertion.
One shouldn't forget that this analysis applies only to the particular sample listened to. Other samples will certainly stress the various encoders in different ways, and listener rankings are sure to change because of this. A valid encoder comparison should use a variety of different samples.

[Figure: Fitting to error functions]

[Figure: Overall ranking based on fit to error functions]