An Amplifier Listening Test
Our design was intended to cope with the problem that some musical selections might be easier to categorize than others, or that some selections would bias people to guess "different" or "same." As mentioned, such differences among the selections could spuriously inflate or deflate the accuracy of our listeners (we have no way of knowing which), and with only eight selections such effects would be unlikely to average out.
We therefore assigned each musical selection to all four possible combinations of amps, so that each pair of excerpts was presented in four different ways (to different listeners, of course). Each listener heard all eight selections, but a given selection was "same" with the VTLs used twice for one group, "same" with two Adcoms for another, "different" with the order Adcom-VTL for a third group, and finally "different" with the order VTL-Adcom for the last group. Thus, there was a total of four groups, each with two listeners in it, in a very well-balanced and controlled design, with every piece of music presented in every condition.
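The counterbalancing described above can be sketched in a few lines of Python. This is only an illustration: the article does not give the actual assignment table, so the rotation rule used here (and the assumption that each group heard every condition exactly twice) is hypothetical, as are the names `CONDITIONS` and `build_design`.

```python
# Hypothetical sketch of a counterbalanced design: 8 selections, 4 amp
# conditions, 4 listener groups. The rotation rule is an assumption; the
# article does not publish the real assignment table.
CONDITIONS = ["VTL-VTL", "Adcom-Adcom", "Adcom-VTL", "VTL-Adcom"]
N_SELECTIONS = 8
N_GROUPS = 4


def build_design():
    """Return {group: [condition for each selection]} using a simple rotation."""
    return {
        group: [
            CONDITIONS[(selection + group) % len(CONDITIONS)]
            for selection in range(N_SELECTIONS)
        ]
        for group in range(N_GROUPS)
    }


design = build_design()
for group, conditions in design.items():
    print(f"group {group}: {conditions}")
```

With this rotation, every selection appears in all four conditions across the groups, and every group hears two "same" VTL trials, two "same" Adcom trials, and two "different" trials in each order.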
Under these conditions, the arithmetic mean accuracy was 75%, which is 25 percentage points better than the 50% expected by chance, and considerably better than the 1.4% over chance of the Stereophile study.
Performance for the eight listeners broke down as follows: One got them all correct, four got seven out of eight, one got six correct, one got four, and one got only two correct. A distribution like this is probably better represented by the median than the mean. The median accuracy is 84.4%. A third common measure of central tendency, the mode, is 87.5% correct.
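The three measures of central tendency can be checked directly from the scores listed above. In the sketch below, the plain median of the eight scores works out to 87.5%, while the interpolated median for grouped data works out to 84.4%. That the quoted 84.4% was computed this way is our inference, not something the article states.

```python
from statistics import mean, median, median_grouped, mode

# Number correct (out of 8) for each of the eight listeners, from the text.
scores = [8, 7, 7, 7, 7, 6, 4, 2]
N_TRIALS = 8


def as_percent(n_correct):
    """Convert a count of correct answers to percent correct."""
    return 100 * n_correct / N_TRIALS


mean_pct = as_percent(mean(scores))
plain_median_pct = as_percent(median(scores))
# Interpolated median, treating each score k as the interval [k - 0.5, k + 0.5).
grouped_median_pct = as_percent(median_grouped(scores, interval=1))
mode_pct = as_percent(mode(scores))

print(f"mean:            {mean_pct:.1f}%")
print(f"plain median:    {plain_median_pct:.1f}%")
print(f"grouped median:  {grouped_median_pct:.1f}%")
print(f"mode:            {mode_pct:.1f}%")
```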
Comparing this distribution to the appropriate binomial distribution (ie, the prediction based on chance) using the chi-square test, we get a chi-square of 90.14, which indicates that a totally random process would produce the observed distribution less than once in a thousand times. That is, if we repeated this study 1000 times with people who could not discriminate the amplifiers, we would expect results this extreme less than once. There is a caveat associated with the chi-square test, however: it is unstable and possibly untrustworthy with a very small sample, such as the one we used. An alternative test for small samples is the Kolmogorov-Smirnov test. Applied to these data, this test puts the probability of a chance result well below one in 100.
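As a rough check on the chi-square figure, the sketch below compares the observed score counts with the counts expected if all eight listeners were guessing at p = 0.5. The binning (one category per possible score, giving 8 degrees of freedom) is our assumption, since the article does not describe it. Under that assumption the statistic comes out near 90.74, close to but not identical to the quoted 90.14, so this should be read as an illustration and not a reproduction of the authors' calculation. The small-sample caveat in the text applies here too: several expected counts are far below 1.

```python
from math import comb, exp, factorial

N_TRIALS = 8      # selections per listener
N_LISTENERS = 8
# Observed number of listeners at each score, from the text.
observed = {8: 1, 7: 4, 6: 1, 4: 1, 2: 1}


def expected_counts(n_trials, n_listeners, p=0.5):
    """Expected number of listeners at each score if everyone guesses."""
    return [
        n_listeners * comb(n_trials, k) * p**k * (1 - p) ** (n_trials - k)
        for k in range(n_trials + 1)
    ]


def chi_square(observed_counts, expected):
    """Pearson chi-square over one bin per possible score (assumed binning)."""
    return sum(
        (observed_counts.get(k, 0) - e) ** 2 / e for k, e in enumerate(expected)
    )


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable; valid for even df only."""
    half = x / 2
    return exp(-half) * sum(half**i / factorial(i) for i in range(df // 2))


expected = expected_counts(N_TRIALS, N_LISTENERS)
chi2 = chi_square(observed, expected)
p_value = chi2_sf_even_df(chi2, df=N_TRIALS)  # 9 bins - 1 = 8 degrees of freedom

print(f"chi-square = {chi2:.2f}, approximate p = {p_value:.2e}")
```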
What accounts for the difference between our study and Stereophile's? There were two clearly audible differences between the amplifiers in our system that may have been reduced in the Stereophile test. The most prominent difference was in the highs. Sometimes, especially with cymbals and brushes, the Adcom's highs sounded ragged compared to the VTLs', while on some material they simply sounded a bit louder. On most but not all classical music this difference made the VTLs sound more natural, but on the popular selections it sometimes gave the Adcom a little more excitement. The other difference, more subtle, was in imaging. The VTLs had a slightly deeper image and tended to define individual sound sources better than the Adcom. Thus, for example, with the VTLs one had the sense that individual voices in a chorus could be separately attended and placed in space, even counted, if one had the patience. The Adcom, while extremely clear and detailed, gave this sense less often.
These differences cause us to speculate that an important reason for the low identification accuracy in the Stereophile study was the crowded conditions in the listening room. Informal reports from those who participated in the study suggest that highs may have been muffled by people and couches placed very close to and in line with the tweeters. For those farther back in the room, attenuation of highs by all the bodies and tweed jackets in the way must have been severe. If the highs were significantly attenuated, important differences between the amplifiers could have been reduced to inaudibility.