Blind Listening Letters part 3

Listening tests & discerning listeners
Editor: I think JA has shed some useful light on the continuing and vexing blind test debate, but I don't quite accept his hypothesis, as stated, that a blind listening test can conceal subjective differences. How then to explain the ability of participants with golden ears to hear those differences with very few errors, even in such tests?

Two things seem clear: that the ability to discern these differences is a learned skill, and that blind testing, as described, impedes that skill. Those of us whose work demands the evaluation of high-end products know that many audiophiles, even those outside the trade, clearly hear differences among amplifiers through extended listening, as in home evaluations, and generally agree in their descriptions of those differences, given the looseness of our language in these matters.

Why should this be surprising, or dismissed by those who don't hear them? How many artists or art dealers can tell, visually, the Rembrandt from the fraud? Who, but a violinist who has lived with one, can tell the Strad from the good copy? It takes time and practice to make fine distinctions.

Nor should the fallacy of the A-B test be surprising. Perhaps the most basic function of the brain is to seek and sense differences. (Did something just move in the leaves? Does that clamor of birds suggest the presence of a predator, or just a nesting dispute?) The second time we read a novel, see a movie, hear a piece of music, listen to a new record, the experience is clearly different from the first time.

I think the most significant aspect of his data, as JA suggests, lies in the vastly greater error in indicating a difference where none existed than in failing to mark a difference when there may have been one. His instinct was right in offering two warm-up trials before going on with the test, but I think that's not good enough.

Let me suggest an experimental format for further exploration of this question: Choose as test items a variety of two-minute musical excerpts, chosen for some aspect of their clarity or detail. Tell the participant that each item will be played three times without change, with 10-second pauses between repeats, and then, after the third pause (allowing enough time to switch amplifier outputs), the fourth play may be the same or different. I expect that while there will still be a tendency to err toward false differences, the results will have generally greater significance, and should provide, at least, a basis for selecting a panel of listeners with sufficient aural acuity to move on to the sticky business of making qualitative judgments between amplifiers.—Jerry Landis, Berkeley, CA
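The proposed format can be sketched as a randomized trial schedule. This is only an illustration: the amplifier labels, the `make_schedule` helper, the excerpt names, and the 50% chance of a switch on the fourth play are all assumptions, not part of the letter.

```python
import random

AMPS = ("A", "B")  # placeholder labels for the two amplifiers under test

def make_schedule(excerpts, seed=None):
    """Build one trial per excerpt: three identical plays, then a fourth
    play that is either the same amplifier or the other one."""
    rng = random.Random(seed)
    schedule = []
    for excerpt in excerpts:
        reference = rng.choice(AMPS)
        other = AMPS[1] if reference == AMPS[0] else AMPS[0]
        changed = rng.random() < 0.5  # assumed 50/50 split of same vs different
        fourth = other if changed else reference
        schedule.append({
            "excerpt": excerpt,
            "plays": [reference] * 3 + [fourth],
            "changed": changed,
        })
    return schedule

# Hypothetical excerpt names, fixed seed so the schedule can be reproduced.
schedule = make_schedule(["excerpt 1", "excerpt 2", "excerpt 3", "excerpt 4"], seed=1)
for trial in schedule:
    answer = "different" if trial["changed"] else "same"
    print(trial["excerpt"], trial["plays"], answer)
```

Keeping the schedule in a list makes it easy to score each participant's "same" or "different" answer against the `changed` flag afterward.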

Listening tests & variable listeners
Editor: I would like to add a comment to the conclusions drawn on your recent amplifier comparisons as reported in "Blind Listening" published in the July Stereophile. I have read several reports on blind amplifier comparisons and have observed that the issue of the relative performance of different test subjects is often treated in a tentative or delicate way. For example, in your recent article, you comment (with apparent surprise) that over half of the test subjects did not do well, even though all are keen audiophiles. You offer poor listening conditions and lack of experience with concentrated listening as possible reasons for this result. While I agree that conditions and training will affect the scores of an individual, it is my experience that sensitivity to sound quality varies widely with individuals. This sensitivity does not seem to be learned. Some non-audiophiles that I know have it, and some audiophiles do not. Involvement with the field of audio does not seem to be a predictive factor for this.

If I am correct that sensitivity to sound quality varies widely with individuals and is based primarily on innate talent rather than experience, then why is this situation not more widely recognized? Are we concerned that, to be valid, our field of high-end audio has to be appreciated by the population at large; that, to be worth achieving, the differences in the sound of amplifiers must be acknowledged in a democratic fashion? Are we perhaps reacting to the charge that to hear a difference you would need "golden ears," implying an elitist or status-seeking stance? Maybe the next time a friend questions our interest in a high-performance component and says, "Gee, I'm sure I wouldn't be able to hear any difference," and we are about to respond in the usual way with: "Oh no, the differences are significant, anyone can appreciate them," we should answer instead: "Well, not everybody does hear the difference."

Thank you for pursuing research/educational topics in audio. I attended JA's demonstration on recording techniques at Stereophile's 1987 New York show and found it very informative.—Dean Fuller, Waltham, MA

Listening tests & audible differences
Editor: Thank you for trying blind amplifier testing. I fear you have let the statistics unduly influence your conclusions; 52.3% vs 50% is a small but statistically significant difference. It does not follow therefore that there were only slight audible differences. The differences could have been profound but only recognized by a few listeners.
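The dilution argument can be checked with simple arithmetic. In the sketch below only the 52.3% and 50% figures come from the letter; the `required_accuracy` helper and the listener fractions are hypothetical, and it assumes everyone outside the skilled group scores exactly at chance.

```python
def required_accuracy(overall, fraction_skilled, chance=0.5):
    """Accuracy the skilled listeners would need to produce the overall
    score if all other listeners score at chance."""
    return chance + (overall - chance) / fraction_skilled

# Hypothetical shares of listeners who genuinely hear the difference.
for fraction in (1.0, 0.25, 0.10, 0.05):
    accuracy = required_accuracy(0.523, fraction)
    print(f"{fraction:.0%} skilled -> {accuracy:.1%} correct among them")
```

Under these assumptions, if only a tenth of the listeners heard the difference, they would need to be right about 73% of the time to lift the pooled score to 52.3%.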

Let me suggest a two-stage approach to analyzing this data. Consider some of the selections as subject qualification trials. Use them to identify qualified subjects. Then analyze the performance of the qualified subjects on the remaining selections. My guess is that you will be able to show strong audible differences in a sizable subset of your subjects. These differences appeared small only when diluted by the larger pool of subjects. Furthermore, I will argue that this methodology is sound because, under the null hypothesis, performance across trials should be independent. (If you do reanalyze the data, please use an exact test of its significance.)—Harry Lewin, Bronx, New York
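A two-stage analysis of this kind might look like the following sketch. The subject data, the qualification threshold, and the helper names are invented for illustration, and the exact test assumes a 50% chance of a correct answer under the null hypothesis, which may not hold if listeners favor one response.

```python
from math import comb

def binom_tail(k, n, p=0.5):
    """Exact probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def two_stage(results, n_qualify, min_correct):
    """Use the first n_qualify trials of each subject as a qualification
    round, then pool the remaining trials of the qualified subjects."""
    qualified = [s for s, r in results.items() if sum(r[:n_qualify]) >= min_correct]
    hits = sum(sum(results[s][n_qualify:]) for s in qualified)
    trials = sum(len(results[s][n_qualify:]) for s in qualified)
    p_value = binom_tail(hits, trials) if trials else 1.0
    return qualified, hits, trials, p_value

# Hypothetical per-subject results in trial order (True = correct answer).
results = {
    "s1": [True, True, True, True, True, True, False, True],
    "s2": [True, False, False, True, False, True, False, False],
    "s3": [True, True, True, False, True, True, True, True],
}

qualified, hits, trials, p_value = two_stage(results, n_qualify=3, min_correct=3)
print(qualified, hits, trials, round(p_value, 4))
```

Because the qualification trials are excluded from the second stage, a subject who qualified by luck gains no advantage on the remaining trials, which is the independence the letter relies on.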

Listening tests & unconfident listeners
Editor: One of a deluge of letters you're probably getting about the single-blind amplifier comparison at the 1989 Stereophile High End Show:

1) Since 30 of the 56 presentations involved different amps, isn't it true that if all subjects responded "different" to all presentations, then the rate of correct identifications would be 53.6% (ie, 100% x 30/56), suggesting erroneously that differences were indeed heard? If audiophiles do tend to say "different" more frequently than "same," then this would influence your test results in a similar—though less extreme—manner. Or am I misunderstanding the conclusion you seem to draw in the footnote on p.17 of the July issue?

2) If the aforementioned audiophile tendency accounted for the difference between the success subjects had with A-B or B-A comparisons and with A-A or B-B comparisons, then perhaps one could conclude that, when faced with the HFN/RR drum track, audiophiles lose confidence and begin to guess "same" and "different" with equal frequency.

3) Is it possible that subjects must identify in their minds the identity of the amp during each presentation in order to be able to compare relevant performance parameters, in contradiction to your suggestion on p.8 that the "same"/"different" choice doesn't require this sort of identification? If so, then you could make the blind test easier for the subjects as follows: First describe the nature of the subjective differences existing between the two power amps to the subjects, so they know what to listen for. Second, for each paired presentation, inform the subjects of the identity of the amp playing during the first presentation. Thus subjects need only guess whether the second (unidentified) presentation is the same as the identified amp, or is the other amp. This is still a valid blind test.

4) Next time, run the high-scoring subjects through the test a second time to see whether they are simply "lucky coins" or truly skilled listeners.—Ralph Gonzales, Wilmington, DE
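Points 1 and 4 above lend themselves to a quick calculation. In this sketch the 30 and 56 come from the letter; the session length of 7 trials, the threshold of 6 correct, and the pool of 100 subjects are hypothetical numbers chosen only to show how "lucky coins" arise, and pure guessers are assumed to be right half the time.

```python
from math import comb

# Point 1: a subject who always answers "different" is right whenever
# the two presentations really used different amplifiers.
different_presentations = 30
total_presentations = 56
always_different_rate = different_presentations / total_presentations
print(f"always 'different': {always_different_rate:.1%} correct")

def p_at_least(k, n, p=0.5):
    """Exact probability of k or more correct answers in n pure guesses."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Point 4, with hypothetical numbers: 7 trials per subject, 6 or more
# correct counts as a high score, 100 subjects all guessing.
p_one = p_at_least(6, 7)              # one guesser scores high
p_any = 1 - (1 - p_one) ** 100        # at least one of 100 guessers scores high
p_twice = p_one ** 2                  # the same guesser scores high again on a retest

print(f"one guesser scores high: {p_one:.4f}")
print(f"at least one of 100 does: {p_any:.4f}")
print(f"same guesser repeats it: {p_twice:.6f}")
```

With these assumed numbers a high scorer is almost certain to turn up among pure guessers, while repeating the feat on a second run is far less likely, which is why a retest separates luck from skill.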

Listening tests & biased listeners
Editor: Congratulations on your recent amplifier listening test. This was, by far, the best audiophile-related blind test I know. The "forced choice" of a response of "same" or "different"—regardless of "better," "sweeter," or other adjectives—prevents many potential issues from clouding the results. The sheer number of samples makes the results statistically compelling. There was obviously a great deal of care and forethought in setting up the trials.

Thank you for publishing so much of your data—unlike the blind listening tests of, say, Stereo Review, having all of the data allows us to gaze at the numbers and think. Thinking about these numbers, I was startled to realize that Table 2 on p.15 implies a false-positive rate of about 62%. A false-positive is when the subject reports hearing a difference when the trial consists of two listenings to one amplifier. I had assumed that the same music played through the same amplifier would sound the same and that the subjects would report "different" only if they heard a difference, hence I expected a false-positive rate of zero.

An explanation for the high false-positive rate is subject-bias. Indeed, JA quoted a listener as saying "You have to care about whether there is a difference or not." Caring may be reflected in biased responses. The effect of listener bias on the results is dismissed in footnote 9 by what I consider handwaving, and footnote 8 describes a Chi-squared test that supports the audibility of differences in amplifiers with respect to guessing "different" 50% of the time. If the subjects are guessing "different" more than 50% of the time, and by chance, more than half of the trials use two amplifiers, then the expected results would show more than 50% of the responses are correct. For example, if subjects guess "different" 90% of the time and "same" 10%, and the trials have two amplifiers 90% of the time, the expected true-positive rate is 90% x 90% = 81%, and the expected true negative rate is 10% x 10% = 1%, for a total expected correct-response rate of 82%. Thus, a correct-response rate of more than 50% does not necessarily show audible differences—the success rate must be significantly more than the expected success rate due to (possibly biased) guessing. According to Table 2, there were 1134 + 758 = 1892 trials with different amplifiers, and 823 + 815 = 1638 trials with the same amplifier; thus, about 54% of the trials involved different amps. Thus, the question is: Are the results explained by biased guessing or is the success-rate significantly greater than would be expected by biased guessing?

The best estimate we have for listener bias is the false-positive rate: it should be zero, but Table 2 shows it to be 62% (that is, the number of incorrect A-A responses and B-B responses divided by the total number of "one amp" trials). Repeating the Chi-squared test mentioned in footnote 8 with a probability of guessing "different" of 62% (rather than 50%) and a probability of guessing "same" of 38% (rather than 50%) yields a Chi-squared statistic of 3.16, which is not significant at the 95% confidence level; that is, the results are most likely due to biased guessing.
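The biased-guessing arithmetic above can be sketched as follows. This is a minimal illustration that assumes each guess is independent of what was actually played; `expected_correct` is a hypothetical helper, and the 90%, 62%, and trial counts are the figures quoted in this letter.

```python
def expected_correct(p_say_different, frac_two_amp):
    """Expected share of correct answers when a subject answers "different"
    with probability p_say_different regardless of what was played."""
    true_positive = p_say_different * frac_two_amp
    true_negative = (1 - p_say_different) * (1 - frac_two_amp)
    return true_positive + true_negative

# The letter's illustration: 90% "different" guesses, 90% two-amp trials.
example = expected_correct(0.90, 0.90)

# Trial mix as quoted from Table 2.
two_amp_trials = 1134 + 758
one_amp_trials = 823 + 815
frac_two_amp = two_amp_trials / (two_amp_trials + one_amp_trials)

# Guessing "different" 62% of the time on that trial mix.
biased = expected_correct(0.62, frac_two_amp)

print(f"illustration: {example:.0%} correct")
print(f"two-amp share of trials: {frac_two_amp:.1%}")
print(f"expected correct with 62% bias: {biased:.1%}")
```

The last figure is the baseline that the observed success rate would have to exceed significantly, rather than 50%.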

The wealth of published data allows a deeper analysis: as there are two variables with two values each (a "different" or "same" response, and "one amp" or "two amp" trial), a Chi-squared test with three degrees of freedom can be computed. If the results are due to biased guessing, the probability of a true-positive is the probability of guessing "different" when the trial has two amps: 62% x 54% = 34%, the probability of a false-negative is (100%-62%) x 54% = 20%, true-negative = (100%-62%) x 46% = 17%, and false-positive = 62% x 46% = 29%. The Chi-squared statistic for these probabilities and the responses from the tests is 2.78, again below the 95% critical value (7.81 for three degrees of freedom).
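The four-cell test can be sketched in a few lines. The cell labels follow the letter's own definition of a false positive (a "different" response on a one-amp trial). The observed counts below are invented, since the per-cell counts from Table 2 are not reproduced here, and the probabilities use the rounded 62% and 54%, so they can differ from the letter's figures by about a percentage point.

```python
def cell_probabilities(p_say_different, frac_two_amp):
    """Probability of each outcome if responses are independent of the trial type."""
    return {
        "true_positive": p_say_different * frac_two_amp,
        "false_negative": (1 - p_say_different) * frac_two_amp,
        "true_negative": (1 - p_say_different) * (1 - frac_two_amp),
        "false_positive": p_say_different * (1 - frac_two_amp),
    }

def chi_squared(observed, probs):
    """Pearson statistic comparing observed counts with expected counts n * p."""
    n = sum(observed.values())
    return sum((observed[c] - n * probs[c]) ** 2 / (n * probs[c]) for c in probs)

CRITICAL_95_DF3 = 7.815  # 95% critical value of Chi-squared, three degrees of freedom

probs = cell_probabilities(0.62, 0.54)

# Hypothetical counts for 1000 trials, NOT the Stereophile data.
observed = {
    "true_positive": 350,
    "false_negative": 190,
    "true_negative": 180,
    "false_positive": 280,
}

statistic = chi_squared(observed, probs)
print(f"Chi-squared = {statistic:.2f}")
print("consistent with biased guessing" if statistic < CRITICAL_95_DF3
      else "differs from biased guessing")
```

A statistic below the critical value means the counts cannot be distinguished from biased guessing at that level; it does not by itself show that no listener heard a difference.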

In short, the data from the Stereophile blind listening test are most likely to be due to biased guessing rather than audible differences between amplifiers.

I don't necessarily like this conclusion—I have my biases, including a belief that some people can hear better than others. Such exceptional individuals may not be readily obvious from tests with a small number of trials per person, but such tests can be used to find people who are either able to distinguish between the sound of amplifiers or are "lucky coins." As such, I am disappointed by Michael Fremer's reluctance to continue with certain double-blind listening tests: we can conclude that either he is lucky or he can hear better than most people, and a few dozen additional blind trials would (probably) make clear which.—Kevin Willoughby, Framingham, MA