Blind Listening Letters part 5

Listening tests: a thorough analysis
Editor: I have read with great interest your article, "Blind Listening," in the July 1989 Stereophile. In view of the importance of the subject, please allow me extended comments on methodology and analysis. Hopefully I am beating the drums neither for subjectivists nor objectivists, but for understanding of the data you present.

For brevity, let D represent "different," and S, "same." D trials refer to trials where the amps were different; D responses, to "different" answers. Similarly for S trials and S responses.

1) Your tests biased the correct-response rate upward, due to having more D trials than S trials, namely 1892 to 1638, or proportions of 0.5360 to 0.4640.

It is to be expected for two reasons that subjects in a state of uncertainty on a given trial would be more likely to answer D than S: a) The EC (emperor's clothes) syndrome may cause them to avow something they haven't experienced. Having been told that there are important differences between amps (in price, architecture, etc.), they expect differences and thus are more inclined to respond D than S in a state of uncertainty. b) Subjects, particularly audiophiles, tend to have "pride of hearing" that leads them to believe they hear differences, however subtle; they would rather be wrong in answering D than in answering S. For whatever reason, the inclination to answer D is indicated by the fact that 0.6312 of the responses were D, namely 2228/3530.

When the 0.6312 D response rate is coupled with the 0.5360 proportion of D trials, the expected proportion of correct responses attributable to chance is raised above 0.5000. In your study the expected proportion becomes 0.5360 x 0.6312 + 0.4640 x 0.3688 = 0.5094. What is happening in principle can be appreciated by using an extreme example: If subjects always answer D, and if all trials are D, the expected correct-response rate becomes 1.0000. Similarly, if there is a bias toward D responses coupled with a bias toward D trials, the correct-response rate is biased upward.

Your correct-response rate is 0.5229 (namely, 1846/3530). Seemingly, this is 2.29% above the rate that chance would tend to produce. But, as we have seen, what we may call D-bias produces an expected rate of 0.5094 due to chance. Therefore, your correct-response rate is only 1.35% above what chance would produce (namely, 0.5229-0.5094).

Note that the inclination to D responses would have no effect on the expected correct-response rate if D trials and S trials were equal in number.
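The arithmetic of Point 1 can be rechecked from the counts quoted above. The Python sketch below is illustrative only: the function and variable names are assumptions introduced here, and it treats responses as independent of trial type, which is what "chance" means in this argument.

```python
# Counts quoted in the letter.
D_TRIALS = 1892
S_TRIALS = 1638
D_RESPONSES = 2228
TOTAL = D_TRIALS + S_TRIALS  # 3530 trials in all


def chance_correct_rate(p_d_trial, p_d_response):
    """Expected proportion correct when responses ignore the trial type.

    A response is correct when it is D on a D trial or S on an S trial.
    """
    return p_d_trial * p_d_response + (1 - p_d_trial) * (1 - p_d_response)


p_d_trial = D_TRIALS / TOTAL
p_d_response = D_RESPONSES / TOTAL

# Chance rate with the actual, unbalanced mix of D and S trials.
biased = chance_correct_rate(p_d_trial, p_d_response)

# Chance rate if D and S trials had been equal in number.
balanced = chance_correct_rate(0.5, p_d_response)

print(f"D trials: {p_d_trial:.4f}, D responses: {p_d_response:.4f}")
print(f"chance rate, actual mix:   {biased:.4f}")
print(f"chance rate, balanced mix: {balanced:.4f}")
```

With a balanced mix the response bias cancels out and the chance rate returns to 0.5, which is the point made in the preceding paragraph.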

2) Allowing for D-bias and using the binomial distribution as the correct probability model (rather than the normal distribution as an approximation of the binomial), what is the significance level of a 0.5229 correct-response rate? In other words, what is the probability of obtaining a rate as high as 0.5229 when chance (allowing for D-bias) would produce a 0.5094 rate? Now the significance level is 0.0555 instead of the former 0.0034 (you gave the significance level as "just over 0.001," based on the normal approximation). If one employs the conventional 0.05 criterion of significance, one could conclude that the correct-response rate of your study is not significant, particularly if one is an objectivist.

I don't buy that. It is better simply to state that the significance level is 0.0555 and let the reader make his own judgment as to significance of results (possibly the reader feels that 0.10 is a sufficient criterion of significance). Researcher and reader should recognize that the significance level is appreciably higher than first thought, but still quite low; and that the risk of Type 1 error (concluding that subjects can differentiate between amps when they really can't) is still a comfortably low 0.0555.
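For readers who wish to reproduce the significance levels in Point 2, a minimal sketch of an exact binomial upper-tail calculation follows. It is an assumption that this matches the calculation behind the quoted figures; the helper names are hypothetical, and the sum is done in log space because the individual terms underflow ordinary floating point when n is 3530.

```python
import math


def log_binom_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), via log-gamma."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)


def upper_tail(k_min, n, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(math.exp(log_binom_pmf(k, n, p)) for k in range(k_min, n + 1))


N_TRIALS = 3530   # total responses
N_CORRECT = 1846  # correct responses (rate 0.5229)

# Chance taken as 0.5000 (no allowance for D-bias).
tail_unbiased = upper_tail(N_CORRECT, N_TRIALS, 0.5000)

# Chance taken as 0.5094 (allowing for D-bias).
tail_d_bias = upper_tail(N_CORRECT, N_TRIALS, 0.5094)

print(f"significance, chance = 0.5000: {tail_unbiased:.4f}")
print(f"significance, chance = 0.5094: {tail_d_bias:.4f}")
```

The results should land close to the 0.0034 and 0.0555 quoted above; the last digit depends on how the chance rate is rounded before it is used.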

3) The D-bias readily explains why subjects have a higher correct-response rate on D trials than on S trials; that is, 0.6438 to 0.3834 (namely, 1218/1892 to 628/1638). If subjects are disinclined to answer S, to that extent they are unlikely to have correct responses on S trials. To illustrate by an extreme example, if subjects never answer S, they will have a zero correct-response rate on S trials.

4) The article may give the illusion to some readers that subjects can differentiate between amps 0.5229 of the time. However, one must correct for the fact that chance tends to produce a correct-response rate of 0.5000 (in the absence of D-bias). To exclude the effect of chance, a well-known formula is (footnote 2):

ph = (p-n)/(1-n) when p > n; otherwise ph = 0

(ph is the correct-response rate based solely on hearing and not on chance; p is the actual correct-response rate; n is the rate attributable to chance.)

Therefore ph = (0.5229-0.5000)/(1-0.5000) = 0.0458. Thus we estimate that the subjects in your study can differentiate between amps about 4.6% of the time.

Repeating the procedure but this time using 0.5094 as the expected correct-response rate, allowing for D-bias, we obtain ph = (0.5229-0.5094)/(1-0.5094) = 0.0275.

Altogether, whether we do or do not allow for D-bias, the estimated correct-response rate based on hearing alone is unimpressive—either 2.75% or 4.58%. What the statistician calls "effect size" is quite small, possibly trivial in the view of some persons.
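The correction used in Point 4 is simple enough to express in a few lines. The sketch below is an illustration only; the function name is an assumption, and the inputs are the rounded rates quoted in the text.

```python
def hearing_rate(p, n):
    """Correct-response rate attributable to hearing rather than chance.

    p is the observed correct-response rate and n is the rate expected
    from chance. The result is floored at zero when p does not exceed n.
    """
    if not 0 <= n < 1:
        raise ValueError("chance rate n must satisfy 0 <= n < 1")
    return (p - n) / (1 - n) if p > n else 0.0


OBSERVED = 0.5229

no_bias = hearing_rate(OBSERVED, 0.5000)    # chance without D-bias
with_bias = hearing_rate(OBSERVED, 0.5094)  # chance allowing for D-bias

print(f"without D-bias: {no_bias:.4f}")
print(f"with D-bias:    {with_bias:.4f}")
```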

5) Your study shows that 0.2733 of the subjects had above-average correct answers, namely 5, 6, or 7. The binomial distribution provides an expected proportion of 0.2266. The significance level of this result—probability of getting a proportion of 0.2733 when we expect 0.2266 due to chance—is a low 0.0080. Therefore it appears realistic to conclude there are some persons who can truly differentiate between amps (KEOs or Keen-Eared Observers).

We may estimate the proportion of subjects who can truly differentiate between amps by employing the principle of the formula given in Point 4: (0.2733-0.2266)/(1-0.2266) = 0.0604. That is, we estimate about 6% of the subjects are KEOs.

However, this estimate does not allow for D-bias, which raises the proportion of above-average subjects due to chance. Let us assume that this chance proportion bears the same ratio to 0.2266 as does the chance rate of 0.5094 to 0.5000 for all responses. Therefore 0.2266 becomes 0.2309 (namely, 0.5094/.5000 x 0.2266). Accordingly, we obtain (0.2733-0.2309)/(1-0.2309) = 0.0551. Thus we have a not-too-rough estimate that about 5.5% of the subjects are KEOs.
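The figures in Point 5 can be retraced as follows. This is a sketch under the assumptions stated in the text: seven trials per subject, "above average" meaning 5, 6, or 7 correct, and the same proportional D-bias adjustment. The variable names are hypothetical. Because the letter rounds intermediate values to four places and the code does not, the last digit can differ slightly.

```python
from math import comb


def upper_tail_fair(k_min, n):
    """P(X >= k_min) for X ~ Binomial(n, 0.5), computed exactly."""
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n


def correct_for_chance(p, chance):
    """Same correction as in Point 4, floored at zero."""
    return (p - chance) / (1 - chance) if p > chance else 0.0


OBSERVED_ABOVE = 0.2733  # proportion of subjects scoring 5, 6, or 7

# Chance proportion of subjects scoring 5 or more out of 7: 29/128.
chance_above = upper_tail_fair(5, 7)
keo_plain = correct_for_chance(OBSERVED_ABOVE, chance_above)

# Scale the chance proportion by 0.5094/0.5000 to allow for D-bias,
# as the letter assumes.
chance_adjusted = 0.5094 / 0.5000 * chance_above
keo_adjusted = correct_for_chance(OBSERVED_ABOVE, chance_adjusted)

print(f"chance proportion:          {chance_above:.4f}")
print(f"KEO estimate, no D-bias:    {keo_plain:.4f}")
print(f"adjusted chance proportion: {chance_adjusted:.4f}")
print(f"KEO estimate, with D-bias:  {keo_adjusted:.4f}")
```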

6) Point 5 gives us a handle on an interesting figure: the percentage of audiophiles who can truly hear differences. This may provide a clue as to the potential market for high-end audio electronics. In view of the possible importance of the 0.0551 point estimate, it may be desirable to obtain an interval estimate as well. At the frequently employed 95% confidence level, the lower and upper confidence limits are respectively 0.0052 and 0.1085 (footnote 3).

In sum, we have a point estimate of 5.51% for the KEO percentage, and an interval estimate of 0.52% to 10.85% at the 95% confidence level.
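The procedure behind these limits is the one cited in footnote 3 and is not reproduced here. As a rough sanity check only, the sketch below uses a different and simpler technique: a normal-approximation (Wald) interval on the observed proportion of above-average subjects, with each limit then passed through the chance correction of Point 4. It assumes 505 subjects, the figure given in the editor's reply, and the 0.2309 chance proportion from Point 5; all names are hypothetical.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a proportion (z=1.96 for 95%)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def correct_for_chance(p, chance):
    """Chance correction from Point 4, floored at zero."""
    return max(0.0, (p - chance) / (1 - chance))


N_SUBJECTS = 505         # assumption: participant count from the editor's reply
OBSERVED_ABOVE = 0.2733  # proportion scoring 5, 6, or 7
CHANCE_ABOVE = 0.2309    # chance proportion allowing for D-bias

low_obs, high_obs = wald_interval(OBSERVED_ABOVE, N_SUBJECTS)
keo_low = correct_for_chance(low_obs, CHANCE_ABOVE)
keo_high = correct_for_chance(high_obs, CHANCE_ABOVE)

print(f"approximate 95% limits for KEO proportion: {keo_low:.4f} to {keo_high:.4f}")
```

This simpler method gives limits of the same order as those quoted, though not identical to them, which is to be expected since it is not the same procedure.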

7) I tend to agree with the person who objected that equating of amplifier levels should be on the basis of wide-band noise rather than on the basis of a 1kHz signal. Imagine that one amplifier operating into the chosen speaker has a perfectly linear response, while the other has a pronounced inverted-U response, say 3dB down at 50Hz and 10kHz. If their levels are made equal at 1kHz, the total output would be lower for the second amplifier than for the first. Quite likely a fair number of subjects would hear the difference in total output for much program material. If, more realistically, the difference in level is more in line with your Fig.4, some very keen-eared listeners may still hear a difference in total output.—Prof. Herman Burstein, Wantaugh, NY

My thanks both to those correspondents offering comments and to those who took the time to provide further statistical analysis of the data collected at the April listening tests. Indeed, I was both surprised and pleased that Professor Burstein took the time to carry out such a detailed investigation. Regarding the results, we only realized that there was an imbalance between the number of "Same" and "Different" presentations after the weekend's listening was over. I was wrong in assuming that this wouldn't affect the results. Such is life! In addition, the apparent commitment of the listeners toward detecting a difference was not something that could have been predicted ahead of time in any meaningful way.

Note that I used the term "commitment" rather than "bias," which is too loaded a word, I feel. The 505 participants obviously took the test extremely seriously and, according to the conversations I had after each session, tried very hard to answer correctly. The mechanism by which this "bias" operates, as noted by John Koval in the next letter, is probably a matter of answering "Different" when actually not sure.

Most of the correspondents who raised the point seem convinced that this built-in commitment for audiophiles to hear differences was a primary factor behind the detection of differences. However, I have discussed this at length with my collaborator, Will Hammond, and he suggests that this may not be as absolute a reason as, say, that suggested by Mr. Peutter and Dr. Carlson, supporting Professor Burstein's more cautious conclusions. (Incidentally, Mr. Peutter's statement in his otherwise excellent letter that the non-discriminatory drum recording was the only music selection where the numbers of "Sames" and "Differents" were approximately the same is not true. The drum recording featured 265 "Differents" and 239 "Sames," whereas the solo piano recording, which did prove discriminating, had 245 "Differents" and 260 "Sames.") For if it were the only explanation for successful identification, then why were the individual sessions so different? Looking at Table 2, on p.15 of the July issue, it can be seen that Sessions I, II, III, IV, V, and VII conformed to the general trend in that the rate of successful identifications was higher when there were more "Differents" than "Sames." Yet Sessions VI and VIII produced contrary results. Session VIII had three "Same" presentations and four "Different," yet the listeners overall scored worse than average at 48.1% correct identifications, while Session VI listeners scored better than expected despite having four "Sames" out of seven presentations.

It could be argued that these were random fluctuations due to chance. Yet, as each session featured a large "n"—the total number of trials—shouldn't it be expected that any supposed listener bias would still have made its presence known? Unless another factor were influencing the overall scoring, which I conjecture was the differing ability among the listeners to consistently detect subjective differences. Not all listeners will be KEOs. But some must be!

My original conclusions remain unchanged after this correspondence:

1) Hearing amplifier differences under blind conditions is not a trivial or easy task. However, the results tabulated in the July issue suggest to me that this was possible between the test VTL and Adcom amplifiers, even given the sub-optimal conditions of these tests. As pointed out above, taking the non-symmetrical Same/Different balance into account, the overall results still only just missed that 95% confidence level; ie, the risk of making a Type I error, falsely concluding that the listeners overall could hear a difference, is one in 18 rather than one in 20.

2) People seem to differ widely in their ability to hear such differences, but whether this ability is intrinsic or learned or both is open to question. (I suspect that, like ball handling and control, it is a mixture of both.)

3) In the case of the specific amplifiers and loudspeakers used in our tests, there were frequency-response differences that might well correlate with any aural identification. Further work—which does not consist of inserting active equalizers into the signal path—is required here.

4) Not all music is equally good at revealing subjective differences. As suggested by John Crabbe, percussive, transient-rich music seems to be less revealing of amplifier differences than music with more of a sustained nature.

5) Will and I intend to organize further tests, taking careful note of all the points raised in these letters, at the 1990 Stereophile Show to be held in New York City at the end of April. I hope we'll see you there. After all, one point not mentioned in any of these letters was that this kind of listening, if not exactly fun, can still be extremely stimulating.—JA

Is JA a secret objectivist?
Editor: I'd like to congratulate John Atkinson on what appears to be a well-done blind listening test. Personally, based on the frequency-response difference that he measured between the amplifiers, I would have expected a more positive result in favor of an audible difference. However, the conditions appeared to be less than optimum, which probably accounts for the observed results.

Concerning the higher percentage of correct responses for difference as compared to sameness, I feel that this would be an expected result due to the "natural" bias of the majority of the audience. They are attuned to the idea that there should be differences and therefore would tend to select a difference when there was uncertainty, even though the amps were actually the same. This is no different from the situation with an audience naively disposed to the idea that there is no audible difference between amplifiers. They would, of course, be disposed to select no difference when there was uncertainty in their minds, even though there might be an actual audible difference.

On a slightly different note: For someone so vehemently against objectivists, JA is, I'm happy to say, with all his technical measurements, doing a great job of becoming one. I do realize that he seems to equate objectivists with those who believe there is no audible difference between amplifiers, etc. (I suppose that there are some who believe such a thing). But a more accurate definition would be those who believe there are measurable reasons for audible differences, and if there are no "significant" measurable differences, then there are probably no audible differences. I do hope he keeps up the good work, because I'm sure he will find that there is, in fact, a strong correlation between at least certain measurements and what we hear. I am, of course, a strong believer in frequency response as the magical measurement, and I hope he pursues the correlation.—John Koval, Santa Ana, CA



Footnote 2: For derivation of this formula, see my article in the May 1989 issue of JAES.

Footnote 3: For the procedure employed to obtain these confidence limits, see my article in the May 1989 issue of JAES.
