Thus, with a 16-trial listening test analyzed at the conventional .05 level of significance, the probability of the investigator overlooking differences so subtle that the listener can correctly identify them only 60% of the time is a whopping .8334! Accordingly, when true differences between components are subtle, it is not surprising that 16-trial listening tests with (or without) the ABX comparator typically fail to find them. What if 50 trials are run? The table shows that the investigator must require the listener to make 32 or more correct identifications (r = 32) to conclude that…
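Leventhal's .8334 figure follows from a straightforward binomial calculation, which can be checked as follows (a sketch; the function names here are mine, not from the article):

```python
from math import comb

def critical_r(n, alpha=0.05):
    """Smallest number of hits r such that P(X >= r | pure guessing) <= alpha."""
    for r in range(n + 1):
        if sum(comb(n, k) for k in range(r, n + 1)) / 2**n <= alpha:
            return r

def type2_error(n, r, p):
    """P(scoring below r | true identification rate p): the chance of a miss."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r))

print(critical_r(16))                       # 12 correct required out of 16
print(round(type2_error(16, 12, 0.6), 4))   # 0.8334, matching the text
print(critical_r(50))                       # 32 correct required out of 50
```

With a true identification rate of only 60%, a 16-trial test demanding 12 hits will miss the difference more than five times out of six, exactly as the article states.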
If an editor, in response to the above, tells us that steps have been taken to eliminate this prejudice or that bias from the reviewers, the editor will have missed the point. The point is that there are many commonalities among people in general and underground equipment reviewers in particular, some known and probably some unknown, that may produce similar errors in seemingly independent reviews. At best, an editor can take steps to eliminate or counteract the effects of only the known commonalities. However, the strength of the double-blind (or single-blind) method is that it eliminates…
Let's suppose we altered our statistical criteria, as Leventhal suggests, so that it would be possible to conclude—with some certainty—that a difference was not heard in a test. What might we accomplish and what is the price we have to pay? We might prove that these listeners in this room didn't hear a certain difference using somebody's "reference" equipment. Who cares? Some other group may well be able to hear this difference. The price we pay is lost statistical power to prove what we really want to know: what it is that we can hear. Our "can hear" information is useful to all audiophiles…
The Example: Since most of the points to be discussed are statistical, it will be helpful to focus on a concrete example. Consider Greenhill and Clark's assessment of the McIntosh MC 2002 amplifier in Audio (April '85, pp.56-60). The McIntosh was compared in a double-blind test to another amplifier, apparently the Levinson ML-9. The listener (Greenhill) correctly identified "the randomly selected amp in only 10 out of 16 trials," a rate which failed to reach "the desired 95% level of significance." On p.96 of the same issue of Audio, Greenhill and Clark discuss their methodology and state…
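For reference, the exact one-tailed probability of scoring 10 or better out of 16 by pure guessing can be computed directly (an illustrative calculation, not from the article):

```python
from math import comb

# P(X >= 10 | n = 16, p = 0.5): one-tailed binomial p-value for 10 hits in 16 trials
p_value = sum(comb(16, k) for k in range(10, 17)) / 2**16
print(round(p_value, 3))   # 0.227 -- well short of the .05 criterion
```

A score of 10/16 would occur by chance alone almost a quarter of the time, which is why the result fell short of the 95% criterion.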
Then Mr. Clark addresses the price of following Leventhal's advice to reduce Type 2 error: "The price we pay is lost statistical power to prove what we really want to know: what it is that we can hear." Mr. Clark goes on to describe the importance of information we would lose. Mr. Clark's essay is actually quite moving, except for one small detail: he has reversed the actual relationship between Type 2 error and statistical power. The truth is that anything which reduces the Type 2 error will INCREASE, not decrease, statistical power. So the consequence of Leventhal's advice would be reduced…
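The relationship Leventhal invokes is simply power = 1 − (Type 2 error): the two quantities must move in opposite directions. A sketch (function names are mine) showing both as the trial count grows:

```python
from math import comb

def critical_r(n, alpha=0.05):
    """Smallest r with P(X >= r | pure guessing) <= alpha."""
    for r in range(n + 1):
        if sum(comb(n, k) for k in range(r, n + 1)) / 2**n <= alpha:
            return r

def power(n, r, p):
    """P(X >= r | true identification rate p): chance of detecting the difference."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

for n in (16, 50):
    r = critical_r(n)
    pw = power(n, r, 0.6)
    # Type 2 error and power always sum to 1: shrinking one necessarily raises the other
    print(n, r, round(1 - pw, 4), round(pw, 4))
```

Going from 16 to 50 trials lowers the Type 2 error and raises power by exactly the same amount, which is why reducing Type 2 error cannot cost power.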
While the significance test technique used by Greenhill and Clark is reasonable and conventional for their data, nevertheless there are mathematically derivable implications and consequences of using this technique to make decisions, consequences with which most audio engineers, including Mr. Clark, are apparently unfamiliar. All I did was to point out some of the more important implications, those regarding Type 2 error and statistical power. These implications can be found in most elementary textbooks on statistics and are not at all controversial. If Mr. Clark does not like the…
Letters in response appeared in Stereophile, Vol.9 No.8, December 1986:

The Double-Blind Debate #1
Editor: In JA's "Two-Cents' Worth" conclusion to "The Double-Blind & the Not-so-Blind" (Vol.9 No.5), he pondered the existence of an audio subculture believing that most pairs of similarly described audio components sound the same. I, for one, can tell you why I am a member of that subculture. Due to the margins for error correctly pointed out by Mr. Leventhal, I cannot at this time demand your acknowledgment that my audio subculture is not as blind as you say. However, I would…
The Double-Blind Debate #2
Editor: I am writing these words of response to the brouhaha over double-blind testing as a consumer and cover-to-cover reader of Stereophile. After dulling my brain on all the statistical volleying (I've lost track of who's on which side), it became clear that the real argument was whether or not there is a significantly audible difference between different makes of hi-fi equipment. Each side used "scientific conclusions" regarding double-blind testing to support a subterranean contention.
I put forward these reasons why, regardless of statistical…
The significance of statistics can be seen with the experiment that is just one hair short of perfect. Suppose there is a one-in-a-million chance that the experiment is not perfect: in a million trials, a "false" will turn up as a "true" one time. The experiment is conducted, and that one-in-a-million "false" happens to occur. The experimenter is then killed in a freak accident before he can conduct any more trials. Here we have a valid experiment (1/1,000,000 probability of Type 1 error) with untrue results. Statistical verification through repetition is thus truly necessary, a prerequisite for valid results, but it…
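The letter's point can be put numerically: any single run can be the fluke, but the odds of the same fluke recurring across independent replications collapse geometrically (illustrative arithmetic, not from the letter):

```python
# Probability of the one-in-a-million false result recurring in all k independent runs
alpha = 1e-6
for k in (1, 2, 3):
    print(k, alpha**k)   # shrinks from 1e-6 to 1e-12 to 1e-18
```

Two replications already drive the chance of the same spurious outcome down to one in a trillion, which is why repetition, not a low error rate in a single trial, is what secures a result.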
Following my reports on 13 mainly inexpensive loudspeakers that have appeared in the last four issues of Stereophile, I thought I would give myself a treat this month by reviewing the latest incarnation of a model that has stood the test of time: the two-way Celestion SL600Si...This is a carefully tuned infinite-baffle design, sacrificing ultimate extension for upper-bass and lower-midrange quality. Its crossover is conventionally British in that it puts flatness of on-axis amplitude response ahead of time coherence, while everything about it, from drive-units to the cabinet itself, is…