The Highs & Lows of Double-Blind Testing

The significance of statistics can be seen in an experiment that is just one hair short of perfect. Suppose there is a one-in-a-million chance that the experiment is not perfect: in a million trials, a "false" will turn up as a "true" one time. The experiment is conducted and that one "false" occurs. The experimenter is then killed in a freak accident before he can conduct any more trials. Here we have a valid experiment (1/1,000,000 probability of Type 1 error) with untrue results. Statistical verification through repetition is thus a prerequisite for valid results, but it is not the cause of those results.

Statistics can also verify biased results. A million trials of a biased test are just as invalid as one trial, only more reliably so. The moral is that validity can only be determined by examining the test and its inherent characteristics. Leventhal is right in concluding that aggregation of "unfair" results is unfair, but he fails to examine the test itself for fairness. Statistics are just numbers. They are neither fair nor unfair. Numbers just don't care.

Fairness and high sensitivity are just what make the ABX method so appealing: it contains the validity elements that constitute a fair test. Listener and administrator bias are controlled by concealing the identity of the device under test. The listener gets direct, level-controlled access to the device under test, the control device, and X, with multidirectional switching and user-controlled duration. Contrast this with open evaluation: usually no more than one or two switching trials, no controls over listener or administrator bias or level, references that often aren't even present during the test, and no recorded numerical results or statistical analysis.

Which is the fairer?

How about sensitivity? Les Leventhal builds his entire fairness case around the idea that subtle differences may be present only 60-80% of the time during the tests. When p approaches 0.9 (differences present 90% of the time), the fairness coefficient evens up, and even a 16-trial test meets the criteria for both Type 1 and Type 2 error. Notice that the probability of error is not the same as actual error. Even a perfect one-trial experiment would carry an unacceptably high risk of both Type 1 and Type 2 error. So what makes for a sensitive listening test? What actual values can we expect for p?
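The error rates in this claim can be checked with straight binomial arithmetic. A sketch, under two assumptions of mine that the letter does not spell out: the usual ABX acceptance criterion of 12 or more correct out of 16, and a listener who guesses at chance on trials where the difference is inaudible.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, crit = 16, 12                          # 16 trials; 12+ correct counts as detection
alpha = binom_tail(n, crit, 0.5)          # Type 1 error: a pure guesser passing

# Sensitivity p = 0.9: difference audible on 90% of trials,
# coin-flip guessing (0.5) on the remaining 10% (assumed).
p_correct = 0.9 * 1.0 + 0.1 * 0.5         # 0.95 per-trial success rate
beta = 1 - binom_tail(n, crit, p_correct) # Type 2 error: a real difference missed

print(f"alpha = {alpha:.4f}, beta = {beta:.4f}")
```

Under these assumptions alpha comes out near 0.04 and beta well under 0.01, consistent with the claim that a 16-trial test meets both error criteria when p approaches 0.9.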

A casual survey of any of the underground magazines shows that audiophiles typically find it fairly easy to perceive differences. Leventhal implies that p may be low, though there is nothing in the audiophile position to support such a notion. Read any decent "audiophile" review and draw your own conclusion as to the value of p inherent in their position.

An examination of the 16-trial (N = 16) tests referenced by Dr. Leventhal reveals conditions indicative of high sensitivity. Clark and Greenhill auditioned the devices under test prior to the test to identify sonic characteristics. The ABX blind tests were performed using their personal reference systems, with familiar program material and at their leisure. I find it difficult to believe that this procedure might have a sensitivity of under 0.9.

A low sensitivity value of, say, 0.6 for p suggests that for every 10 trials only 6 real trials occur. Thus one must increase the sample size to add enough real trials to avoid Type 2 error. A low-sensitivity test of 16 trials is only a 10-trial test under these conditions. If the differences are only present on 60% of all the program material available, and if your material is chosen from a random sample, then the sensitivity issue might apply. However, the identification of material where differences are present is imperative for sensitive testing. It also enables us to test for differences that may only be present 10%, or even 1%, of the time. We can make these tests by selecting programs in which differences are present 100% of the time during the test. It seems to me that this is what audiophiles do, and precisely what Clark and Greenhill, Shanefield, Lipshitz and Vanderkooy, et al, do also.
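The effective-trials arithmetic above can be sketched the same way. Again the 12-of-16 criterion and chance guessing on insensitive trials are my assumptions, not details from the letter.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

sensitivity = 0.6
n = 16
real_trials = sensitivity * n             # 9.6 -- roughly a 10-trial test

# Per-trial success: hear it when present, guess otherwise (assumed).
p_correct = sensitivity + (1 - sensitivity) * 0.5   # 0.8
beta = 1 - binom_tail(n, 12, p_correct)   # Type 2 error at the 12/16 criterion

print(f"real trials ~ {real_trials:.1f}, beta = {beta:.3f}")
```

At sensitivity 0.6 the Type 2 risk lands around 20 percent, which is exactly why selecting program material where the difference is always present matters so much.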

For tests using listener groups it may be difficult to give all listeners completely sensitive programs. However, because the sample is now much larger, only 100 total trials are needed to reduce the risk of Type 2 error to less than 1% with a listener sensitivity of 0.7. Using 10 listeners in a 16-trial test would mean 160 total trials.
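The 100-trial group figure can be checked with the same machinery. I assume a one-tailed 5% acceptance criterion derived from the binomial distribution, and chance guessing on insensitive trials; neither detail is stated in the letter.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 100
# Smallest criterion keeping Type 1 error at or below 5% under pure guessing:
crit = next(k for k in range(n + 1) if binom_tail(n, k, 0.5) <= 0.05)

p_correct = 0.7 + 0.3 * 0.5               # sensitivity 0.7 -> 0.85 per trial
beta = 1 - binom_tail(n, crit, p_correct) # Type 2 error for the pooled group

print(f"criterion = {crit} correct, beta = {beta:.2e}")
```

Under these assumptions the criterion is 59 correct of 100, and the Type 2 risk at a group sensitivity of 0.7 falls far below the 1% claimed, supporting the point that pooling listeners buys statistical power cheaply.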

I find it interesting that no one has difficulty discovering differences during subjective evaluations. However, during the open sessions I've participated in, the general sensitivity level of the listeners often seems to be greater than one (p equal to or greater than 1.0). Differences abound. However, sometimes these differences mystically disappear under blind conditions. Why? It seems to me that many of them are part of the relationship, or interface, between the listener and the gear. The things the listener hears are as much a part of the listener as they are a part of the equipment. Withholding the identity of the equipment breaks the bond with the listener, and the differences disappear.

As an audiophile, it is important to me to know which differences are attributable to the equipment alone. Those which are part of the listener interface may not apply to me. The ABX method is the only test I am aware of that makes this important distinction. It is the only one that has both scientific validity and statistical reliability. I don't doubt that listeners and golden ears hear what they hear, but there is scant evidence that others would hear it. While the debate rages on, I will devote my energy to areas where there is no argument about the existence of major differences. Loudspeakers, anyone?—Thomas A. Nousaine, Chicago, IL