The Highs & Lows of Double-Blind Testing

Let's suppose we altered our statistical criteria, as Leventhal suggests, so that it would be possible to conclude—with some certainty—that a difference was not heard in a test. What might we accomplish and what is the price we have to pay? We might prove that these listeners in this room didn't hear a certain difference using somebody's "reference" equipment. Who cares? Some other group may well be able to hear this difference. The price we pay is lost statistical power to prove what we really want to know: what it is that we can hear. Our "can hear" information is useful to all audiophiles because they know that they may be able to hear it as well. Proof of hearing also lends credence to opinions expressed by the listeners as to which sound they prefer.
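The trade-off at issue here can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: it assumes a 16-trial test scored right/wrong against chance (p = 0.5) and a hypothetical listener who genuinely hears the difference on 70% of trials — the 0.7 effect size is an assumption for illustration, not a figure from either letter.

```python
from math import comb

def tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

N = 16        # trials per listener, as in the ABX sessions described later
P_TRUE = 0.7  # assumed: a listener who truly hears the difference 70% of the time

# Strict passing score (12 of 16 correct): few false positives, many misses.
alpha_strict = tail(N, 12, 0.5)        # Type 1 risk, about 0.04
beta_strict = 1 - tail(N, 12, P_TRUE)  # Type 2 risk, about 0.55

# Relaxed passing score (10 of 16 correct): the two risks trade places.
alpha_relaxed = tail(N, 10, 0.5)        # Type 1 risk, about 0.23
beta_relaxed = 1 - tail(N, 10, P_TRUE)  # Type 2 risk, about 0.18
```

Tightening the passing score protects the "can hear" claim but makes a "no difference heard" verdict nearly meaningless; relaxing it does the reverse. That is the price each side of this exchange is weighing.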

Our analysis has helped us prove that audiophiles can hear small amounts of flutter, response aberrations, and distortion, as well as differences between CD players, hi-fi VCRs, and phono pickups. Leventhal's statistics would have us sacrifice the certainty of hearing a difference to make the insignificant statement that "In this test, no difference was heard."

Leventhal seems insensitive to the needs of a music listener in subjective testing. He suggests increasing the number of trials (difference identification attempts), or the number of listeners, to restore the statistical power lost in his analysis method. This may work for psychology experiments, running laboratory rats through a maze back at the university, but listeners need a transparent test that encourages their most sensitive performance and ends before they become jaded or emotionally drained. Also, large numbers of qualified listeners are hard to find.
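The arithmetic behind this complaint can be sketched numerically. The following is a rough illustration, not either author's method: it assumes a one-tailed binomial test against chance at the .05 level and a hypothetical listener who is correct on 70% of trials (both the 0.7 effect size and the 80% power target are assumed figures), and it searches for the number of trials needed before the test stands an 80% chance of detecting that listener.

```python
from math import comb

def tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def power_at(n: int, p_true: float = 0.7, alpha: float = 0.05) -> float:
    """Power of an n-trial, one-tailed binomial test of 'guessing' (p = 0.5)."""
    # Lowest passing score whose chance probability does not exceed alpha
    # (k = n + 1 means no score can pass, giving zero power for tiny n).
    k_crit = next(k for k in range(n + 2) if tail(n, k, 0.5) <= alpha)
    return tail(n, k_crit, p_true)

# A 16-trial test detects this listener less than half the time...
low_power = power_at(16)
# ...and restoring 80% power requires a considerably longer test.
n_needed = next(n for n in range(1, 200) if power_at(n) >= 0.8)
```

Under these assumptions the required trial count comes out well above the 16 decisions per listener discussed below — which is exactly the added burden on listeners that Clark objects to, and the added protection against Type 2 error that Leventhal asks for.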

What Leventhal fails to acknowledge is that our tests do use a very large number of trials; we simply do not report each of them individually. Many changeovers between sound A and sound B are made by the listeners before arriving at a decision, which is then reported as a single trial. This accommodates the natures of music and of the human decision-making process, and it relieves the listeners of the distraction of recording each and every comparison. A single listener may make over 1000 comparisons in the course of arriving at the 16 decisions we call "trials." I suspect that Professor Leventhal's insistence on tests that conform to his statistical rigor results from his never having personally organized or participated in a blind listening test.

I welcome constructive criticism, but I don't find it in the tone or substance of Leventhal's letter. The ABX Comparator system, which I helped develop, has been refined during the 10 years of its existence by the suggestions of many audiophiles and scientists. Some hardware improvements of this system can even be traced to the pages of this magazine (Vol.5 No.5). Other inputs have resulted in the development of double-blind listening tests which require no switching. The reason for perfecting listening tests is to develop the ability to hear sonic improvements when they exist as sound, rather than as mere claims. To quote the esteemed J. Gordon Holt on the subject of double-blind testing, "The losers will be the dissemblers, the frauds, and those skilled in the art of autohype. The winners, ultimately, will be music and the rest of us who are interested in the maximal fidelity of reproduced music."—David Clark

Les Leventhal Responds
David Clark and other double-blind experimenters have been subjected to a great deal of vitriolic criticism, in my opinion undeservedly. I believe Mr. Clark, Dan Shanefield, and others have made an enormous contribution to the science of listening tests, and I have great respect for their work. But nobody except the Almighty is perfect—not even me. So it is with the objective of improving already good work that I commented on their research in my previous writings and that I comment now on Mr. Clark's letter.

Mr. Clark took issue with my letter, which attempted to summarize in nontechnical language some of the points in a conference paper I presented to the Audio Engineering Society. That paper, revised and expanded, is due to appear in the June 1986 issue of the Journal of the Audio Engineering Society under the title "Type 1 and Type 2 Errors in the Statistical Analysis of Listening Tests." References I make below to "my paper" refer to the journal paper, not the conference paper.

Before I get carried away puffing a lot of academic hot air, let me briefly and simply state what I think Mr. Clark disagrees with (it is difficult to be certain) and my response. The main point of contention is small-N listening tests, i.e., listening tests with a small number of trials and/or listeners. Mr. Clark, for a variety of reasons stated in his letter, disagrees with my contention that small-N listening tests analyzed with a statistical test of significance employing the .05 level of significance will suffer a large risk of overlooking small-to-moderate audible differences when they are present (a large risk of Type 2 error), and only a small probability of finding them (small statistical power). (Mr. Clark and other researchers have published a large number of small-N studies employing the .05 level of significance.)

My reply is simply that my position consists of basic, uncontroversial statistics, is demonstrated mathematically in my paper, and can be found in any good elementary statistics textbook. A reader who believes me can stop here. A reader who doesn't will have to suffer through what follows: a painstaking unraveling of the tangles of Mr. Clark's logic, showing that at the heart of each objection is a misunderstanding of elementary statistics or research methodology.