The Highs & Lows of Double-Blind Testing
Editor: JGH's assertions in Vol.8 No.5 that he knows what the ABX box does and no longer needs it seem a bit disingenuous. The double-blind comparator exists because of the experiences of people who, like JGH and other subjective reviewers, noticed differences in components and wanted a faster and more reliable way to identify them. When, after level and frequency response had been very carefully equalized, many of those differences proved inaudible, it came as an unpleasant surprise. People need to trust their senses; that's why arguments about the nature of reality are the fiercest kind.
As LA commented, I and others have done many experiments in which component differences were clearly audible, despite any problems in experiment, system, room, or listeners, as long as we pushed only buttons A or B, effectively duplicating the subjective reviewer's non-blind procedure (meaning you always know what you're listening to). When we start pushing the button labeled X (footnote 1), which connects either component A or B (but only the comparator's microprocessor knows which), the choice suddenly seems more difficult. Lo and behold, our guesses prove inaccurate from three to seven times out of ten, indicating that we couldn't tell the difference after all.
I don't mean to suggest that the ABX test isn't sensitive. Under many circumstances, it does what was originally hoped; that is, demonstrates quickly and easily that a difference is audible. In an AES workshop last October we tested for the audibility of the Carver CD-fixing box, the so-called Digital Time Lens, using pink noise as a source, and 124 out of 124 responses were correct.
But subtler characteristics may be harder to identify with the comparator, especially given the habitual rapid switching that the device seems to encourage. While it's true that it can be used for long-term blind testing, no one seems to have the patience. Yet another interpretation of the first story is that the anxiety produced by listening to the unknown decreases the sensitivity of the listeners. That anxiety can raise sensory thresholds is well-proven.
These or other mechanisms may, at any time, give a false negative result in a test for audibility. I can never disprove the existence of sonic characteristics that for some reason don't show up in a double-blind test. But some differences, including many that seem quite subtle, do show up in such trials. The distinction between the two kinds of characteristics is a useful one: I think those that do show up in double-blinds are more important, and more worth spending money on, than those that don't. Many people disagree; that's what keeps high-end audio alive.—E. Brad Meyer, Lincoln, MA
Then, in Vol.9 No.2, Les Leventhal, of the University of Manitoba's Psychology Department, dropped a bomb into the pro-ABX waters by contributing an article based on his Audio Engineering Society paper, "How Conventional Statistical Analyses Can Prevent Finding Audible Differences In Listening Tests," Preprint 2275 (C-9), which had been presented at the 79th AES Convention in New York, October 1985:
The Highs & Lows of Double-Blind Testing
In his response to a letter from reader C.J. Huss (Vol.8 No.5), Larry Archibald stated that "subtle differences between products widely acknowledged to sound different have not been corroborated" in double-blind A/B tests which use the ABX comparator. Mr. Archibald suggested two possible explanations: (1) Double-blind tests using the ABX comparator show that the subtle differences are imaginary, and (2) Double-blind tests using the ABX comparator somehow fail to reveal true subtle differences.
I do not know whether these "subtle differences" are real or imaginary. But I do know that many listening tests using the ABX comparator, including many published tests such as those in Audio cited by reader Huss, are conducted and analyzed in such a way that subtle differences actually heard by the listener will likely go unidentified by the experimenter when the data is analyzed. The problem with these listening studies is that the experimenters conducted too few trials (for example, 16), and used the .05 level of significance when subjecting the data to a statistical test of significance. Only in a large-trial listening study can the results be tested at a significance level as small as .05 without the risk of overlooking small differences becoming unacceptably high. To see why this is so, a little background in statistics (having nothing to do with audio) is necessary.
Footnote 1: Readers who would like to know, in depth, what the ABX comparator does are referred to J. Gordon Holt's review and editorial in Vol.5 No.5. Briefly, the subject has the opportunity to choose either of two components to listen to, labeled A and B. There is a third button on the comparator, however, labeled X, which chooses A or B without the subject knowing which. The comparator keeps track of what was chosen. The subject, after choosing X a predetermined number of times (say, 10) and attempting to identify what X was, can then check the comparator's memory to see how he or she did. It is still possible to choose A or B to verify one's memory after commencing with X.—Larry Archibald