Audio Research SP9 preamplifier Letters 2/88
Editor: I am a newcomer to the fascinating, esoteric, and totally liberating world of audio perfection. Where else could one find such carefree disregard for practical considerations? But this should not mean idiosyncratic subjectivity.
The review of the Audio Research SP9 preamplifier in the November issue of Stereophile seems an anomaly. It raises serious questions about whether your reviews can be trusted. In light of Terry Dorn's manufacturer's comment, I am puzzled why you published your review at all. JA's and JGH's negative findings do not make sense, considering that the SP9 works well for so many people—it is hard to believe that there are 1000 audiophiles who cannot tell "hard, rough, thin, and both spatially and dynamically compressed" sound from warm, round, full, and expanded sound.—Michael M. Piechowski, Washburn, WI
More on the SP9
Editor: I bought an Audio Research SP9 preamplifier before reading your report in Stereophile's November issue (Vol.10 No.8). After listening to the unit in the dealer's store, I took it home, tried it in my system for a couple of days, and concluded that I could live with the sound. However, the unit I then received in a sealed box from the factory sounded different from the unit I had checked out: a bit smoother, a little warmer, and more open. I liked this one better.
But after about two months of use, the sound changed. String music became hard and shrill, with less openness and depth.
I returned the unit to ARC and have now gotten it back. Checking it out, I find it sounds a little smoother, but the rest of the sound is about the same.
According to the repair order ARC sent me, they put in "compatibility Mod. Board and update," whatever that means.
I am using the SP9 with the Counterpoint SA-20 amplifier and PS Audio CD-1A player, with Interlink Reference A cables used between every unit. —Anthony Mattina, Staten Island, NY
Editor: Please cancel my subscription to Stereophile, the main reason being J. Gordon Holt's review of the Audio Research SP9 preamplifier in Vol.10 No.8, along with John Atkinson's few added words. I feel that these two men's days of reviewing are numbered. The adjectives used to describe the sound of the SP9 in the second paragraph on p.113 are the exact opposite of what my wife, a friend of ours, and I heard. (There was also no crosstalk between line-level inputs.) When you read a review that is so entirely different from what you hear, then it is time to move on to more truthful sources.
I've talked with five different dealers about this review. (Two were not ARC dealers, the other three were.) All five said that the review was purely political—I'd say it was purely bull. It is almost as if JGH and JA had a grudge against the Audio Research Corporation.
One final thing: Yes, I did compare the SP9 with other preamplifiers, all costing between $995 and $3250. And yes, the SP9 beat them all by hundreds of miles in overall sound. And yes, I do own one. As a matter of fact, I own two, one in my living room and one in my den.—Name and address withheld on request
Editor: I would like to comment on several points regarding your review of the Audio Research Corporation SP9 preamplifier.
First, the blind comparisons of the SP9 vs the SP11 in England must be interpreted with caution. In four independent trials with a probability of 1:2 for guessing correctly on each, the likelihood of a run of four successes is 1:16; if four misidentifications are also interpreted as meaningful, the probability of either of these outcomes occurring purely by chance is 1:8. This is not sufficient to reject the null hypothesis that JGH could not distinguish between the two preamplifiers under these conditions.
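Dr. Bear's figures are elementary to verify; the following short Python sketch (an editorial illustration with made-up variable names, not part of the letter) reproduces the 1:16 and 1:8 probabilities:

```python
from fractions import Fraction

# Each of the four trials is an independent 50/50 guess under the null
# hypothesis that the listener cannot tell the preamplifiers apart.
trials = 4
p_all_correct = Fraction(1, 2) ** trials   # a run of four successes
p_either_extreme = 2 * p_all_correct       # all right OR all wrong

print(p_all_correct)     # 1/16
print(p_either_extreme)  # 1/8
# 1/8 = 0.125 exceeds the conventional 0.05 threshold, so the null
# hypothesis cannot be rejected:
print(float(p_either_extreme) > 0.05)  # True
```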
Of course, if JGH consistently misidentified the two components over sufficiently many trials, one would conclude that he heard a difference—and, based on his strongly stated judgments about the characteristic sounds of the SP9 and SP11, that he actually preferred the SP9 in the context of the Absolute Sounds system and room. This result might be even more distressing to Audio Research Corporation—but would underscore the important role of component interactions, room compatibility, and the subjective element in comparing one aural perspective to another. —David Bear, MD, Nashville, TN
Editor: I'm delighted to see that after 20 years, you have a Musician in Residence. After reading JA's analysis of the blind listening test comparing the Audio Research SP9 and SP11 (Vol.10 No.8, p.116), I suggest that a Statistician in Residence also be added to your staff. JA's analysis suggests that he may have been listening to good music when he should have been attending mathematics classes.
The two listening tests are treated as independent events. This could be valid—JGH and JA do have different ears. This also could be invalid—if JGH and JA discussed what they heard as they were listening, the trials are not independent. But for lack of any information, let's accept the assumption of independence.
A score of four correct identifications out of four trials surely is significant. I assume that JA would claim that three correct identifications out of four trials is significant. (The advice of a Statistician in Residence could have prevented the need to make assumptions.) Zero out of four is claimed to be as significant as four out of four—this is correct, as the statistics are being used to imply an audible difference between the two preamps rather than to show the ability to identify which is which. Similarly, one out of four is as significant as three out of four. Thus, two out of four is the only score that does not suggest an audible difference.
Of the 2×2×2×2 = 16 possible outcomes of four trials, six have exactly two correct identifications. Thus, given two identical preamps and a single test of four trials, the statistics would imply the preamps sound the same 6:16, or 38% of the time, and appear to sound different 10:16, or 62% of the time.
Comparing two identical preamps by two independent listening tests of four trials each, the results will imply no audible difference 38% × 38%, or 14% of the time, and hence imply an audible difference 86% of the time. "Certainly, it is incontrovertible that a difference was heard" is a stronger statement than the statistics justify.
If we take the stricter assumption that only all-right or all-wrong tests show significant differences, then identical preamps will produce results that imply audible differences 2:16, or 12% of the time. For two independent tests, at least one of the tests would imply audible differences 23% of the time.
If we take the strictest assumption that four correct identifications in four trials is the only significant result, then a four-trial test of two identical preamps will imply a difference 1:16, or 6% of the time. For two independent tests, at least one will imply an audible difference 12% of the time. Thus, even with the strictest interpretation available, the word "incontrovertible" could not be justified.
As noted above, two tests of four trials each will show an audible difference 62% of the time for identical preamps, but the confidence that can be placed in such is quite limited. Perhaps your Statistician in Residence would be able to explain "confidence intervals," which give a measure of just how much confidence can be placed in a statistical implication.—Kevin Willoughby, Framingham, MA
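Mr. Willoughby's percentages can be checked by brute-force enumeration. This Python sketch (an editorial illustration, not part of the letter) reproduces his 86%, 23%, and 12% false-positive rates for two independent four-trial tests of identical preamps:

```python
from itertools import product

# Enumerate all 2^4 outcomes of a four-trial test with identical preamps;
# 1 marks a "correct" identification, which here can only be a lucky guess.
outcomes = list(product([0, 1], repeat=4))
assert len(outcomes) == 16

def false_positive_rate(is_significant):
    """Fraction of pure-chance outcomes a criterion flags as 'audible difference'."""
    hits = sum(1 for o in outcomes if is_significant(sum(o)))
    return hits / len(outcomes)

# Criterion 1: any score other than 2/4 implies a difference.
loose = false_positive_rate(lambda k: k != 2)        # 10/16 = 62.5%
# Criterion 2: only 0/4 or 4/4 implies a difference.
strict = false_positive_rate(lambda k: k in (0, 4))  # 2/16 = 12.5%
# Criterion 3: only 4/4 implies a difference.
strictest = false_positive_rate(lambda k: k == 4)    # 1/16 = 6.25%

# For two independent four-trial tests, a difference is implied
# whenever at least one test fires:
for p in (loose, strict, strictest):
    print(round(100 * (1 - (1 - p) ** 2)))  # prints 86, then 23, then 12
```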
Editor: I just received the first issue of Stereophile offered to me as a special promotion, and I am writing to comment upon a matter of great concern. I refer to your critique of the Audio Research SP9 preamplifier and your defensive reaction to the results of the ad hoc "blind" test performed under duress. As a vision researcher and psychophysicist, I deplore your interpretation of the results, and the claims you make regarding statistics.
In this test, two observers were required to identify correctly two devices, each presented twice. The observer could on any trial expect to guess correctly 50% of the time. The probability of an individual guessing all four presentations correctly is therefore one in 16 (one half to the fourth power). The same is true for guessing all four incorrectly. Thus your comment that J. Gordon Holt's performance was extremely unlikely is false.
The real problem with your analysis is the criterion for reliable detection of real differences: you state that a perfect inverse correlation from one of two observers proves that there are real differences between the devices. Since you were satisfied that one observer guessed wrong every time, it is reasonable to assume that you would have been even happier had he guessed right every time. This doubles the probability of the result to one in eight. Because two observers participated, the probability that one or the other would guess all right or all wrong is double this, or one in four. Clearly we must charitably assume the results are due to chance; otherwise, we must assume that your hypothesis that the SP11 is better than the SP9 is false because of the negative correlation.
Your misguided notions of statistical validity are not what concern me most, however. Instead, it is your view that "blind" testing is of little value that astounded me. You apparently fail to realize that any subjective evaluation of stereo components is essentially a perceptual experiment; only the methodologies vary. The type you prefer is suprathreshold, and qualitative in nature. This approach is of course essential in art; to judge music otherwise would be meaningless. But you are evaluating products, not music. A tempting model for this sort of experiment is the evaluation of wines. Descriptors such as body, sweetness, acidity, and others have some generally acknowledged sensory correlates, and numerical scales can even be applied to describe how sweet, how acidic. This model fails for audio, though, because the taste of wine is the experience intended by the vintner. Audio equipment reproduces music; it should be judged as ersatz wine.
Sensory substitutes can be judged in two ways: testing substitutes vs the original, or testing substitutes vs each other. Your preferred method is the former: you judge reproduced sound vs your memory of what the sound should be.
However, your readers may be interested in relative differences, which require the latter method. The easiest way to design an accurate and reliable test of relative differences is to use a two-alternative, forced-choice method; a "blind" A/B test is a rather weak variant of this. Contrary to your belief that "proving anything from blind testing is extremely hard," it is actually easy to test hypotheses this way if you know what you are doing.
The most important point that I can make is that whether you judge the absolute sound of the device, or the relative sound of two devices, you must understand that you are performing an experiment. You must control stimulus and subject parameters. You seem to have a grasp of the stimulus; efforts have been made to ensure similarity of presentation, although this is of course difficult for situations where physics gets in the way, as in the case of loudspeaker-room interactions. However, you have not allowed for significant variations in listener performance. The most significant of these is state of adaptation; we do not know if your listener has previously been sitting quietly at home, driving on noisy streets, or listening to music for several hours.
Other problems include subject age and experience. The results of poorly designed tests using qualitative descriptions are so strongly influenced by uncontrolled factors that the results are dubious at best. Because of this, it is actually easier to design tests of relative differences, using a forced choice of better or worse. Of course, the criterion of quality must be that of sound reproduction, so subjects must not be allowed to detect differences based upon equipment appearance or reputation; hence the tests must be "blind."
How rigorous need you be? Had you tried eight presentations instead of four, there would have been only one chance in 128 of one observer achieving a perfect positive or negative correlation by chance. You could just as easily have presented more trials. Still, this would not have been a good test, for two reasons. First, stimulus parameters were not controlled: JGH thought the unfamiliar environment caused his poor performance, even though, prior to the test, your team thought that informal listening demonstrated the differences you claimed. Second, subject parameters were not controlled because, as you admitted, the observers had "partied a bit the night before." Fatigue and hangover definitely affect performance. Do they ever affect your other evaluations?
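The dependence of significance on trial count is easy to quantify. Assuming n independent 50/50 trials under the null hypothesis, the chance of a perfect positive or negative correlation by pure guessing is 2/2^n, as this illustrative sketch (not part of the letter) shows:

```python
# Chance that pure guessing yields a perfect run, all right or all wrong,
# over n independent 50/50 trials: 2 outcomes out of 2^n.
def p_perfect_either(n):
    return 2 / 2 ** n

for n in (4, 6, 8):
    print(n, p_perfect_either(n))
# n=4: 0.125     (1/8)   -- not significant at the 95% level
# n=6: 0.03125   (1/32)  -- significant at the 95% level
# n=8: 0.0078125 (1/128)
```

On this assumption, six trials is already enough for a perfect run to clear the conventional 0.05 threshold.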
You may argue that my criticism is unfair because you did not set up the test. A well-designed test would succeed regardless of the identity of the examiner. I suggest for relative-difference evaluation that you consult an experimental psychologist and devise a consistent test to be used for all devices of a given class. This would cost you some time and money, but your readers pay for accurate information.
The promo for your magazine offered the first copy with no obligation. Please cancel my subscription.—Michael A. Morris, OD, Oak Brook, IL
Larry Archibald Comments:
JA answers these letters in more detail below, but I must say that the objections I raised at the time to the "blind A/B testing" we were alleged to have carried out in London have only been confirmed.
Consider: for 26 years now, J. Gordon Holt has hooked up components and listened to them, going to great extremes to separate the behavior of those components from that of his reference system. He describes what he hears. For 26 years people have read what he says, no doubt applying several grains of salt as they see fit, based on their prior experience of his reviews. Their reaction to what he writes cannot be too bad, because they keep buying the magazine he writes for. Not once in this entire time has a successful or meaningful allegation been made with respect to his integrity or to any political alignments he may have.
Now, after having "proved" his lack of malice for ARC by virtually drooling over their SP11 (and subsequently by calling their M300 the best amplifier he's ever heard), JGH is made out to be "political" and of "having a grudge against Audio Research Corporation." And, then, to add insult to injury, we are taken to task for minimizing the importance of a double-blind test we didn't set up, didn't want to take, and don't believe in anyway! This is unjust! All I can say is that it's comforting to have the pages of a magazine in which to discuss it all, and express my outrage.—Larry Archibald
John Atkinson comments:
Taking the major points made above, I must point out that politics never enters into a review published in Stereophile. All our writers are instructed to report honestly on what they hear, regardless of the consequences. To do otherwise—for example, to have given the SP9 a good review because of ARC's track record, or the positive reports from other reviewers, or the contrary opinion of the manufacturer and ARC dealers, despite the evidence of our own ears—would not only be dishonest, it would let Stereophile's readers down. They do not buy the magazine to read what we think we ought to have heard; nor do they buy the magazine to hear us deafly echo the sentiments of other reviewers (unless we agree with them, of course); they need to know what we actually did hear.
And as to whether JGH and I have a grudge against Audio Research, I would point out that I have been a happy user of Audio Research products for five years, currently owning an SP10 Mk.II preamplifier, while JGH gave the company's M300 power amplifier a very positive review in Vol.10 No.9.
To reassure Mr. Willoughby, there was no communication of any kind between JGH and myself during the blind test. With hindsight, I do agree that my use of the word "incontrovertible" was somewhat heavy-handed. Our combined score of 2 out of 8 could be due to chance, while JGH's individual score of 0 correct identifications out of 4 is not significant at the 95% level demanded by statisticians. As I indicated, it is possible to analyze the results presented on p.116 of Vol.10 No.8 on the basis of identifying when an unknown preamplifier was the same as or different from the previous one. If you lump JGH's and my results together, you get the following: out of six potential changes of unknown, we correctly identified five, JGH scoring 3 out of 3. Though still not "incontrovertible," under the adverse conditions of the test—remember that this was sprung on us at almost no notice—this led to my positive feelings of identification.
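The strength of this alternative same/different scoring can also be estimated, under the assumption that each judgment is an independent 50/50 guess when no difference is audible (a sketch added for illustration, not part of the original comment):

```python
from math import comb

# Probability of getting at least k of n binary same/different judgments
# right by pure guessing (exact binomial tail).
def p_at_least(k, n):
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

print(p_at_least(5, 6))  # 7/64 = 0.109375: the combined 5-of-6 score
print(p_at_least(3, 3))  # 1/8  = 0.125:    JGH's individual 3-of-3
```

A combined 5-of-6 score would arise by chance about 11% of the time, consistent with the concession above that the result is suggestive rather than incontrovertible.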
If JGH had continued with the test, and continued to misidentify the preamplifiers for another four trials, then "incontrovertible" would have been undoubtedly the appropriate word to use. Yes, I would have liked to carry out more trials. However, as anyone who has taken part in blind equipment testing will agree, the listeners get fatigued very quickly. To have continued with the testing at another time was not possible, considering the fact that the test was not under our control, and the venue was a public hi-fi show on the very last day.
And one thing that discussion of the results of this blind test—hurriedly organized by a representative of Audio Research—does not alter is JGH's and my original findings: that our (early) SP9 sample failed to achieve the musical performance required for it to be recommended, regardless of politics. In my opinion, Audio Research produced an inexpensive derivative of the SP11 by keeping the superb technical performance but compromising the sound quality, endowing it with a more forward, less universally gratifying balance. My preference would have been to have kept as much as possible of the SP11's sound quality, but to have compromised the technical performance—the noise floor on the phono inputs, say—in order to have produced a true hybrid successor to the always musical, all-tube SP8.—John Atkinson