The 2011 Richard C. Heyser Memorial Lecture: "Where Did the Negative Frequencies Go?" Measuring Sound Quality, The Art of Reviewing
Measuring Sound Quality
This is a table I prepared for my 1997 AES paper on measuring loudspeakers. On the left are the typical measurements I perform in my reviews; on the right are the areas of subjective judgment. It is immediately obvious that there is no direct mapping between any specific measurement and what we perceive. Not one of the parameters in the first column appears to bear any direct correlation with one of the subjective attributes in the second column. If, for example, an engineer needs to measure a loudspeaker's perceived "transparency," there isn't any single two- or three-dimensional graph that can be plotted to show "objective" performance parameters that correlate with the subjective attribute. Everything a loudspeaker does affects the concept of transparency to some degree or other. You need to examine all the measurements simultaneously.
This was touched on by Richard Heyser in his 1986 presentation to the London AES. While developing Time Delay Spectrometry, he became convinced that traditional measurements, where one parameter is plotted against another, fail to provide a complete picture of a component's sound quality. What we hear is a multidimensional array of information in which the whole is greater than the sum of the routinely measured parts.
And this is without considering that all the measurements listed examine changes in the voltage or pressure signals in just one of the information channels. Yet the defects of recording and reproduction systems affect not just one of those channels but both simultaneously. We measure in mono but listen in stereo, where such matters as directional unmasking (where the aberration appears to come from a different point in the soundstage than the acoustic model associated with it, thus making it more audible than a mono-dimensional measurement would predict) can have a significant effect. (This was a subject discussed by Richard Heyser.)
Most important, the audible effect of measurable defects is not heard as their direct effect on the signals but as changes in the perceived character of the oh-so-fragile acoustic models. And that is without considering the higher-order constructs that concern the music that those acoustic models convey, and the even higher-order constructs involving the listener's relationship to the musical message. The engineer measures changes in a voltage or pressure wave; the listener is concerned with abstractions based on constructs based on models!
Again, this was something I first heard described by Richard Heyser in 1986. He gave, as an example of these layers of abstraction, something with which we are all familiar yet cannot be measured: the concept of "Chopin-ness." Any music student can churn out a piece of music which a human listener will recognize as being similar to what Chopin would have written; it is hard to conceive of a set of audio measurements that a computer could use to come to the same conclusion.
Once you are concerned with a model-based view of sound quality, this leads to the realization that the nature of what a component does wrong is of greater importance than the level of what it does wrong: 1% of one kind of distortion can be innocuous, even musically appropriate, whereas 0.01% of a different kind of distortion can be musical anathema.
Consider the sounds of the clarinet I was playing in that 1975 album track. You hear it unambiguously as a clarinet, which means that enough of the small wrinkles in its original live sound that identify it as a clarinet are preserved by the recording and playback systems. Without those wrinkles in the sound, you would be unable to perceive that a clarinet was playing at that point in the music, yet those wrinkles represent a tiny proportion of the total energy that reaches your ears. System distortions that may be thought to be inconsequential compared with the total sound level can become enormously significant when referenced to the stereo signal's "clarinet-ness" content, if you will: the only way to judge whether or not they are significant is to listen.
But what if you are not familiar with the sound of the clarinet? From the acoustic-model-based view, it seems self-evident that the listener can construct an internal model only from what he or she is already familiar with. When the listener is presented with truly novel data, the internal models lose contact with reality. For example, in 1915 Edison conducted a live vs recorded demonstration between the live voice of soprano Anna Case and his Diamond Disc Phonograph. To everyone's surprise, reported Ms. Case, "Everybody, including myself, was astonished to find that it was impossible to distinguish between my own voice, and Mr. Edison's re-creation of it."
Much later, Anna Case admitted that she had toned down her voice to better match the phonograph. Still, the point is not that those early audiophiles were hard of hearing or just plain dumb, but that, without prior experience of the phonograph, the failings we would now find so obvious just didn't fit into the acoustic model those listeners were constructing of Ms. Case's voice.
I had a similar experience back in early 1983, when I was auditioning an early orchestral CD with the late Raymond Cooke, founder of KEF. I remarked that the CD sounded pretty good to me (no surface noise or tracing distortion, the speed stability, the clarity of the low frequencies) when Raymond metaphorically shook me by the shoulders: "Can't you hear that quality of high frequencies? It sounds like grains of rice being dropped onto a taut paper sheet." And up to that point, no, I had not noticed anything amiss with the high frequencies (footnote 3). My internal models were based on my decades of experience of listening to LPs. I had yet to learn the signature of the PCM system's failings; all I heard was the absence of the all-too-familiar failings of the LP. Until Raymond opened the door for me, I had no means of constructing a model that allowed for the failings of the CD medium.
An apparently opposite example: In a public lecture in November 1982, I played both an all-digital CD of Rimsky-Korsakov's Scheherazade and Beecham's 1957 LP with the Royal Philharmonic Orchestra of the same work, without telling the audience which was which. (Actually, to avoid the "Clever Hans" effect, an assistant behind a curtain played the discs.) When I asked the listeners to tell me, by a show of hands, which they thought was the CD, they overwhelmingly voted for what turned out to be the analog LP as being the sound of the brave new digital world!
I went home puzzled by the conflict between what I knew must be the superior medium and what the audience preferred. Of course, the LP is based on an elegant concept: RIAA equalization. As Bob Stuart has explained, this results in the LP having better resolution than CD where it is most important (in the presence region, where the ear is most sensitive) but not as good where it doesn't matter, in the top or bottom octaves. But with hindsight, it was clear that I had asked the wrong question: instead of asking what the listeners had preferred, I had asked them to identify which they thought was the new medium. They had voted for the presentation with which they were most familiar, that had allowed them to more easily construct their internal models, and that ease had led them to the wrong conclusion.
When people say they like or dislike what they are hearing, therefore, you can't discard this information, or say that their preference is wrong. The listeners are describing the fundamental state of their internal constructs, and that is real, if not always useful, data. This makes audio testing very complex, particularly when you consider that the brain will construct those internal acoustic models with incomplete data (footnote 4).
So how do you test the effectiveness of how changing the external stimulus facilitates the construction of those internal models?
In his keynote address at the London AES Conference in 2007, for example, Peter Craven discussed the improvement in sound quality of a digital transfer of a 78rpm disc of a live electrical recording of an aria from Puccini's La Bohème when the sample rate was increased from 44.1 to 192kHz. Even 16-bit PCM is overkill for the 1926 recording's limited dynamic range, and though the original's bandwidth was surprisingly wide, given its vintage, 44.1kHz sampling would be more than enough to capture everything in the music, according to conventional information theory.
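The "conventional information theory" view can be put into numbers with two textbook relations: the Nyquist limit (half the sample rate) and the ideal signal-to-quantization-noise ratio for a full-scale sine wave, roughly 6.02N + 1.76 dB for N bits. The following is only a minimal sketch under those idealized assumptions (no dither, no noise shaping); the function names are mine, chosen for illustration.

```python
import math

def nyquist_khz(sample_rate_khz):
    """Highest frequency an ideal sampler can represent, in kHz."""
    return sample_rate_khz / 2

def ideal_snr_db(bits):
    """Ideal quantization SNR for a full-scale sine (about 6.02*N + 1.76 dB)."""
    return 20 * math.log10(2) * bits + 10 * math.log10(1.5)

for rate in (44.1, 192.0):
    print(f"{rate} kHz sampling -> bandwidth up to {nyquist_khz(rate):.2f} kHz")

print(f"16-bit PCM -> about {ideal_snr_db(16):.1f} dB theoretical dynamic range")
```

On these figures alone, 16-bit/44.1kHz PCM offers far more range and bandwidth than a 1926 shellac disc could contain, which is exactly why the reported improvement at 192kHz calls for a different kind of explanation.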
But as Peter pointed out, with such a recording there is more to the sound than only the music. Specifically, there is the surface noise of the original shellac disc. The improvement in sound quality resulting from the use of a high-sampling-rate transfer involved this noise appearing to float more free of the music; with lower sample rates, it sounded more integrated into the music, and thus degraded it more.
Peter offered a hypothesis to explain this perception: "the ear as detective." "A police detective searches for clues in the evidence; the ear/brain searches for cues in the recording," he explained, referring to the Barry Blesser paper I mentioned earlier. Given that audio reproduction is, almost by definition, "partial input," Peter wondered whether the reason listeners respond positively to higher sample rates and greater bit depths is that these better preserve the cues that aid listeners in the creation of internal models of what they perceive. If that is so, then it becomes easier for listeners to distinguish between desired acoustic objects (the music) and unwanted objects (noise and distortion). And if these can be more easily differentiated, they can then be more easily ignored.
Once you have wrapped your head around the internal-model-based view of perception, it becomes clear why quick-switched blind testing so often produces null results. Such blind tests can differentiate between sounds, but they are not efficient at differentiating the quality of the first-, second-, and third-order internal constructs outlined earlier, particularly if the listener is not in control of the switch.
I'll give an example: Your partner has the TV's remote control; your partner flashes up the program guide, but before you can make sense of the screen, she scrolls down, leaving you confused. And so on. In other words, you have been presented with a sensory stimulus, but have not been given enough time to form the appropriate internal model. Many of the blind tests in which I have participated echo this problem: The proctor switches faster than you have time to form a model, which in the end produces a result that is no different from chance.
The fact that the listener is therefore in a different state of mind in a quick-switched blind test than he would be when listening to music becomes a significant interfering variable. Rigorous blind testing, if it is to produce valid results, thus becomes a lengthy and time-consuming affair using listeners who are experienced and comfortable with the test procedure.
There is also the problem that when it comes to forming an internal model, everything matters, including the listener's cultural expectations and experience of the test itself. The listener in a blind test develops expectations based on previous trials, and the test designer needs to take those expectations into account.
For example, in 1989 I organized a large-scale blind comparison of two amplifiers using the attendees at a Stereophile Hi-Fi Show as my listeners. We carried out 56 tests, each of which would consist of seven forced-choice A/B-type comparisons in which the amplifiers would be Same or Different. To decide the Sames and Differents, I used a random number generator. However, if you think about this, sequences where there are seven Sames or Differents in a row will not be uncommon. Concerned that, presented with such a sequence, my listeners would stop trusting their ears and start to guess, whenever the random number generator indicated that a session of seven presentations should be six or seven consecutive Differents or Sames, I discarded it. Think about it: If you took part in a listening test and you got seven presentations where the amplifiers appeared to be the same, wouldn't you start to doubt what you were hearing?
I felt it important to reduce this history effect in each test. However, this inadvertently subjected the listeners to more Differents than Sames (224 vs 168), which I didn't realize until the weekend's worth of tests was over. As critics pointed out, this in itself became an interfering variable.
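The arithmetic behind this design can be checked with a short sketch. It assumes the random number generator behaved like a fair coin for each presentation, which the text does not state, and it uses only the figures given above (56 sessions, seven presentations each, 224 Differents vs 168 Sames). It enumerates every possible seven-item sequence to see how often the "six or seven in a row" rule would trigger, and it computes the score available to a listener who simply answers "Different" every time.

```python
from itertools import product

SESSIONS = 56                  # sessions run over the weekend (from the text)
TRIALS_PER_SESSION = 7         # presentations per session (from the text)
DIFFERENTS, SAMES = 224, 168   # totals reported in the text

def longest_run(seq):
    """Length of the longest run of identical consecutive items."""
    best = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

# All 2**7 Same/Different sequences, equally likely under the fair-coin assumption.
sequences = list(product("SD", repeat=TRIALS_PER_SESSION))
rejected = [s for s in sequences if longest_run(s) >= 6]

p_reject = len(rejected) / len(sequences)
total = DIFFERENTS + SAMES
always_different_score = DIFFERENTS / total

print(f"sequences rejected: {len(rejected)} of {len(sequences)} ({p_reject:.1%})")
print(f"expected rejections in {SESSIONS} sessions: {SESSIONS * p_reject:.1f}")
print(f"total presentations: {total}")
print(f"score from always answering 'Different': {always_different_score:.1%}")
```

Under these assumptions the rule rejects 6 of the 128 possible sequences, or two to three sessions out of 56. A listener who always answered "Different" would have scored about 57% rather than 50%, which is why the imbalance matters when interpreting the results.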
The best blind test, therefore, is when the listener is not aware he is taking part in a test. A mindwipe before each trial, if not actually illegal, would inconvenience the listeners (what would you do with the army of zombies that you had created?), but an elegant test of hi-rez digital performed by Philip Hobbs at the 2007 AES Conference in London achieved just this goal.
To cut a long story short, the listeners in Hobbs's test believed that they were being given a straightforward demo of his hi-rez digital recordings. However, while the music started out at 24-bit word lengths and 88.2kHz sample rates, it was sequentially degraded while preserving the format until, at the end, we were listening to a 16-bit MP3 version sampled at 44.1kHz at a 192kbps bit rate.
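To give a sense of how much data was being discarded along the way, here is a rough calculation of the bit rates at the two ends of that chain. It assumes two-channel stereo throughout, which the text implies but does not state; the helper name `pcm_kbps` is purely illustrative.

```python
CHANNELS = 2  # assumption: stereo

def pcm_kbps(bits, sample_rate_hz, channels=CHANNELS):
    """Raw (uncompressed) PCM data rate in kilobits per second."""
    return bits * sample_rate_hz * channels / 1000

hi_rez = pcm_kbps(24, 88_200)   # starting point of the demo
cd_rate = pcm_kbps(16, 44_100)  # 16-bit/44.1kHz PCM
mp3 = 192.0                     # MP3 bit rate given in the text

print(f"24-bit/88.2kHz PCM: {hi_rez:.1f} kbps")
print(f"16-bit/44.1kHz PCM: {cd_rate:.1f} kbps")
print(f"MP3:                {mp3:.1f} kbps")
print(f"hi-rez to MP3 ratio: {hi_rez / mp3:.2f}:1")
```

On these assumptions, the final MP3 carried roughly one twenty-second of the data rate of the original hi-rez files.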
This was a cannily designed test. Not only was the fact that it was a test concealed from the listeners, but organizing the presentation so that the best-sounding version of the data was heard first, followed by progressively degraded versions, worked against the usual tendency of listeners to a strange system in a strange room: to increasingly like the sound the more they hear of it. The listeners in Philip's demo would thus become aware of their own cognitive dissonance. Which, indeed, we did.
Philip's test worked with his listeners' internal models, not with the sound, which is why I felt it elegant. And, as a publisher and writer of audio component reviews, I am interested only peripherally in "sound" as such (footnote 5); what matters more is the quality of the reviewer's internal constructs. And how do you test the quality of those constructs?
The Art of Reviewing
That 1982 test of preference of LP vs CD forced me to examine what exactly it is that reviewers do. When people say they like something, they are being true to their feelings, and that like or dislike cannot be falsified by someone else's incomplete description of "reality." My fundamental approach to reviewing since then has been to, in effect, have the reviewer answer the binary question "Do you like this component, yes or no?" Of course, he is then obliged to support that answer. I insist that my reviewers include all relevant information because, as I have said, when it comes to someone's ability to construct his or her internal model of the world outside, everything matters.
For example: in a recent study of wine evaluation, when people were told they were drinking expensive wine, they didn't just say they liked it more than the same wine when they were told it was cheap; brain scans showed that the pleasure centers of their brains lit up more. Some have interpreted the results of this study as meaning that the subjects were being snobs: that they decided that if the wine cost more, it must be better. But what I found interesting about this study was that this wasn't a conscious decision; instead, the low-level functioning of the subjects' brains was affected by their knowledge of the price. In other words, the perceptive process itself was being changed. When it comes to perception, everything matters; nothing can safely be discarded.
In my twin careers in publishing and recorded music, the goal is to produce something that people will want to buy. This is not pandering, but a reality of life: if you produce something that is theoretically perfect, but no one wants it or appreciates it enough to fork over their hard-earned cash, you become locked in a solipsistic bubble. The problem is that you can't persuade people that they are wrong to dislike something. Instead, you have to find out why they like or dislike something. Perhaps there is something you have overlooked.
For the second part of this lecture, I will examine some "case studies" in which the perception doesn't turn out as expected from theory. I will start with recording and microphone techniques, an area in which I began as a dyed-in-the-wool purist, and have since become more pragmatic.
Footnote 3: For a long time, I've felt that the difference between an "objectivist" and a "subjectivist" is that the latter has had, at one time in his or her life, a mentor who could show them what to listen for. Raymond was just one of the many from whom I learned what to listen for.
Footnote 4: This is a familiar problem in publishing, where it is well known that the writer of an article will be that article's worst proofreader. The author knows what he meant to write and what he meant to say, and will actually perceive words to be there that are not there, and miss words that are there but shouldn't be. The ideal proofreader is someone with no preconceptions of what the article is supposed to say.
Footnote 5: My use of the word sound here is meant to describe the properties of the stimulus. But strictly speaking, sound implies the existence of an observer. As the philosophical saw asks, "If a tree falls in the forest without anyone to observe it falling, does it make a sound?" Siegfried Linkwitz offered the best answer to this question on his website: "If a tree falls in the forest, does it make any sound? No, except when a person is nearby that interprets the change in air particle movement at his/her ear drums as sound coming from a falling tree. Perception takes place in the brain in response to changing electrical stimuli coming from the inner ears. Patterns are matched in the brain. If the person has never heard or seen a tree falling, they are not likely to identify the sound. There is no memory to compare the electrical stimuli to."