## The Highs & Lows of Double-Blind Testing

*increased* statistical power. This was explained in my letter which said, "...the probability of finding an audible difference, referred to as 'power,' equals 1 minus the probability of Type 2 error." It follows, therefore, that a reduction in the probability of Type 2 error will always be accompanied by an increase in statistical power.
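The complementary relation between power and Type 2 error is easy to verify numerically. The sketch below uses a hypothetical 16-trial listening test analyzed with a one-tailed exact binomial test at the .05 level, and assumes a listener whose true probability of a correct answer is .7; these figures are illustrative only and are not taken from the exchange:

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical 16-trial test, one-tailed at the .05 level: find the
# smallest number of correct answers that reaches significance.
n = 16
crit = next(k for k in range(n + 1) if binom_tail(n, k, 0.5) <= 0.05)

# Assume the listener's true chance of a correct answer is .7.
p_true = 0.7
power = binom_tail(n, crit, p_true)   # P(correctly detecting the difference)
beta = 1 - power                      # P(Type 2 error)
# Reducing beta necessarily raises power, because power = 1 - beta.
```

With these assumed numbers the test requires 12 of 16 correct, and power comes out below one-half: a concrete instance of the low-power problem the letter is arguing about.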

A reader who cannot decide who is correct on this matter is advised to consult any textbook on elementary statistics.

**Type 2 Error and Concluding that Differences are Inaudible (footnote 3):** When listening data fail to reach statistical significance, what conclusion should the investigator reach? Investigators may reasonably differ. Some will conclude that differences are inaudible. Mr. Clark states, "...we never formally conclude that any difference is inaudible." (The underlined quote from the McIntosh review above seems to me to reach the very conclusion that Mr. Clark says he never reaches. I am content for the reader to make his or her own decision.)

The importance of Mr. Clark's arguing that he never concludes that differences are inaudible is this: in my letter I characterized a Type 2 error as concluding that differences are inaudible when they are, in fact, audible. So, if one never concludes that differences are inaudible, then one cannot make a Type 2 error. Hence, Mr. Clark is not interested in the Type 2 error probabilities in the table from my paper because they are not, according to him, relevant to his research.

So, when data are nonsignificant, what does Mr. Clark conclude? He says, "We may make some informal statements of our opinions or we may issue a challenge..." I'm not sure what all that means so it is fortunate that, statistically, it doesn't matter! To see why it doesn't matter, and how Type 2 error applies to Clark's strategy (footnote 4), we need to go beyond the nontechnical characterization of Type 2 error presented in my letter to the formal definition of Type 2 error presented in my paper.

Type 2 error is formally defined as not rejecting the null hypothesis (often called the test hypothesis) when it is false. The decision not to reject the null hypothesis would be made when the data are not statistically significant. (You can see why I avoided this formal definition in my letter.) The point is that Type 2 error obtains its formal meaning within the context of a statistical model, a model which must be "interpreted" in order to use it to analyze an experiment. So it is possible to come up with different "interpretations."
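The formal definition can be made concrete with a small simulation. In the sketch below (the 16 trials and the .7 true probability are illustrative assumptions, not figures from the exchange), a Type 2 error occurs whenever the simulated test comes out nonsignificant even though the listener genuinely hears a difference:

```python
import random
from math import comb

def p_value(n, k):
    """Exact one-tailed binomial p-value: P(X >= k) under the null p = .5."""
    return sum(comb(n, i) * 0.5**n for i in range(k, n + 1))

def one_test(n, p_true, rng, alpha=0.05):
    """Simulate one n-trial listening test; True means 'significant'."""
    correct = sum(rng.random() < p_true for _ in range(n))
    return p_value(n, correct) <= alpha

# A listener with a genuine .7 chance of a correct answer: how often does
# a 16-trial test nonetheless come out nonsignificant, forcing the
# investigator NOT to reject the (false) null hypothesis?
rng = random.Random(1)
runs = 2000
nonsig = sum(not one_test(16, 0.7, rng) for _ in range(runs))
type2_rate = nonsig / runs   # hovers near the exact value of roughly .55
```

The simulated rate approximates the exact Type 2 error probability the binomial model gives for these assumed numbers.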

When data are nonsignificant, one scientist may conclude that differences are inaudible, another may conclude that it is wiser to withhold judgment (because, for example, it is always possible that ancillary equipment used in the listening test masked otherwise audible differences), another may decide to issue challenges, and a fourth scientist may decide to have spare ribs for dinner. These four scientists, having decided what interpretation to make when listening data are nonsignificant, may be interested in the probability that their significance test will label data as nonsignificant when differences are audible, forcing them to make that interpretation rather than correctly conclude that differences are audible.

For example:

• Scientist 1 wants to know the risk of concluding that differences are inaudible when differences are, in fact, audible.

• Scientist 2 wants to know the risk that he will withhold judgment when differences are, in fact, audible.

• Scientist 3 wants to know the risk that he will issue challenges when differences are, in fact, audible.

• Scientist 4 is beneath contempt because he is eating spare ribs while I am hungry and writing this damn letter!

All four risks above will be exactly equal in size. They may be viewed as various scientific interpretations of Type 2 error because, in each case, the differences are audible but nonsignificant results force the scientist to do something other than conclude that differences are audible. But whether or not we use the label "Type 2 error" for the four risks above isn't really important. What is important is whether the scientist or the reader of the scientist's research report would like to know those risks and how they can be found.
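That the four risks are exactly equal is easy to verify: each is simply the probability of nonsignificant data given an audible difference, computed once and merely relabeled. A sketch, again with an illustrative 16-trial binomial test and an assumed .7 true probability of a correct answer:

```python
from math import comb

def prob_nonsignificant(n, p_true, alpha=0.05):
    """P(the one-tailed exact binomial test comes out nonsignificant)
    when the listener's true probability of a correct answer is p_true."""
    p_value = lambda k: sum(comb(n, i) * 0.5**n for i in range(k, n + 1))
    return sum(comb(n, k) * p_true**k * (1 - p_true)**(n - k)
               for k in range(n + 1) if p_value(k) > alpha)

# One number, several labels: whatever the scientist chooses to do with a
# nonsignificant result, the risk of being forced to do it is the same.
risk = prob_nonsignificant(16, 0.7)
for action in ("conclude inaudible", "withhold judgment", "issue a challenge"):
    print(f"P(must {action!r} although differences are audible) = {risk:.3f}")
```

The interpretation varies from scientist to scientist; the probability does not.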

For example, I should think Mr. Clark would like to know the risk that his statistical analysis will label listening test data as nonsignificant when differences are audible and thereby force him to mumble incantations or issue challenges rather than correctly conclude that differences were audible. To please Mr. Clark, however, I won't label this risk "Type 2 error." But if Mr. Clark or a reader wished to know the exact size of this risk, they can look it up in the table in my paper, under the heading "Type 2 Error."
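A table of such risks can be computed directly from the binomial model. The sketch below is illustrative only and does not reproduce the actual table from the paper; the trial counts and true probabilities shown are assumptions:

```python
from math import comb

def type2_risk(n, p_true, alpha=0.05):
    """P(Type 2 error) for a one-tailed exact binomial test at level alpha."""
    p_value = lambda k: sum(comb(n, i) * 0.5**n for i in range(k, n + 1))
    return sum(comb(n, k) * p_true**k * (1 - p_true)**(n - k)
               for k in range(n + 1) if p_value(k) > alpha)

# Illustrative table (NOT the table from the paper): rows are numbers of
# trials, columns are assumed true probabilities of a correct answer.
print("  n   p=.6    p=.7    p=.8")
for n in (16, 25, 50):
    row = "   ".join(f"{type2_risk(n, p):.3f}" for p in (0.6, 0.7, 0.8))
    print(f"{n:3d}  {row}")
```

As expected, the risk shrinks as the number of trials grows, which is why the table in the paper matters to anyone reading a report of a nonsignificant listening test.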


**Whose Statistic Is It Anyway?:** In one part of his letter, Mr. Clark refers to "Leventhal's statistics..." In another part he refers to "his (Leventhal's) analysis method." These quotes suggest that the statistical material regarding Type 2 error and power in my letter and in my paper was my own invention, foisted upon an innocent world. This is a fundamental misunderstanding. The concepts of Type 2 error and power are as basic and well known in statistics as Ohm's Law is in engineering, and what I attempted to convey was the well-known, mathematically derivable implications for Type 2 error and power of the statistical method *chosen by David Clark*.

It was Greenhill and Clark who chose to analyze their data with a statistical test of significance based on the binomial distribution and to adopt the 95% level of confidence (.05 level of significance), not me. (I would have analyzed the data with a "confidence interval," not a statistical test of significance. A confidence interval provides all the information provided by a significance test and then some. And it requires less statistical background to interpret properly.)
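To illustrate the confidence-interval alternative, one common choice for a binomial proportion is the Wilson score interval (the specific interval method and the data below, 11 correct of 16 trials, are my assumptions, not figures from Greenhill and Clark):

```python
from math import sqrt

def wilson_interval(correct, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Illustrative data: 11 correct of 16 trials, which a one-tailed exact
# binomial test at .05 would call nonsignificant.
lo, hi = wilson_interval(11, 16)
```

The interval contains .5, agreeing with the nonsignificant verdict, but it also shows that the data remain compatible with quite high listening ability, which is exactly the extra information a bare significance test throws away.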

Footnote 3: Some of my summary of Mr. Clark's position in this section is based upon his correspondence with me and a brief conversation with him.

Footnote 4: Section 2.3 (Correct and Incorrect Decisions) of my paper also discusses this issue.
