Part II
Last month, we began considering one of the weirdnesses that underlie blind tests, chance. We noted that in order for a test finding to be distinguishable from chance, whether or not we KNOW we can hear the effect, the test results we get must be really improbable through chance selection. We observed that this is fairly easy to do when we run a lot of test trials, like 10,000 or so. We finished up by noting that in tests with small numbers of trials, it gets a lot harder to separate chance from positive identification, because the two realms often overlap, especially when listening for the kind of small differences where we are likely to make a fair number of mistakes even when we can usually pick out the difference.
We also observed that the study of probabilities lives in a dismal swamp of confusion. The truth about that dismal swamp is that while it is quite predictable in the long term, it can be quite variable in the short term. So, we have to expect, in the short term, that those dumb luck answers may very well appear to be the result of Golden Ears.
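For the curious, here is a minimal sketch of that long-term versus short-term behavior, assuming every answer is an independent 50/50 guess. The function name `guess_fraction` and the seed value are made up for illustration; nothing here comes from an actual listening test.

```python
import random

def guess_fraction(trials, rng):
    """Fraction of `trials` coin-flip guesses that happen to be correct."""
    hits = sum(rng.random() < 0.5 for _ in range(trials))
    return hits / trials

rng = random.Random(2024)  # arbitrary seed (an assumption) so the run is repeatable

# Eight short "listening tests" of ten guesses each, then one very long one.
short_runs = [guess_fraction(10, rng) for _ in range(8)]
long_run = guess_fraction(10_000, rng)

print("eight 10-trial sets:", short_runs)
print("one 10,000-trial set:", long_run)
```

The ten-trial sets should wander noticeably from one to the next, while the 10,000-trial set should sit very close to 0.5.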
How can we tell the difference between the two?
Statisticians use a range called the “standard error” to describe this. Without making you actually wade through the math mud of this particular bog, suffice it to say that a “standard error” describes something about the probable range of test scores coming out correctly by dumb luck for any given number of trials. Skluck! Sheesh!
For “one” standard error, we can say that there is a 67% probability that the number of “correct” scores occurring by chance in ten trials will fall between 34% and 66%. This means that it is “probable” (two chances in three) that any score in the range of roughly three to seven correct could be due to dumb luck (i.e. indistinguishable from chance), and it is “mildly improbable” (one chance in three) that dumb luck would put the score outside that range, whether above it or below it. Skluck! Whew!
It should be obvious that we can never be “completely sure” that any given score ISN’T due to dumb luck. However, we’d like to have reasonable “confidence” that dumb luck isn’t the basis of our finding of audibility.
So we raise the bar. We look for the range of scores we can be reasonably sure isn’t due to dumb luck. “Two” standard errors yield a 95% probability that the number of “correct” scores occurring by chance in ten trials will fall between 18% and 82%. Dumb luck lands outside that range only about 5% of the time, so for a score greater than 82% in such a test, we can be reasonably (about 95%) sure it isn’t dumb luck. But note that as we established more confidence in our findings by going to two standard errors, we had to accept a greater range of values, 18-82% instead of 34-66%, that could be dumb luck. Skluck! Dayumn!
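If you'd rather let a computer wade through the math mud, the ranges quoted so far can be sketched like this. It assumes every trial is an independent 50/50 guess and uses the usual normal approximation; the function name `dumb_luck_range` is invented for this example.

```python
import math

def dumb_luck_range(trials, num_errors=2.0):
    """Return (low, high) percent-correct bounds expected from pure guessing.

    Assumes each trial is an independent 50/50 guess, so one standard error
    of the percent-correct score is 100 * sqrt(0.25 / trials).
    """
    standard_error = 100.0 * math.sqrt(0.25 / trials)
    low = max(0.0, 50.0 - num_errors * standard_error)
    high = min(100.0, 50.0 + num_errors * standard_error)
    return low, high

one = dumb_luck_range(10, 1)  # one standard error, ten trials
two = dumb_luck_range(10, 2)  # two standard errors, ten trials
print(f"1 standard error:  {one[0]:.0f}% to {one[1]:.0f}%")   # 34% to 66%
print(f"2 standard errors: {two[0]:.0f}% to {two[1]:.0f}%")   # 18% to 82%
```

Feeding in a different number of trials shows how the range tightens as the test gets longer.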
As the number of trials changes, these ranges change. For five trials, two standard errors (95% confidence) yield a dumb luck range essentially between 0 and 5, meaning that there is no set of answers for five trials that is distinguishable from chance with 95% confidence! Skluck! Yikes!
For twenty trials, the dumb luck range (with 95% confidence) is from six to fourteen correct answers, or 30% to 70% (28% – 72%, actually). For 100 trials, the dumb luck range is 40% - 60%.
In general, statisticians would like the probability of dumb luck to be less than 5%, which is to say that we would like to be at least 95% sure that it ISN’T chance. Hence, it has become the norm to use two standard errors to determine the range of answers that we will accept as being “possibly due to dumb luck,” even though some of those answers may be improbable. Skluck! Glub!
So let’s go over the implications of those probabilities once again, and then I’ll leave you in peace until the final exam (isn’t this stuff fun?).
One of the things we can count on is that in any trial set it is probable that the number of correct answers obtained by guessing or chance will diverge from 50%. This means that in any group of trials, it is probable, by chance, that a certain range of correct answers will occur. If we do four trials, in 67% of the sets of trials the correct answers due to guessing may range from 1 to 3, and in 95% of the trial sets, from 0 to 4! This means that we can never be 95% sure, using only four trials, that our acing of the ABX trials is distinguishable from chance, in a statistically reliable and robust sense.
For ten trials, we can expect that in 67% of such trial sets, the number of correct answers we get by chance will fall between 3 and 7, and for 95% of such trial sets, the number of correct answers that we get by dumb luck will fall between 2 and 8. Uh-oh!
Catch this! If we get 8 out of 10 answers right, it may still be by chance. It is possible, by dumb luck only, that we will get up to 8 out of 10 answers correct, in a way that will make statisticians say, “Yeah, that happens often enough, one time out of twenty, anyway, that you can’t rule it out.”
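You can check that "one time out of twenty" figure by simply counting outcomes. This is a sketch under the assumption that every answer is an independent coin-flip guess; the helper name `chance_of_at_least` is made up for this example, and it counts only the lucky-high side (eight or more right).

```python
from math import comb

def chance_of_at_least(correct, trials):
    """Probability of scoring `correct` or better out of `trials` by pure guessing."""
    # Count every way to get `correct`, `correct + 1`, ... `trials` answers right,
    # out of the 2 ** trials equally likely right/wrong patterns.
    favorable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favorable / 2 ** trials

p = chance_of_at_least(8, 10)
print(f"8 or more right out of 10 by luck: {p:.1%}")  # 5.5%
```

That works out to 56 lucky patterns out of 1,024, a bit more than one time in twenty.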
If we go to 100 trials, it gets a little better. Now, the range of possible correct scores in 67% of the trial sets is between 45% and 55%, and in 95% of the trial sets is only 40-60%. If we get eighty out of one hundred test trials right, it is clearly distinguishable from chance in a way that is not possible when we only do ten trials and get eight of ’em right.
And if we go to 10,000 trials (now we’re getting somewhere, in this dismal swamp), if we get 53% right, it is clearly distinguishable from chance.
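One way to put these comparisons on the same footing is to ask how many standard errors a score sits away from the 50% we'd expect from guessing. Here is a rough sketch, again assuming independent 50/50 guesses; the function name `standard_errors_from_chance` is just an illustration.

```python
import math

def standard_errors_from_chance(correct, trials):
    """How many standard errors a score sits above (or below) pure guessing.

    Assumes independent 50/50 guesses: the expected score is trials / 2 and
    one standard error is sqrt(trials * 0.25) correct answers.
    """
    expected = trials / 2
    standard_error = math.sqrt(trials * 0.25)
    return (correct - expected) / standard_error

for correct, trials in [(8, 10), (80, 100), (5300, 10000)]:
    z = standard_errors_from_chance(correct, trials)
    print(f"{correct} of {trials}: {z:.1f} standard errors above chance")
# 8 of 10: 1.9, 80 of 100: 6.0, 5300 of 10000: 6.0
```

Eight out of ten falls just short of the two-standard-error bar, while 80 out of 100 and 53% of 10,000 both clear it by a wide margin.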
So, beware of those quickie little informal listening tests, where your Golden Ears can pick it out just fine five times in a row, or eight times out of ten. You may be positive you heard it, but it still could be just plain ol’ dumb luck! Ya never know. And that’s the problem!
Thanks for listening. Skluck!