Moulton Laboratories
the art and science of sound
The Wacky World of Blind Testing
Dave Moulton
November 1999

An introduction to reducing bias in our sound judgments.

Alert readers will recall that last month I began a discussion of how we try to determine what it is that us humans can really hear. I noted that people often report hearing “awesome” differences between small changes (like 16 to 20 bits, for instance), but that when they are tested in a “blind” way, they can’t seem to make out the difference at all. How can this be? This month, we’ll take a quick look at blind testing.

To begin with, us humans are quirky creatures. Our conscious and subconscious minds behave in unruly and pesky ways. Our perceptions of the world around us are multi-faceted, complex, cross-integrated, extraordinarily rich in information, and also heavily “edited.” In fact, our raw perceptual information is overwhelmingly complex and noisy, and some sort of perceptual “editing” is necessary to keep us from drowning in sensory noise.

With the editing that occurs at various neurological and pre-cognitive stages of our perception, however, we also experience a loss of, well, objectivity. Our edited perceptions tell us quickly what we think we need to know. Such perceptions take into account ALL of what we think we need to know about the stimulus in question. Sometimes these edited perceptions get us into trouble.

To give you an example, let’s go through the following exercise (this actually comes from a test I did, but with the names changed slightly to protect both innocent and guilty). Say, out loud, the following sentence:
Compared to Microphone B, Microphone A sounds really dull and lifeless.
Not very hard to say, is it? Anybody could say that, and mean it, too. Now, let’s try another sentence. Say this one, loud and clear (I dare you!):
Compared to a Shure SM-58, a Neumann U-49 sounds really dull and lifeless.
A little harder to say, isn’t it? Especially if you have a knowledgeable colleague within earshot.

Why is this?

It is because we “know,” as a matter of professional competence, that Neumann microphones are better-sounding than Shure microphones. It is professionally incompetent to say otherwise. (Ah, the power of brand identities . . .)

How do we know? From our own experience with microphones (whether or not we’ve ever directly compared a Shure to a Neumann), gossip, group-think, perceived value, relative cost, mythology, etc. This sum total of “knowledge” helps us quickly make the acceptable professional decisions which allow us to survive.

This sum total of “knowledge” is also the basis for bias and prejudice. For better and worse, it is lacking in what we like to think of as scientific rigor and objectivity.

Welcome to blind testing.

When we wish to determine the answer to the question: “All other things being equal, which microphone sounds better to human listeners, a Shure SM-58 or a Neumann U-49?”, the “all other things being equal” part of the question (which is seldom stated, but always implied) requires that we take steps to make sure all other things ARE equal when we perform an experiment that compares the two microphones. Because we already “know” that Neumanns are better than Shures, that knowledge colors our perceptions.

How strong is this effect? Overwhelming might not be too strong a word. In 1994, Floyd Toole published a loudspeaker study which found that brand identity was the strongest force affecting perceived audio quality, far stronger than audio performance.

So, we conceal the identity of the devices under test, and rename them with neutral names such as Microphone A and Microphone B. This is what we mean by “blind testing.” The test listener should not know the identity of the items under test. Then, when he or she reports that, “compared to Microphone A, Microphone B sounds really dull and lifeless,” we can reasonably assume that he or she is not drawing on other, prejudicial “experience” in making that judgment.

Sadly, this isn’t enough. It turns out that us humans are pernicious and sneaky enough to screw things up anyway. If the person administering the listening test knows the identity of the two microphones, he or she may, by either inadvertent or advertent conduct, cause the blind listener to prefer one microphone over the other. So, if we are going to be thorough about this, we also need to conceal the identity of the microphones from the person administering the test. This is what we mean by “double-blind testing.”
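As an illustration of the mechanics, here is a minimal sketch in Python (the function name and device strings are my own, hypothetical choices, not anything from the column) of how the neutral labels might be assigned by a script rather than a person, so that neither the listener nor the administrator knows which device is which until the sealed key is opened:

```python
import random

def assign_blind_labels(devices, rng=None):
    """Randomly map neutral labels ("A", "B", ...) onto the devices
    under test. The returned key is the only record of the mapping;
    in a double-blind test it stays sealed until scoring is done."""
    rng = rng or random.Random()
    shuffled = list(devices)
    rng.shuffle(shuffled)
    labels = [chr(ord("A") + i) for i in range(len(shuffled))]
    return dict(zip(labels, shuffled))

# Generate the sealed key once; neither the listener nor the
# administrator sees it during the listening session.
key = assign_blind_labels(["Shure SM-58", "Neumann U-49"])
```

A listener’s report that “Microphone B sounds really dull and lifeless” is then matched against the key only after all trials are complete.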

So, if we are going to rule out prejudices and biases, however well-founded they might be, we need to use double-blind testing, or so it seems. This requires a lot of complexity and expense, but at least we’ve found out a truth that stands up “when all other things are equal,” which is a cornerstone of scientific method.
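The column doesn’t go into how the results of such a test are scored, but the standard approach is to count correct identifications over many forced-choice trials and ask how likely that score would be by pure guessing. A minimal sketch (the 12-of-16 figures are made up for illustration):

```python
from math import comb

def p_value_at_least(correct, trials, chance=0.5):
    """One-sided binomial test: the probability of scoring `correct`
    or more right answers in `trials` two-way forced choices if the
    listener is guessing at random."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# A listener who gets 12 of 16 comparisons right:
p = p_value_at_least(12, 16)
# p is about 0.038 -- unlikely enough under pure guessing (below
# the usual 0.05 threshold) to conclude a difference was heard.
```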

Unfortunately, this all gets a little tougher when we start trying to measure REALLY small differences, or trying to find out if there is an audible difference at all. When we compare two microphones, the differences are generally pretty large and at least reasonably obvious, both by objective measurement and perceived sound quality. When we go digging for the real or imagined differences between, say, 20-bit and 24-bit words, or two audio cables, the acoustical or electronic measured objective differences are really very small. And with such small differences, the complexity of the blind test begins to actually affect the results.

It boils down to this obvious but inescapable fact: it is harder to correctly answer questions whose answers we don’t know than questions whose answers we do know. Setting aside the obvious issues of prejudice, bias and cheating for a moment, we will get “correct” answers more often when we “know” the answers than when we don’t. I’ve seen this effect a lot when doing my Golden Ears seminars (I publish a set of audio ear training CDs called “Golden Ears,” and often present ear-training seminars using them). Listeners asked to identify the difference between two versions of the same recorded excerpt will have real trouble, at first, hearing that one version is 3 dB louder than the other. Once they are told and shown that such a difference exists, they find it “obvious.”
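For a sense of scale, the 3 dB step mentioned above is one of the larger differences listeners are asked about. A short conversion (a sketch of standard decibel arithmetic, not something from the column) shows what it corresponds to in signal terms:

```python
def db_to_ratios(db):
    """Convert a level difference in decibels to the equivalent
    power ratio and amplitude ratio."""
    power = 10 ** (db / 10)      # dB is 10*log10 of a power ratio
    amplitude = 10 ** (db / 20)  # ...or 20*log10 of an amplitude ratio
    return power, amplitude

power, amplitude = db_to_ratios(3.0)
# A 3 dB difference is roughly a doubling of power (~1.995x)
# and about a 1.41x change in signal amplitude.
```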

So, when we try to measure really small differences, we can reasonably expect to find (and do in fact find) that blind (and double-blind) tests yield more negative results than non-blind tests, due both to bias effects and also to the confidence effect of “knowing” the answer. The insight is that blind tests are “harder” than “sighted” tests.

Interestingly, this doesn’t mean that blind tests are necessarily more rigorous than sighted tests. Because they represent a testing context that is comparatively more difficult, and because they represent a listening situation that is different from the end-use situation where we wish to apply our findings, the findings from blind tests may not prove to be perfectly reproducible or relevant. What we CAN be sure of is that BLIND TESTS ARE NOT CONTAMINATED BY BIAS EFFECTS. Usually, that benefit more than outweighs the small error caused by the “lack of confidence” effect.

So, for these very small differences, I personally take the pragmatic (and lazy) approach. I use blind testing, and figure any errors due to “loss of confidence” are so small that they aren’t worth worrying about. It generally works pretty well, and relieves me of having to account for the small error that effect introduces. Such pragmatism is widely accepted in the testing community, because the alternative – sorting out the bias effect present in sighted tests – is prohibitively expensive and time-consuming, if it is even possible.

So, I recommend that you depend on blind (or better, double-blind) testing to find out answers to questions about the audibility of effects like 96 kHz sampling rates or 24-bit words.

In the next issue, we’ll look at our curious choice of words like “amazing” to describe small differences that we can barely hear. Thanks for listening.
Note: The following group of columns that I wrote for TV Technology are an attempt on my part to describe some of the issues surrounding our attempts to measure and evaluate the audibility of high-resolution formats. Together, I think they make an excellent short survey of these issues. I hope you find them useful.

COMMENTS

Panyu, China     Mar 25, 2012 11:10 AM
Hello Dave,
I enjoyed your post about The Wacky World of Blind Testing. There are very few personalities (living mortals) behind a name brand, and even fewer persons who know their stuff. I'm happy you are here and offering intelligent diction and substance to audio. Thank you.

Philip Richardson
