Moulton Laboratories
the art and science of sound
Subjective Testing in Your Own Home Studio
Originally published in Recording, approx. August 2001
by Dave Moulton

So You Wanna Find Out If You Can Hear 24/96 Audio


Understanding The Results

In theory, referring back to the Differentiation Test, if you can “hear” a difference, all identifications of X should be correct. If you can’t “hear” the difference, you should get “approximately” half the answers correct (which is the same as you would get if you simply determined your answers by flipping a coin – this is what is known as “indistinguishable from chance”).

In reality, you seldom get all the answers right (this is a fundamental characteristic of doing a cognitive task while being human) and unless you do a GREAT MANY trials (say, 10,000), you won’t get approximately 50% right either. This is where the dismal swamp called probabilities enters the scene. In order to make sense of the data (for any score other than 100% or 50%, which are both unequivocal), you need to know a little more.

Without bothering you with the reasoning or the math, I can simply tell you that when you have twenty-five trials where you can either be “right” or “wrong,” if you get more than 70% “right” (18 or more), then there is a 95% probability (which means it’ll happen this way 19 out of 20 times) that your answers are NOT due to chance. We usually call this “audible,” and it is the standard that is normally used to eliminate the possibility that the results are due to chance.
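For readers who do want a peek at the math, here is a minimal sketch of one way to check that figure. It assumes an exact one-sided binomial calculation with a 50/50 guess on every trial; the article doesn't say which method produced its numbers, and the function name `chance_tail_probability` is made up for illustration.

```python
from math import comb


def chance_tail_probability(correct: int, trials: int) -> float:
    """Probability of scoring at least `correct` out of `trials` by pure guessing.

    Assumption (not from the article): each trial is an independent coin flip,
    so the score follows a binomial distribution with p = 0.5.
    """
    # Count every sequence of right/wrong answers with `correct` or more hits...
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    # ...and divide by the total number of equally likely sequences.
    return favourable / 2 ** trials


if __name__ == "__main__":
    for correct in (17, 18):
        p = chance_tail_probability(correct, 25)
        print(f"{correct} or more right out of 25 by guessing: {p:.3f}")
```

Under that assumption, pure guessing produces 18 or more right out of 25 only about 2% of the time, while 17 or more turns up a bit over 5% of the time, which is consistent with drawing the line at 18.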

Meanwhile, if (with 25 trials) you get less than 70% but more than 60% right (15 to 17), then it is “probable” that your results are not chance (there is at least a 67% probability, or 2 chances out of 3). Generally, this is regarded as shaky enough terrain that we don’t like to claim that it DEFINITELY ISN’T chance, even though it “probably” isn’t. The standard interpretation of such results is that the effect IS NOT PROVED to be “audible,” even though it probably is.

If you score less than 60% correct (with 25 trials, remember), then we say “your results are indistinguishable from chance.” This means that there is a significant possibility, even a probability, that you were not able to hear a difference. The standard interpretation of such results is that the effect is not “audible.”
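The three bands described above for a 25-trial test can be written down as a small lookup. This is only a sketch: the cutoffs (18 and 15 right answers) are the ones quoted in the text, while the function name `interpret_25_trial_score` and the wording of the returned labels are illustrative choices, not anything standard.

```python
def interpret_25_trial_score(correct: int) -> str:
    """Map a score out of 25 trials onto the article's three interpretations."""
    if not 0 <= correct <= 25:
        raise ValueError("score must be between 0 and 25")
    if correct >= 18:
        # More than 70% right: the 95% standard.
        return "audible"
    if correct >= 15:
        # Between 60% and 70% right: the 67% standard.
        return "probably audible, but not proved"
    # Less than 60% right.
    return "indistinguishable from chance"


if __name__ == "__main__":
    for score in (13, 16, 19):
        print(f"{score}/25: {interpret_25_trial_score(score)}")
```

Note that the cutoffs only hold for 25 trials; a different number of trials needs different counts.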

Here is a table showing the number of right answers needed for various numbers of trials to reach the various standards.

# trials   67% confidence, or “probably but     95% confidence, or “highly probable and
           not proved to be audible”            considered proved to be audible”
           (# right / % right)                  (# right / % right)
10         7 / 66%                              9 / 82%
25         15 / 60%                             18 / 70%
50         29 / 57%                             32 / 64%
75         42 / 56%                             47 / 62%
100        55 / 55%                             60 / 60%
1,000      516 / 52%                            532 / 53%
10,000     5050 / 51%                           5100 / 51%


Now you can see why I insisted on 25 trials. If you only do ten trials, you have to get 9 right to establish “audibility.” Also note that as the number of trials goes up, the percentage of right answers needed to establish audibility drops, reaching 51% at 10,000 trials. At 100 trials it is down to 60%, while at 10 trials it was at 90%. Happily, if YTA can be persuaded to cut another set of trials for you to do, you can add together the two sets and treat the test as having 50 trials.
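If you want to see how the required score shifts with the number of trials, the following sketch searches for the smallest score that pure guessing would reach no more than 5% of the time. It assumes an exact one-sided binomial calculation with 50/50 guessing, which is not necessarily how the table above was built, so for larger trial counts the result may land a count or so away from the published figures. The name `min_correct_needed` and the `alpha` parameter are hypothetical.

```python
from math import comb
from typing import Optional


def min_correct_needed(trials: int, alpha: float = 0.05) -> Optional[int]:
    """Smallest score whose chance of occurring by guessing is at most `alpha`.

    Assumption (not from the article): independent 50/50 guesses on each trial.
    Returns None when even a perfect score is too likely to be luck.
    """
    total = 2 ** trials          # number of equally likely answer sequences
    tail = 0                     # sequences with `correct` or more hits
    needed = None
    # Walk down from a perfect score, accumulating the upper tail.
    for correct in range(trials, -1, -1):
        tail += comb(trials, correct)
        if tail / total > alpha:
            break                # this score is too easy to reach by luck
        needed = correct
    return needed


if __name__ == "__main__":
    first_set, second_set = 25, 25          # two sets of trials, pooled
    for n in (10, first_set, first_set + second_set):
        print(f"{n} trials: need at least {min_correct_needed(n)} right")
```

For 10 and 25 trials this gives 9 and 18, in line with the table. Pooling two 25-trial sets simply means adding the right answers from both and judging the total against the 50-trial requirement.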

When you do the Preference Test, you can score it similarly, and treat the probability of the Preference being real as a function of that score. Let’s suppose that out of 25 trials, the DUT is preferred 65% of the time. This means that it is probable but not proven that the DUT will be preferred in general. If it is preferred more than 70% of the time, then you can be pretty much assured that it will be preferred, and if it scores less than 60%, you can assume that there is no significant preference for it.

You can’t use any of these statistics for the Quality Test, except to assess by the scores on the Preference Test how probable it is that the Quality Ratings are meaningful. For the Quality Test, you simply average the scores of each DUT and the Ref. Those relative scores give you only a very rough indication of the perceived quality of each device. For them to be significant, I’d expect to see a whole rating point difference. Us testing folks use a batch of math tricks to make those ratings more reliable, but that is way beyond this article.
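The averaging step is simple enough to do on paper, but here is a rough sketch of it. The ratings below are invented purely for illustration, and the function name `compare_quality` and the `min_gap` parameter are hypothetical; the only rule taken from the text is that a gap of less than one whole rating point shouldn't be treated as significant.

```python
from statistics import mean


def compare_quality(dut_ratings, ref_ratings, min_gap=1.0):
    """Average each set of ratings and flag whether the gap looks meaningful.

    `min_gap` defaults to one whole rating point, the rule of thumb in the text.
    """
    dut_avg = mean(dut_ratings)
    ref_avg = mean(ref_ratings)
    meaningful = abs(dut_avg - ref_avg) >= min_gap
    return dut_avg, ref_avg, meaningful


if __name__ == "__main__":
    # Hypothetical ratings, made up for this example only.
    dut = [5, 4, 5, 4, 5]
    ref = [4, 5, 4, 4, 5]
    dut_avg, ref_avg, meaningful = compare_quality(dut, ref)
    print(f"DUT average: {dut_avg:.1f}, Ref average: {ref_avg:.1f}")
    print("Meaningful difference" if meaningful else "Too close to call")
```

This is only the rough, first-pass reading; it does nothing to make the ratings more reliable.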

What You Can Claim You Know From This, and What You Can’t

Here’s where it all gets to be fun and enlightening! After you’ve done all the arithmetic, you and YTA get to pop a coupla brewskis, sit back and contemplate the meaning of it all. Let’s say you scored 62% right on the ABX test, 65% Preference for the Ref on the Preference Test, and on the Quality Rating Test the DUT got 4.6 and the Ref got 4.4.

The first thing to remember is that it isn’t YOU that got things right or wrong – it’s the DUT. You’re measuring the DUT, not you. You are simply the test instrument!

The second thing is that these results are more than a little ambiguous.

The third thing is to keep in mind that MOST such results are going to be ambiguous. If they were unequivocal, you probably wouldn’t feel the need to do the test!

In this case, it appears that the DUT (with your 62% correct IDs) is probably audible – you can probably hear a difference between the DUT and Ref. However, you can’t prove it, except by piling up more trials, maybe a lot more. This means you can claim with some real scientific authority, “In my studio, 24/96 audio is probably audible.” You can’t claim (at least scientifically), “In my studio, 24/96 audio is DEFINITELY audible.”

The next result is a puzzler. You seem to prefer the Ref (it was preferred 65% of the time in 25 trials)! How can this be? Three things:

First, the result is inconclusive – a score of 65% suggests a possible preference, not a definite one.

Second, in combination with the ambiguous finding about whether you actually HEARD a difference, you have to suspect that such a preference is fairly tenuous.

Third, it may very well be that some aspect of 16/44 audio causes it to be more attractive to you than 24/96. This isn’t bad, it just is. (For instance, producers often prefer truncation to redithering in mastering – the sound seems a little hotter, edgier, or so they say.) This possibility also supports the notion that there may be an audible difference after all.

So, what you have to say is, “In my studio, I’ve found I have a possible slight preference for 16/44 audio, but the finding is inconclusive.” What you can’t say is, “In my studio, I definitely prefer 16/44.”

Finally, looking at the Quality Rating, you find that the DUT scored a little higher (4.6) than the Ref (4.4). Now these ratings are very close, and without considerably more math work, we aren’t going to be able to claim there is much difference if any between the scores. The fact that the DUT scored higher is not terribly significant, especially in light of the Preference Score – these two scores conflict, and suggest in combination that the Preference Score may be due to chance.

What you have to say is, “16/44 and 24/96 audio received similar quality ratings in my studio.” What you can’t say is, “In my studio, 24/96 is definitely of higher quality.”

In summary, you can say the following, with fairly scientific authority, “In an array of tests in my studio, I found that there was probably an audible difference between 16/44 and 24/96 audio. The Preference and Quality Rating scores, taken together, suggest that there is little perceived difference in quality between them, if any.”

This finding should prove to be reliable and reproducible for you, in your room. That’s all. You can’t apply this finding to other studios or other listeners. It may be true elsewhere, but you have not developed the evidence to support it.

Also, keep in mind that your test protocols are still quite informal, and that your results must be regarded as tentative and anecdotal. The possibility of errors is large, and many problems may crop up in the way the tests ran. Be humble, and always hedge your bets and claims!

Fun and Profit in the New Millennium

Let’s cut to the chase. What should be REALLY important here is that YOU made the measurement in YOUR studio, so it should be valid for YOU AND YOUR STUDIO.

Even more important is that YOU did the listening. Trust me on this – there is no way, until you’ve ACTUALLY DONE this sort of listening, that you can imagine how enlightening such listening is. Forget the scores for a moment. In the 75 trials and three hours of controlled blind listening you’ve done, you’ve just gotten a graduate course in critical listening! You now know, IN YOUR OWN EARS, just how good 24/96 is, relative to 16/44, subject to a couple of limitations.

Limitation #1 is that you only measured one DUT. It is possible that something in the DUT other than its resolution affected the results.

Limitation #2 is that your results are limited by (a) the quality of your microphone and (b) the quality of your monitors and control room. If you change those, you may get different results.

It doesn’t matter much what other people say about what they’ve heard in other circumstances. If it’s any comfort, I’ve found that very little of this sort of controlled blind listening goes on in the industry, either in studios or manufacturers’ labs. There are a couple of reasons for this.

First, if we want the results to be reliable for other listeners in other spaces (a requirement you have avoided in our little exercise here – and that’s why it’s so cheap), the testing protocol and its attendant analysis become EXTREMELY difficult and expensive.

The second reason is a little sadder but equally true: we generally DON’T WANT TO KNOW THE TRUTH in regard to these things. We would prefer NOT to report “I can’t hear a difference” if somebody else is claiming they can! This is known as The Emperor’s New Stereo Syndrome.

What you’ve done, very cheaply and simply, is step beyond this. By doing so you’ve acquired some serious knowledge power. You KNOW the unvarnished blind truth (which you can keep to yourself, if you want – no need to get all the sales reps riled up telling them the bad news). And you’ll discover a new-found voice of experience in the back of your mind speaking up, whenever somebody starts claiming how awesome some tiny effect is. That fresh voice of wisdom will calmly murmur to your conscious mind, “It sure sounds like they haven’t done the blind listening. This is probably all nonsense!”

And with that thought, you can quietly put your wallet back in your pocket and mosey off down the aisle, a little richer and a little wiser.

Happy blindfolds!

Dave Moulton likes to get blind on the weekends. You can complain to him about anything at moultonlabs.com.