Do blind tests really prove small differences don't exist?

Ron Party · May 14, 2011

amirm said:
I am confused how this relates to the question you asked me. You asked does JND not apply. I said that if you use blind testing to establish your JND, then I can't use that as a assumption of what the JND is in reality outside of blind tests. Put another way, if I accept that JND has to be established with blind testing, then the next question of whether small differences are audible in blind tests is moot!

We need to establish a new protocol to get there. Using existing protocol doesn't work to prove or disprove the hypothesis at hand.

Oh no, you've got it backwards. I never asked JND does not apply. I asked whether YOU were saying it does not.

Ron Party said:
Amir are you not casting doubt on the whole concept of JND? [Emphasis added.]

If YOU posit that blind testing is not a reliable detector of JND, then YOU are rejecting the scientific method. YOU are rejecting rationalism:

amirm said:
I said that if you use blind testing to establish your JND, then I can't use that as a assumption of what the JND is in reality outside of blind tests.

This is an oxymoron. But it is a deeply profound one. You seem to be rejecting the scientific method, the most reliable method for understanding the perceptual world. More evidence of the same:

amirm said:
occasional blind tests to rule out placebo effect

In this theory you're advancing, it is only occasionally necessary to conduct blind testing to rule out placebo effect, as if placebo effect can go on holiday.

amirm · May 14, 2011

Ron Party said:
Oh no, you've got it backwards. I never asked JND does not apply. I asked whether YOU were saying it does not.

And I answered it did but that for it to work towards the theory we are discussing, it needs to use a different principal to measure it than blind tests.

If YOU posit that blind testing is not a reliable detector of JND, then YOU are rejecting the scientific method. YOU are rejecting rationalism:

We must be talking past each other Ron. This entire thread is investigating that possibility. That in blind tests, we underestimate small differences. That is a theory to be proven or disproved. I can't take the notion of JND measured in blind testing and use it as anything in this proof. Doing so means there was no reason to investigate at all.

This is an oxymoron. But it is a deeply profound one. You seem to be rejecting the scientific method, the most reliable method for understanding the perceptual world. More evidence of the same: In this theory you're advancing, it is only occasionally necessary to conduct blind testing to rule out placebo effect, as if placebo effect can go on holiday.

Please don't tie the two questions to together. They ere independent answers.

On the first, as I explained above, I am investigating -- not asserting -- how we can determine whether humans in blind tests under-report the differences heard.

The second was answering your assertion that maybe I had some new approach that the world needs to adopt. Turns out if you accept my theory, the solution is precisely what we are doing: using a mix of blind and non-blind testing. As for taking a vacation, it indeed does most of the time with trained listeners. This is why we use them and use them constantly in sighted evaluations.

andy_c · May 14, 2011

This is called special pleading.

JackD201 · May 14, 2011

Ron Party said:
A few obvious points. First, it is not circular, unless you subscribe to the notion that the scientific method itself is invalid, i.e., the perceptual world cannot be studied. If it can be studied, the scientific method is THE way.

Second, and I know you know this, but it must be emphasized that DBTs don't *prove* a difference does or does not exist. Don't lose yourself in your on-line battles with A.J. & Arny and lose sight of the basics.

Third, if you know of a better way to study the subject at hand which is even remotely as reliable, the entire scientific community is, well, all ears.

Amir in post #20 put in a few sentences what I tried to say in as many paragraphs. I sure have to learn how to be more concise.

BTW who are A.J and Arny anyway?

Actually Ron I am a very firm believer in the Scientific Method. This thread just points out however that many a time folks invoke it but don't follow it

It's the folks that make generalizations without doing the hard work that should come with it that I question. When someone tells me I can't possibly hear a difference and it turns out his basis for doing so is an informal test on a sample population the size of which he would have gotten a failing grade from his high school science teacher, I think you can see where I'm coming from.

Yes I do totally agree with you that DBTs don't prove or disprove a difference exists, only that under the conditions of the test, they either can or cannot be detected. I also must emphasize that while academic implications may not be earth shattering, the latter has great practical implications so in my view foregoing such tests in a development and evaluation process would be a mistake.

JackD201 · May 14, 2011

andy_c said:
This is called special pleading.

I don't know if this falls under special pleading.

For example the testers measured differences between two components. All the evidence is there. Now the control and the altered are sent off to be DBT'd and the results are that the test subjects couldn't tell the difference. Does this invalidate the measured results? No. It just says the subjects can't tell the difference. The logical and practical course of action would be to go build the one that is easier and/or less costly to produce or in the consumer's case, buy.

Ron Party · May 14, 2011

No, we're not posting past one another. You are rejecting the scientific method.

andy_c · May 14, 2011

JackD201 said:
For example the testers measured differences between two components. All the evidence is there.

It is evidence of measurable differences, but I assume we're talking audible differences. One could say the measurements are evidence of potential audible differences though.

This thread is a bit confusing to me because of its title, "Do blind tests really prove small differences don't exist?". Of course in general it's impossible to prove a negative as it were, so the answer "no" is almost a given. Actually, the title is a nonsequitur if one examines it critically, because the idea that one could "prove" that something which really does exist does not is nonsense. But it seems to me the intent of the discussion is to determine whether blind tests mask small audible differences. In other words, that a blind test might reach the conclusion that something is not audible when it actually is.

This issue has been thoroughly covered by Leventhal. The only free resource to his work that I know of is in this group of Stereophile Letters to the Editor. He did some peer-reviewed stuff for the AES that covers the issue in excruciating detail, as well as being very fair, but one must unfortunately purchase the articles from the AES.

amirm · May 14, 2011

Ron Party said:
No, we're not posting past one another. You are rejecting the scientific method.

What method Ron?

amirm · May 15, 2011

Andy, since there is some confusion here, let me clarify my angle.

I am not attempting to disprove the value of blind testing. I am probing at a narrow area of analysis only. I mentioned the scenario before. Here it is again.

Even though we focus on two outcomes: hearing a difference and not, I am saying there is another which is the transition point. In this area, testers by definition are unsure so they vote correctly and other times not. They also tend to second guess themselves because they are being asked to give a Yes/No answer. I believe both of these factors push the statistics to be non-conclusive. And since non-conclusive is taken as "statistically can't tell the difference" I theorize that we tend to opt for negative findings in this situation.

I further add the pre-condition that from objective measurement point, there is a solid difference. So this rules out cables, demags, rocks in the room, etc. I am not debating any of those. So let's not have an argument of who believes in "science" and who doesn't.

I further am interested in how we quantify the region. One of the ways to do that in my opinion is to scrutinize the test itself. In compression testing, we have a set of audio tests we know to be revealing. That makes it easier to push the analysis past the uncertainty point -- at least for expert listeners. The content can be shown mathematically to be revealing of differences.

I content that if you cannot show mathematically that the content is revealing of the difference, then you very well could be operating in the region of uncertainty. You simply don't know.

My hope for this thread was to not have it turn into blind test group vs the other. The use of blind testing to find significant differences to address the needs of the general public is not what I am discussing or feel is necessary to get into. I am focusing on this narrow area.

terryj · May 15, 2011

Hmm, how come ALL of the previous posts I quoted came up again, even tho I had answered them before and this is a completely new post??

amirm said:
Even though we focus on two outcomes: hearing a difference and not, I am saying there is another which is the transition point. In this area, testers by definition are unsure so they vote correctly and other times not.

I am pretty sure this is where I have had problems with your hypothesis before. I think it lies in THIS sentence...In this area, testers by definition are unsure so they vote correctly and other times not. I do get your idea of oscillation, analogous to the 1 or 0 point in digital (or the transition between them).

See, I take it that they vote correctly each time. THAT there was a difference each time, and they did not hear it, is neither here nor there.....????? If they heard it they voted so, if they did not hear it they voted so. One time they heard it (say), voted 'yes', the next time they did not hear it and voted 'no'. You say 'Well, they SHOULD have heard it, so we are getting a null result unfairly' (or words to that effect).

I am saying 'We are getting a null resultfairly because sometimes they heard it and sometimes they didn't'. Same set of tests, exactly the same result and conclusion (null test), yet with completely different 'spin' (if you will) on it.

It seems you are setting it up that there is a right or wrong answer, is they SHOULD have heard it or such? That, I thought, would be the essential question here, CAN they hear it or not, yes or no.

They also tend to second guess themselves because they are being asked to give a Yes/No answer. I believe both of these factors push the statistics to be non-conclusive. And since non-conclusive is taken as "statistically can't tell the difference" I theorize that we tend to opt for negative findings in this situation.

HOW do you know they tend to second guess themselves?? Maybe they do or do not, it just reads as a statement of fact (which it could be for all I know).

IF they reported a sense of second guessing, well to me that is as validly explained by saying the two stimuli were so close together great difficulty was had. No need to decide anything more than that.

I further am interested in how we quantify the region. One of the ways to do that in my opinion is to scrutinize the test itself. In compression testing, we have a set of audio tests we know to be revealing.

How do we know they are revealing if not by blind tests? IS that how 'we know'??

The content can be shown mathematically to be revealing of differences.

I don't quite understand 'mathematically' here, unless you mean things like 'only 10 db down' or stuff like that. Ie we are using JND's?? (which I further assume is done by large scale blind testing?? maybe not in the same sense as we use it for audiophile stuff, just 'unknown stimuli given to listeners to see what *we* can hear or not hear')

I just get the idea that you are arguing from 'they SHOULD have heard it yet didn't' and are using that to hang everything off. Was not there some thing about a swedish radio codec that is used to show why DBTs cannot find the small differences? Is that the type of example in mind here?

andy_c · May 15, 2011

amirm said:
Even though we focus on two outcomes: hearing a difference and not, I am saying there is another which is the transition point.

Actually, it's way more complex than that. For a given trial of a multi-trial test, it is not just a situation of audible/inaudible, or a three-state situation as you postulate. To be as specific as possible, let's talk ABX, where at each trial, one must choose whether X is A or B. For an inaudible difference, the probability of correctly identifying X is 0.5 (same as correctly guessing the state of a flipped coin). For an audible difference, the probability of correctly identifying X is greater than 0.5, and less than or equal to 1. As this probability goes from 0.5 to 1, one could say the effect is more and more audible, and this probability is a continuum within that range.

Let's assume we are clairvoyant, such that we actually know what this probability of correctly identifying X is, and we do a test of N trials, for which the test subject gets M out of N correct. According to some criterion, we reach one of two conclusions:

1) The difference is audible
2) The difference is inaudible

What we'd like to know is:

A) The probability that the test reaches the conclusion that the effect is audible when it's really inaudible.
B) The probability that the test reaches the conclusion that the effect is inaudible when it really is audible.

Now, for the fairest possible test, what should the relationship between the probabilities expressed in A and B above be? That's what Leventhal asks, and derives results that approximate the desired situation.

It seems to me that these questions are the crux of this thread - or at least should be, as the previous statements of its purpose seem somewhat confused to me.

amirm · May 15, 2011

terryj said:
HOW do you know they tend to second guess themselves??

'Cause I have done it! And many times!

Let me expand. When I do a test and I hear a small difference I run a test. I pretend there is no difference and listen again. By doing so, I am able to erase the difference. And then I put aside that intent and hear the difference again. I know for sure then at least for me, I am able to convince myself either way when the difference is subtle. I have had the experience in formal tests and informal ones I have run myself.

Maybe they do or do not, it just reads as a statement of fact (which it could be for all I know).

Again, the experience is real for me. Indeed, this is the motivation for me to probe here. I want to understand how the above factor affects testing results. Note that this is not "guessing." I am not flipping a coin because I just can't tell. I hear a difference and then wonder if I shouldn't have. I then listen again and lo and behold, there is no difference.

How do we know they are revealing if not by blind tests? IS that how 'we know'??

I don't know. It is a quandary which I was hoping we could figure out at least partially by discussing it. I used the analogy before of who scientists find out what happened thousands of years ago in the galaxy. They find secondary evidence to get data, solving the problem of us not having been there when the event occurred. No idea of this is a good analogy or not but is an example of solving impossible problems

.

I don't quite understand 'mathematically' here, unless you mean things like 'only 10 db down' or stuff like that.

I mean essentially that. Let's say I can show that distortion exists at 50 db down but only for 100 milliseconds of a transient. We can here distortion that high but not necessarily so when it lasts so little. This is a small, but measurable level.

I just get the idea that you are arguing from 'they SHOULD have heard it yet didn't' and are using that to hang everything off. Was not there some thing about a swedish radio codec that is used to show why DBTs cannot find the small differences? Is that the type of example in mind here?

I have argued against that test case really existing so don't want to go there

. My angle is not that there is some rare listener who would have heard the issue as was the case above. I am saying that whoever we have picked may be operating as I have been many times, in the area of uncertainty.

terryj · May 15, 2011

(while I remember, looking at your time of the post, you may have missed a better response from andy c above)

Gotcha on the second guessing, not a finding as such but personal experience. It could be possible you are simply 'at that end of the spectrum'?? Ie, entirely possible that few would be as 'analytical' or self examining as you??

Anyway, re the second guessing (did I really hear it? pretend it is not there and see what happens) I have to be honest and say you are starting to be even less convincing for me now.

A personal example of what I think is a similar psychological phenomenon, I might sit you down and play you the latest song which has absolutely blown me away. I mean MAN, I LOVE it, have not stopped listening to it since I came across it ok?

When you are in the hotseat, I look out of the corner of my eye and for whatever reason, I begin to have the slightest thought that you may not like it as much as me. I find that quickly grows, to the extent that I am uncomfortable 'forcing' this sound on you and it does not sound good to me even!!!

Others have said something like 'that happens with movies too' etc etc, I am sure we all have our own examples.

I see that as being entirely similar to your example, in other words it is *trivially* easy to completely change the way we perceive things. Ok, maybe in mine it took an external stimulus (playing the track for someone), but I don't see that the perception changed as proof of much at all.

amirm · May 15, 2011

We can only share personal experiences so yes, that is mine.

Here is another situation. In a third-party run blind test of video codecs, our video codec actually garnered a higher score than the hidden reference in one of the tests. That should be impossible right? The test was a butterfly test where the video was split with the one on the left always the original and the one on the right, the sample under test or the original itself. People were to score from 1 to N, how close it was to the original. Somehow, we managed to get higher scores than the hidden reference.

If you dig into why, we realize that it was possible that our sample had filtered out some of the noise so it garnered higher preference since it actually made the outcome better than what we started with perceptually. That revelation came from analyzing said results. That is what I am trying to do here. Looking underneath the test results and seeing if there could some aberrations. Clearly the blind test results above were inaccurate in claiming a degraded sample was more real than the original sample right next to it. Humans scored and scored wrong.

It is not often that we have outcomes like above where we know the results have to be wrong. When we don't get lucky that way, how can we still determine the levels of accuracy in the test?

Vincent Kars · May 15, 2011

Humans scored and scored wrong.

They didn’t. They never do. Don't blame them for the flaws in your experimental design.
You asked for preference.

Likewise there is a test floating around at Hydrogenaudio.
Two identical pieces of music, which one do you prefer sound quality wise?
Most prefer “B”, indeed a low bit rate MP3 of the original.
As the original recording is very sharp the high rolloff of the MP3 is preferred.

amirm · May 15, 2011

Vincent Kars said:
They didn’t. They never do. Don't blame them for the flaws in your experimental design.
You asked for preference.

First, this was a test conducted by DVD Forum for selection of video codecs for high-definition discs. That test led to adoption of VC-1 in both HD DVD and Blu-ray. So it was not our test.

Second, the scoring was to be relative to the butterfly image to the left. They were not asked to vote for preference. They were asked how close that one on the right came to the one to the left. When we were on the right, we were voted to be closer to the original than the reference itself.

Note that the hidden reference did not score perfectly. It never does in these tests. It always rates lower than perfect because some people think they are being tricked and vote it down as not being as good as the original. So our high watermark is always this value, not a perfect high score.

Likewise there is a test floating around at Hydrogenaudio.
Two identical pieces of music, which one do you prefer sound quality wise?
Most prefer “B”, indeed a low bit rate MP3 of the original.
As the original recording is very sharp the high rolloff of the MP3 is preferred.

This was a three-way test, not two-way. The original was always on screen on the left. The right was either the compressed sample or the original. They did not play two degraded samples and ask which one was better.

sasully · May 16, 2011

andy_c said:
This is called special pleading.

....as I noted way back in one of my first posts on this thread ;>

Amirm, might I suggest you simply contact your former co-worker JJ and ask him where the evidence for JND and the efficacy of DBTs comes from? I'm kind of shocked that much of this seems new to you.

Also, the status quo for determining audio quality hasn't been demonstrated to be a 'better way' than one based more primarily on DBTs. In fact you might want to ask Sean Olive about that...I hear he has his own forum now ;>

sasully · May 16, 2011

andy_c said:
It is evidence of measurable differences, but I assume we're talking audible differences. One could say the measurements are evidence of potential audible differences though.

This thread is a bit confusing to me because of its title, "Do blind tests really prove small differences don't exist?". Of course in general it's impossible to prove a negative as it were, so the answer "no" is almost a given. Actually, the title is a nonsequitur if one examines it critically, because the idea that one could "prove" that something which really does exist does not is nonsense. But it seems to me the intent of the discussion is to determine whether blind tests mask small audible differences. In other words, that a blind test might reach the conclusion that something is not audible when it actually is.

This issue has been thoroughly covered by Leventhal. The only free resource to his work that I know of is in this group of Stereophile Letters to the Editor. He did some peer-reviewed stuff for the AES that covers the issue in excruciating detail, as well as being very fair, but one must unfortunately purchase the articles from the AES.

Leventhal's work ikn this area is mainly a reminder that Type 2 errors exist (not just Type 1), and should be factored into the statistics, and that statistical power of a test is important too.

Neither of these suggest that DBT is intrinsically incapable of 'detecting' small difference. It's really just a call to do them right -- and not to let conclusions exceed the message of the data.

Stereoeditor · May 16, 2011

sasully said:
Leventhal's work ikn this area is mainly a reminder that Type 2 errors exist (not just Type 1), and should be factored into the statistics, and that statistical power of a test is important too.

Neither of these suggest that DBT is intrinsically incapable of 'detecting' small difference. It's really just a call to do them right -- and not to let conclusions exceed the message of the data.

This was the point I was making in the comments of mine that were referenced at the start of this thread: that when the differences are small, and particularly when such differences will not be audible with all program material, you need to run a large number of trials in order to be able to bring the power of statistical analysis to bear.

John Atkinson
Editor, Stereophile

amirm · May 16, 2011

sasully said:
Amirm, might I suggest you simply contact your former co-worker JJ and ask him where the evidence for JND and the efficacy of DBTs comes from? I'm kind of shocked that much of this seems new to you.

Well you are not "shocked." You are making an accusation regarding my knowledge level. Please leave sarcasm and debating language like this for other forums. If you have a reference we call read on how it relates to the theory at hand, then provide it. If it is only in JJ's head, then it will have to remain that way as it is clear then that this is news to the entire planet outside of JJ.

Also, the status quo for determining audio quality hasn't been demonstrated to be a 'better way' than one based more primarily on DBTs. In fact you might want to ask Sean Olive about that...I hear he has his own forum now ;>

Steven, please cut out the sarcasm. It is not how we run WBF. You have some data to share, please share it.

What I have put forth here is a theory for a discussion. It is not put forth as fact, or proof. If you can't get defensive and discuss it without these negative remarks, let's not discuss it at all. I created this thread to give you an opportunity to extend the discussion in this area and all you are doing here is puring cold water on it.

Do blind tests really prove small differences don't exist?

WBF Founding Member

Banned

Well-Known Member

WBF Founding Member

WBF Founding Member

WBF Founding Member

Well-Known Member

Banned

Banned

New Member

Well-Known Member

Banned

New Member

Banned

WBF Technical Expert: Computer Audio

Banned

New Member

New Member

Member

Banned

Similar threads