Do blind tests really prove small differences don't exist?

When listening for distortions, I find it best to listen to music I dislike -- polkas, new age noodling, modern country. If I listen to good music, even pretty poorly recorded good music, I get distracted far too quickly.

Tim
 
When listening for distortions, I find it best to listen to music I dislike -- polkas, new age noodling, modern country. If I listen to good music, even pretty poorly recorded good music, I get distracted far too quickly.

Tim
Second problem is that you will get tired of your favorite music! Long time ago, I was building a darkroom to process my own pictures. I needed a timer to tell me to change chemicals every few seconds. Instead of getting an expensive programmable timer, someone suggested using music and inserting your voice at the right times to move to the next step. Worked like a charm on a $10 cassette deck. Secondary advice given? Don't use your favorite music. Indeed, by the 10th print, I never wanted to hear that music ever again.

Isn't it fascinating how that works? I get why we grow sick of eating the same thing over and over again; it's the body's way of making sure we eat a variety. But why does it happen with what we hear? What is the evolutionary or genetic cause for that?
 
Yes, but when it's analog, it's musical. ;)

Tim

Yes, you have to have a little perspective on this sort of thing. Whole schools of mixed science and fantasy were developed to *explain* why new technology such as SS and digital sounded worse than traditional analog, even though it measured far better on the bench. A great deal of the supposed difference was found to disappear when people couldn't see what they were listening to. The portion of the problem that had actual science behind it existed because analog's performance was often poor enough to audibly degrade the sound. For example, no analog tape machine has ever been found to be undetectable in an ABX test, but a great deal of digital gear is.
 
Timing distortion is FM distortion whether you call it jitter or flutter.

Just to be clear though, Arny.
Paul Miller differentiates between jitter and wow/flutter; the measured values do not lead to the same conclusions in terms of jitter, its technical context, and its audibility.
However I appreciate this does not mean Paul Miller is correct :)
Still, he is pretty much what I would deem an expert on this subject, given his skill and experience in developing very technical test and measurement tools while also engaging with some exceptional academics.

Cheers

Orb
 
orb said:
Just to be clear though, Arny.
Paul Miller differentiates between jitter and wow/flutter; the measured values do not lead to the same conclusions in terms of jitter, its technical context, and its audibility.
However I appreciate this does not mean Paul Miller is correct :)
Still, he is pretty much what I would deem an expert on this subject, given his skill and experience in developing very technical test and measurement tools while also engaging with some exceptional academics.

I'm fully aware of Paul Miller's biases in this matter. As soon as he provides a justification for his conclusions that are supported by reliable listening tests, I'll stop characterizing him as someone who seems to be chasing numbers for the sake of numbers. ;-)

The audibility of FM distortion depends on its amplitude and on the modulating frequency. The traditional weighting filter for evaluating the audibility of FM distortion is described by the standards IEC 386, DIN 45507, BS 4847, and CCIR 409-3, and is shown below:

[Figure: wow-and-flutter weighting-filter response curve (Lindos9_Flutter.svg)]
If memory serves, a common kind of FM distortion that people are concerned about these days is related to HDMI data blocking and buffering, and it is centered around 100 Hz. The filter characteristic above shows that considerable attention has been paid to this frequency range, so the concern could very well be valid.

The filter shown above is also the one that is generally recommended for measuring wow and flutter in analog tape equipment. It is well known that analog tape recorders were subject to a form of FM distortion called scrape flutter which could easily reach into the same frequency range as HDMI FM distortion.

Combine this information with the fact that analog tape machine wow and flutter is a mere million times larger, and we are hard put to dismiss the connection or severity of analog tape machine FM distortion completely out of hand.
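
As a very rough illustration of how FM deviation and modulating frequency translate into a measurable flutter figure, here is a minimal sketch; it is a toy example of my own, not Miller's or anyone else's published method. It synthesizes a 3150 Hz test tone with 0.1 % peak FM at a 4 Hz rate (both values are assumptions chosen for illustration) and recovers the deviation from the instantaneous frequency; a real meter would then apply the weighting curve shown above before displaying a number.

```python
# Minimal sketch: synthesize a flutter-modulated test tone and recover the
# peak frequency deviation. All parameter values are illustrative assumptions.
import numpy as np
from scipy.signal import hilbert

fs = 48_000                          # sample rate, Hz
t = np.arange(0, 2.0, 1 / fs)        # two seconds of signal
fc, fm, dev = 3150.0, 4.0, 0.001     # carrier, modulation rate, 0.1 % peak deviation

# Instantaneous frequency f(t) = fc * (1 + dev * sin(2*pi*fm*t)); phase is its integral.
phase = 2 * np.pi * fc * t - (fc * dev / fm) * np.cos(2 * np.pi * fm * t)
x = np.sin(phase)

# Recover the instantaneous frequency from the analytic signal.
inst_phase = np.unwrap(np.angle(hilbert(x)))
inst_freq = np.diff(inst_phase) * fs / (2 * np.pi)
flutter = (inst_freq - fc) / fc      # fractional frequency deviation

core = flutter[fs // 4 : -fs // 4]   # discard Hilbert-transform edge effects
print("recovered peak flutter: %.3f %%" % (100 * np.abs(core).max()))  # ~0.1 %
```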
 
I am asking this not to claim ABX is flawed; I am more curious whether any of those who have done ABX tests have also done further blind tests that are subtly different.

Have any of those who have done these tests (whether running or participating in them) also done what I would call A/X comparing, meaning that there is always only one constant (A), while X is what can change and could be A or something different?
So in our audio example A could be a Krell SA50 and B a Crown amp; over the 12+ trials a listener would need to decide whether X matches A or is different.

This has the benefit, from a cognitive perspective, of removing the two constants that occur with ABX, and it simplifies the process while still producing comparable results.
I mention the following very loosely, as it is not usually applied to this type of ABX scenario, but my only reservations about ABX come down to the cognitive uncertainty heuristic and anchoring bias - I want to stress this has never been proven, or even discussed constructively, in the context of ABX or even audio.
Both would need to be removed from any blind testing, and this is a very long shot (hence my use of the term "very loosely"), but it is interesting that we are talking about small differences, where uncertainty will come into effect, and we also have the slight possibility of anchoring reinforcing nulls because there is more than one constant.
Of course it may be argued that any subtle comparison such as these blind audio tests can result in anchoring even if it is not ABX.
But then all contributing factors should be weighed when analysing the statistics and behaviour of the test.

Hence my interest in whether anyone has also backed up their ABX results with A/X (a rough sketch of the protocol I mean follows below).
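
For what it is worth, here is a rough sketch in code of the A/X (same-or-different) protocol I am describing, purely to make the structure concrete; the 16-trial count and the function names are illustrative assumptions, not taken from any existing comparator.

```python
# Rough sketch of an A/X ("same or different") session. Illustrative only.
import random

def ax_key(trials=16, rng=None):
    """Hidden key for one session: True means X is actually device A on that trial."""
    rng = rng or random.Random()
    return [rng.random() < 0.5 for _ in range(trials)]

def ax_score(key, answers):
    """answers[i] is the listener's call: True for 'X is A', False for 'X is different'."""
    return sum(k == a for k, a in zip(key, answers))

# A purely guessing listener lands near trials/2 on average; a real score would
# then be evaluated with the same binomial statistics used for ABX.
key = ax_key()
guesses = [random.random() < 0.5 for _ in key]
print(ax_score(key, guesses), "of", len(key), "correct")
```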

I am with JA in that IMO it is incredibly difficult to remove all the factors you do not want when it comes to blind testing - this is covered in the discussion between JA and Arny on the ABX debate, so if you want to hear more, with both sides presenting useful information, it is worth listening to.

Personally I would like to see more ABX tests done using a rather complex hardware/software setup that records and analyses the responses of the listener: how many times they use A and B and for how long, how many times they switch, the length of time taken for each decision, etc.
I cannot find the paper, and IMO it is not complete, but they identified a subtle A/B order bias when doing a sound-perception study using trained/professional staff.
They only identified this because of what I mention above: a complex hardware/software setup that could fully analyse the behaviour of the participant.
I want to stress that the A/B order bias did not invalidate their results, as they could weigh this factor in the analysis; also, while they identified this bias behaviour, they state that the mechanisms involved are unknown and further study (new tests, models, etc.) would be required.
Also, the effects in their testing were slight, so I would not say this invalidates ABX, but it is an example of how complex blind testing can be, as stated many times by JA.

Apologies for the lengthy post. If responding, please split between my question on A/X and the rest I go on about, which I have not had time to proof-read (I am sure some will have a lot to say :) )

Thanks and have a good weekend all :)
Orb
 
I'm fully aware of Paul Miller's biases in this matter. As soon as he provides a justification for his conclusions that are supported by reliable listening tests, I'll stop characterizing him as someone who seems to be chasing numbers for the sake of numbers. ;-)
......

Arny, is there any reason why you feel the need to denigrate someone who, IMO and probably in the opinion of many others, understands this subject much better than any of us?
I doubt you will find he needs to "justify" anything to you or me, and I seriously doubt he even cares, but as I mentioned he does work with some exceptional academics as well.

Anyway, it is fair to say that you both disagree on comparing wow/flutter and jitter - said without denigrating you :)
Cheers
Orb
 
Have any of those who have done these tests (whether running or participating in them) also done what I would call A/X comparing, meaning that there is always only one constant (A), while X is what can change and could be A or something different?

So in our audio example A could be a Krell SA50 and B a Crown amp; over the 12+ trials a listener would need to decide whether X matches A or is different.

AFAIK nobody has done this kind of test. It is basically a test in which one listens to X's that are amplifier B about half the time, but has no way to compare the X's to B. It's a forced-choice test in which one of the choices can be knowingly investigated but the other cannot. In short, it could be a very frustrating test.

This has the benefit, from a cognitive perspective, of removing the two constants that occur with ABX, and it simplifies the process while still producing comparable results.

I can imagine doing this kind of test and being frustrated to the point of quitting.

I am with JA in that IMO it is incredibly difficult to remove all the factors you do not want when it comes to blind testing - this is covered in the discussion between JA and Arny on the ABX debate, so if you want to hear more, with both sides presenting useful information, it is worth listening to.

The irony of JA complaining about uncontrolled factors in blind tests seems pretty extreme given his apparent strong preference for listening tests where well known strong influencing factors such as sight are intentionally not controlled. ;-)

Personally I would like to see more ABX tests done using a rather complex hardware/software setup that records and analyses the responses of the listener: how many times they use A and B and for how long, how many times they switch, the length of time taken for each decision, etc.

Be my guest. Since most ABX comparators are implemented in software, collecting this kind of information requires only trivial changes to the software. AFAIK there are ABX comparators that are open source (in Java, for example).
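
To make that concrete, here is a hypothetical sketch of the kind of per-trial logging being asked for; none of it is taken from any existing comparator's source code, and the class and field names are invented purely for illustration.

```python
# Hypothetical per-trial behaviour log for an ABX comparator. Illustrative only.
import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TrialLog:
    started: float = field(default_factory=time.monotonic)
    selections: List[Tuple[float, str]] = field(default_factory=list)  # (seconds into trial, "A"/"B"/"X")
    answer: Optional[str] = None
    decision_time: Optional[float] = None

    def select_source(self, source: str) -> None:
        """Call whenever the listener switches to A, B, or X."""
        self.selections.append((time.monotonic() - self.started, source))

    def record_answer(self, answer: str) -> None:
        """Call when the listener commits to 'X is A' or 'X is B'."""
        self.answer = answer
        self.decision_time = time.monotonic() - self.started

    def summary(self) -> dict:
        counts = {s: sum(1 for _, src in self.selections if src == s) for s in "ABX"}
        return {"selections_per_source": counts,
                "total_switches": len(self.selections),
                "decision_time_s": self.decision_time,
                "answer": self.answer}
```

A comparator would create one such log per trial and write the summaries out alongside the scores; the behavioural analysis Orb wants could then be done offline.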

I cannot find the paper, and IMO it is not complete, but they identified a subtle A/B order bias when doing a sound-perception study using trained/professional staff.

Order bias is well known to us. Since the order of presentation in ABX is naturally randomized, we don't worry about it too much.
 
Arny, is there any reason why you feel the need to denigrate someone who, IMO and probably in the opinion of many others, understands this subject much better than any of us?

Given that biases are characteristic of all human beings, I am amazed that pointing out someone's biases is thought of as being denigration.

I doubt you will find he needs to "justify" anything to you or me, and I seriously doubt he even cares, but as I mentioned he does work with some exceptional academics as well.

You're making this unnecessarily personal.

Well, this apparently hearty defense pretty well explains your biases! ;-)

I think we all need to justify our findings to science, and IMO none of us are immune. Academics have no known monopoly on the ability to think or experiment. My abilities to do both improved impressively after I left the academy, but that was just a matter of normal personal development. ;-) There are academics whose published work may effectively criticize Miller's opinions, and non-academic works that do as well.


Anyway, it is fair to say that you both disagree on comparing wow/flutter and jitter - said without denigrating you :)

I really don't know whether Miller and I agree or disagree. The means by which he comes up with his numbers and opinions about jitter have escaped my ability to obtain and study them. It is pretty easy to come up with subjective experiments that seem to disagree with his published results, if his published results represent anything but numbers.
 
Orb said:
I am with JA in that IMO it is incredibly difficult to remove all the factors you do not want when it comes to blind testing - this is covered in the discussion between JA and Arny on the ABX debate, so if you want to hear more, with both sides presenting useful information, it is worth listening to.

The irony of JA complaining about uncontrolled factors in blind tests seems pretty extreme given his apparent strong preference for listening tests where well known strong influencing factors such as sight are intentionally not controlled. ;-)

Ironic, maybe. My position is that to perform an uncontrolled, unvalidated blind test is pointless. Many such tests have been published by others; all produce null results, but there is no indication whether this was due to there not being an audible difference or to the poor design or implementation of the test, thus producing a false negative.

Yes, Stereophile magazine's reviewers practice sighted evaluations, and the danger is that this can produce false positives. But as that is self-evident, and as we encourage our readers always to test our review findings for themselves, I think the situation acceptable and preferable to the apparently endless parade of possible false negatives that the Hydrogen Audio crowd use to bolster their own belief systems.

John Atkinson
Editor, Stereophile
 
JA

I do understand and respect your position. It remains, however, that the bias introduced by sight or knowledge of the equipment is overwhelming. The level of false positives is high, way too high. Finding a better solution is needed, rather than clinging to a model that, as you have said, produces so many false positives. While many are satisfied by the sheer entertainment value provided by this model, many audiophiles, and IMO an ever-increasing number, are yearning for a stricter method. Introducing some ways to diminish the biases would go a long way toward making equipment reviews more meaningful.
 
(...) My position is that to perform an uncontrolled, unvalidated blind test is pointless. Many such tests have been published by others; all produce null results, but there is no indication whether this was due to there not being an audible difference or to the poor design or implementation of the test, thus producing a false negative. (...)

John Atkinson
Editor, Stereophile

I have presented a similar position several times in this forum, and I believe we are still a long way from someone presenting reliable, valid blind tests on the audibility of small differences. Plenty of resources are needed to carry out these tests properly, and no one wants to spend a fortune just to please a few forum members. :(
 
Many such tests have been published by others; all produce null results, but there is no indication whether this was due to there not being an audible difference or to the poor design or implementation of the test, thus producing a false negative.


Trouble is, you only think it is a false negative because you have, a priori, decided that there should be a positive?

Reverse the historical track: in the parallel universe I'm talking about, ALL have compared audio gear level-matched and blinded. We are all used to finding that there are only very small differences between some components.

All of a sudden, along comes the idea that we move away from that model and go to uncontrolled sighted tests. In THAT world the raised eyebrows would be directed at HUGE differences with magic beads.

IOW we always tend to reject the new, strange, 'unusual'.

Which, IIRC, is the essential question asked by amir in the original post. HOW can we know that these negatives are false? How can it be proved?

Yes, Stereophile magazine's reviewers practice sighted evaluations, and the danger is that this can produce false positives. But as that is self-evident, and as we encourage our readers always to test our review findings for themselves,

I get a tad annoyed whenever I see this line....

I'd be more ready to accept the 'honest intentions' that are pushed here IF I saw something like 'It is a well-known fact that we can be swayed by biases when auditioning audio gear. Hence, we urge you, when evaluating what we have said here, to ensure these factors are controlled for,' yada yada.

Oh no, the industry idea of 'checking for themselves' is to carry on as usual, using procedures many times worse than the intellectual objections to blind tests you have raised above.

Ha, the irony is that this method may give a result opposite to the review's, i.e. the reader did NOT like the recommended cable.

No matter, because the purpose has nonetheless been achieved: the 'fact' of cable differences and sound (for example), the need for aftermarket cables, and the validation of the entire flawed process.

The entire creaky structure is continually refreshed and renewed.

So OK, have your doubts about blind tests (maybe run some tests and studies that overcome these deficiencies rather than moaning about them), but please spare me the hypocrisy of 'we advise them to check our sighted evaluations for themselves via their own sighted evaluations' as some sort of proof or method of overcoming the shortcomings of sighted reviews in magazines.

"DBTs are flawed, as are sighted evaluations. We overcome the flaws of sighted evaluations with more sighted evaluations, and there are no cures for DBTs' flaws" does not inspire me, I am afraid.

John, any thoughts on how we can change the tests to help remove the types of errors being discussed?
 
Stereoeditor said:
Many such tests have been published by others; all produce null results, but there is no indication whether this was due to there not being an audible difference or to the poor design or implementation of the test, thus producing a false negative.
Trouble is, you only think it is a false negative because you have, a priori, decided that there should be a positive?

You are correct, because when you examine those tests, there were measurable differences between the devices under test that other research shows should have been audible. For example, in the January 1987 Stereo Review amplifier tests, one of the amplifiers had such a high output impedance that, with the speaker being used, there would have been audible response differences. As the test produced a null result, there must have been something amiss with the experimental design. Yet that test was widely proclaimed at the time as "proving" that the amplifiers sounded identical.
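
To illustrate the mechanism (with made-up numbers, not the actual 1987 measurements), a high source impedance forms a voltage divider with the loudspeaker's impedance curve, so the frequency response delivered to the speaker tracks that curve:

```python
# Hedged illustration: response variation caused by a high amplifier output
# impedance driving a loudspeaker whose impedance varies with frequency.
# The 2-ohm source and the speaker-impedance values are assumptions.
import numpy as np

z_out = 2.0                                                   # assumed output impedance, ohms
freqs = np.array([20, 60, 200, 1000, 3000, 10000, 20000])     # Hz
z_spk = np.array([24.0, 6.0, 5.0, 9.0, 4.0, 12.0, 30.0])      # hypothetical speaker |Z|, ohms

# Level delivered to the load relative to an ideal (zero-ohm) source, in dB.
drop_db = 20 * np.log10(z_spk / (z_spk + z_out))
for f, d in zip(freqs, drop_db):
    print(f"{f:>6} Hz: {d:6.2f} dB")
print("peak-to-peak variation: %.2f dB" % (drop_db.max() - drop_db.min()))   # ~3 dB with these values
```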

Similarly, since then there has been an endless stream of tests "proving" the same thing. As far as I can tell, all used too few trials, had interfering variables, or had other problems. This is bad science used by people with an agenda. As I wrote many years ago when referring to such bad science, having taken part as a listener in tests organized by some very public proponents of blind tests, "the blind test is the last refuge of the agenda-driven scoundrel."

any thoughts on how we can change the tests to help remove the types of errors being discussed?

As I wrote, you need to use a large number of trials in tests where the variables are reduced to just the one you are concerned about. It is difficult and time- and resource-consuming. For example, I visited B&O's research labs in Denmark last spring. They do a lot of blind testing, using their core of trained listeners. One test on one variable can take weeks and hundreds of trials, but that is what it takes before you can apply statistical analysis. How could that rigor be applied to the needs of a monthly review magazine?
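
The statistics themselves are simple; it is accumulating enough trials that is expensive. As a sketch (the 60 % listener and the 0.05 criterion are assumptions for illustration, not B&O's figures), the exact binomial test and a crude power estimate look like this:

```python
# Exact binomial test for an ABX/blind-test score, plus a crude estimate of how
# many trials it takes to reliably detect a listener who is right 60 % of the
# time. The 60 % and 0.05 figures are illustrative assumptions.
from math import comb

def p_value(correct, trials):
    """One-sided probability of scoring at least `correct` by guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(p_value(12, 16))   # 12/16 correct -> ~0.038, conventionally "significant"
print(p_value(9, 12))    # 9/12 correct  -> ~0.073, not quite

def power(trials, true_p=0.6, alpha=0.05):
    """Chance that a listener with hit rate true_p reaches significance in `trials` trials."""
    crit = min(k for k in range(trials + 1) if p_value(k, trials) <= alpha)
    return sum(comb(trials, k) * true_p**k * (1 - true_p)**(trials - k)
               for k in range(crit, trials + 1))

# On the order of 150 trials are needed before such a listener passes 80 % of the time.
print(next(n for n in range(10, 400) if power(n) >= 0.8))
```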

So I accept the possibility of false positives and expect my readers to act as the intelligent, thoughtful people they are when assigning value to our review findings.

John Atkinson
Editor, Stereophile
 
The irony of JA complaining about uncontrolled factors in blind tests seems pretty extreme given his apparent strong preference for listening tests where well known strong influencing factors such as sight are intentionally not controlled. ;-)

An objective, statistically valid AB/X test isn't easy and requires a lot of subjects, samples, and time. For every one of them there are probably hundreds of casual ABX tests going on in people's homes that really don't indicate much more than what one listener heard in his own system. But such a test does indicate what he actually heard, not what he saw and therefore expected, so it's still way the hell ahead of sighted testing.

Tim
 
I don't understand the angst over Stereophile reviews. I jump directly to the measurements section and learn a ton about the equipment under test. And all of that data is *objective*. You don't have to read the subjective analysis; I glance at it, and there I usually find more objective data, such as the design and history of the equipment.
 
Ironic, maybe. My position is that to perform an uncontrolled, unvalidated blind test is pointless.

Things are getting even more ironic. John, if we could only get you to admit that performing any uncontrolled, unvalidated test is pointless.

But on the positive side, I agree that performing an uncontrolled, unvalidated blind test is as pointless as it is impossible. As soon as you admit that the test is blind, you've admitted that it is controlled. An uncontrolled blind test *is* an oxymoron; all blind tests are by definition controlled. Perhaps a pedantic point, but a good thing for an audio editor to know! ;-)

To be utterly accurate John, you need to say what I will say right now and mean it with all my heart:

Performing inadequately controlled, inadequately validated tests of any kind, blind or sighted, is a gigantic waste of time, and probably both self-deceptive and deceptive to others.

Say it with me, John if you are that brave and interested in finding the truth!

Many such tests have been published by others; all produce null results,

This is misleading because it makes the usual biased audiophile mistake of excoriating blind tests for faults that belong to almost all of the subjective evaluations that are published, posted, or shared in innumerable other ways. Those evaluations are often dominated by the listener's prior state of mind, not the state of his immediate sensations or the performance of the equipment.

Inadequately controlled tests are published by all of the high end ragazines and shared by thousands of audiophiles and audio professionals much of the time, and they often produce random results that are generally disconnected from the equipment they purport to evaluate. The results that are published are often as useless as null results because they are meaningless.


but there is no indication whether this was due to there not being an audible difference or to the poor design or implementation of the test, thus producing a false negative.

John, given your actual day-to-day position on publishing poorly designed tests, which is to do it early and often, this is ironic indeed. ;-)

Yes, Stereophile magazine's reviewers practice sighted evaluations, and the danger is that this can produce false positives.

That is a masterful understatement, John. Saying that sighted evaluations can produce false positives is like saying that ramming your car into a 10 foot thick reinforced concrete wall at 90 mph can produce damage to the automobile! l-)

There is only one way that a sighted evaluation can overcome its inherent bias, and that is for it to involve audible differences of such a monumental nature that they can overcome sighted bias a statistically significant percentage of the time. Such differences take us out of the world of useless tweaks and the differences among good amplifiers and DACs, and find us twirling the knobs of a parametric equalizer, or picking microphones or microphone positions.
 
I don't understand the angst over Stereophile reviews.

Why the angst? It depends on whether you believe such sloppy reporting from authority is just causing the occasional false positive, or systematically degrading the objectives of the industry and the hobby over time. Why Stereophile? Because they have earned the responsibility. When I see a loving tome to the resolution of an SET amp played through a pair of horns into an untreated glass-and-ceramic-tile room, I can laugh it off; I expect no less. I expect more from Stereophile. I expect them to check their expectations at the door and give me a bit of critical thinking with my criticism.

Tim
 