Conclusive "Proof" that higher resolution audio sounds different

jkeny

Industry Expert, Member Sponsor
Feb 9, 2012
3,374
42
383
Ireland
..."

Just a question about this stated expectation of 17.5 out of 140 if all 140 respondents guessed randomly - I make it 3/6 x 2/5 x 1/4 = 6/120, i.e. 6 correct trials out of 120, which is equivalent to 7 correct out of 140 trials. This is for one person doing the trials. What difference to the probability does it make if 140 different people do one trial each? Where does the 17.5 expectation come from - I know it's 140/8 but why 8? Amir?
....
Sorted this out - the test was done with 3 pairs of files - each pair contained one 24bit & one 16bit file. The listeners did 3 separate listening trials, one per pair, to identify the 24bit file in each trial. So the chance of identifying all 3 high-res files correctly is 1 in 8.
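A quick sketch of that arithmetic (assuming each of the 3 pairs is an independent 50/50 guess, per the test description above):

```python
# Each pair is a 50/50 guess, so picking the 24bit file in all three
# pairs by chance has probability (1/2)^3 = 1/8.
p_all_three = 0.5 ** 3   # 0.125

# Expected number of all-correct respondents among 140 random guessers
expected = 140 * p_all_three
print(expected)          # 17.5 -- hence the "140/8" figure
```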

I also have in the back of my mind that running the test over 140 different people is in some way statistically different than running the test 140 times with one person. There's some adjustment to the stats as a result but I can't remember what it is. Anybody know about this?
 

esldude

New Member
Sorted this out - the test was done with 3 pairs of files - each pair contained one 24bit & one 16bit file. The listeners did 3 separate listening trials, one per pair, to identify the 24bit file in each trial. So the chance of identifying all 3 high-res files correctly is 1 in 8.

I also have in the back of my mind that running the test over 140 different people is in some way statistically different than running the test 140 times with one person. There's some adjustment to the stats as a result but I can't remember what it is. Anybody know about this?

Don't believe there is any change whether 140 for one person or 140 people one time. Think of it this way. If you sit and flip a coin 140 times or have 140 people come and each flip it once, the result of heads/tails should be right near 50% either way.
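A minimal simulation of that point (assuming every trial is an independent fair coin flip, which is exactly the assumption questioned below):

```python
import random

random.seed(1)

# One person flipping a fair coin 140 times
one_person = sum(random.random() < 0.5 for _ in range(140))

# 140 people each flipping the same fair coin once
many_people = sum(random.random() < 0.5 for _ in range(140))

print(one_person, many_people)  # both counts hover around 70 (50%)
```

Under that independence assumption both counts follow the same Binomial(140, 0.5) distribution; the question raised below is whether the assumption actually holds for human listeners on different playback systems.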
 

jkeny

Industry Expert, Member Sponsor
Feb 9, 2012
3,374
42
383
Ireland
Don't believe there is any change whether 140 for one person or 140 people one time. Think of it this way. If you sit and flip a coin 140 times or have 140 people come and each flip it once, the result of heads/tails should be right near 50% either way.

Yes, I know that is what we intuitively conclude, but this is in the field of perceptual testing with multiple different playback environments, etc. When you think about this type of testing, where a number of different audio files are played through a playback system, there has to be a difference in the statistical treatment of results between one person doing the test 140 times on the same playback system & 140 different people doing it on 140 different playback systems, no?

I know that there is a mathematical adjustment needed but I can't remember where I read it - maybe Leventhal?
 

Kal Rubinson

Well-Known Member
May 4, 2010
2,360
697
1,700
NYC
www.stereophile.com

KBK

New Member
Jan 3, 2013
111
1
0
Don't believe there is any change whether 140 for one person or 140 people one time. Think of it this way. If you sit and flip a coin 140 times or have 140 people come and each flip it once, the result of heads/tails should be right near 50% either way.

"Should" being the operative word.

Stats are only as valid as the formulation of the question they ask. As numbers go, they are as usual: wholly ignorant. Thus, in the realm of logic, in the case of a human ruminating, as it should be -- and is only proper -- give no weight of intelligence to numbers. What is behind the thrust of the numbers is the entire plot. Numbers may not lie, but they are also as dumb as a bag of hammers, possessing neither directive nor wit of their own, except that of whoever creates and manipulates them. In most cases, numbers are directed and moved about by humans... and humans can and do fail. Often. Since numbers are extensions of humans in their manipulation and motion, one must look for the entire plethora of human failures within them, including the intent to deceive and manipulate others.

In the case of humans, we have learning curves, quality of equipment and many other variables, which is why 140 different people and 140 tests done by the same person do not equate.
 

amirm

Banned
Apr 2, 2010
15,813
37
0
Seattle, WA
In the case of humans, we have learning curves, quality of equipment and many other variables, which is why 140 different people and 140 tests done by the same person do not equate.
This is the reason you want to have each tester participate enough to get valid statistical results. That way, you can throw out the stats from people who don't have critical listening skills. Otherwise, the worst sin occurs, which is to combine the results of people who can't hear small impairments with those who can. This is called "Simpson's Paradox", where combining results generates reversed and invalid conclusions. It was at the heart of a discrimination suit against UC Berkeley, alleging that it rejected more female candidates than male. Department-by-department stats showed this was not the case: the combined stats were skewed by women applying more often to departments/majors that were hard to get into, so they accumulated more rejections. See https://en.wikipedia.org/wiki/Simpson's_paradox.

And this is a key component of statistics: you can't combine dissimilar data. All experiments must be identical in all regards or you can't look at the combined statistics. Sadly, tests like Meyer and Moran violated this very basic principle by letting experimenters use different setups, procedures, etc. (see http://www.madronadigital.com/Library/High Resolution Audio/Statistics of ABX Testing.html).
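A toy illustration of Simpson's Paradox (the numbers are invented for illustration, not the actual Berkeley data):

```python
# Hypothetical admissions data for two departments.
# Within each department women are admitted at an equal or better rate,
# yet the pooled rate for women is lower, because more women applied
# to the harder department.
depts = {
    # dept: (men_admitted, men_applied, women_admitted, women_applied)
    "easy": (80, 100, 18, 20),
    "hard": (20, 100, 36, 180),
}

for name, (ma, mn, wa, wn) in depts.items():
    print(f"{name}: men {ma/mn:.0%}, women {wa/wn:.0%}")
    # easy: men 80%, women 90%
    # hard: men 20%, women 20%

men_pooled = sum(v[0] for v in depts.values()) / sum(v[1] for v in depts.values())
women_pooled = sum(v[2] for v in depts.values()) / sum(v[3] for v in depts.values())
print(f"pooled: men {men_pooled:.0%}, women {women_pooled:.0%}")
# pooled: men 50%, women 27% -- the aggregate reverses the per-group picture
```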
 

KBK

New Member
Jan 3, 2013
111
1
0
Oddly enough, this makes that infamous image true, for once.

43.jpg

(I know little of this situation; it is just that the name made the image come to mind. No real nastiness intended.)

I am also reminded of James Randi's 'Challenge', which under any form of reasonable analysis was found to be wholly invalid, on all possible levels. From my reading of the rebuttals on its veracity, there is nary an item, drug, or medical trial of any kind that could have passed his criteria. Thus his usefulness to a given 'machine' and its associated machinations: a weapon against the masses - pointed, most explicitly, at notable numbers of the engineering masses - to hold them in their ignorance. Their ignorance of the understanding that numbers are only as valid as the minds behind the given numbers' intent and directive. Dogma cannot exist in a functional system that can move forward. Dogma is death, a self-limited spiral into nothing. Repetition is death, not life - it is a simple pattern, a commodity in the eyes of larger systems.
 

Tony Lauck

New Member
Aug 19, 2014
140
0
0
Oddly enough, this makes that infamous image true, for once.

View attachment 21516

(I know little of this situation; it is just that the name made the image come to mind. No real nastiness intended.)

I am also reminded of James Randi's 'Challenge', which under any form of reasonable analysis was found to be wholly invalid, on all possible levels. From my reading of the rebuttals on its veracity, there is nary an item, drug, or medical trial of any kind that could have passed his criteria. Thus his usefulness to a given 'machine' and its associated machinations: a weapon against the masses - pointed, most explicitly, at notable numbers of the engineering masses - to hold them in their ignorance. Their ignorance of the understanding that numbers are only as valid as the minds behind the given numbers' intent and directive. Dogma cannot exist in a functional system that can move forward. Dogma is death, a self-limited spiral into nothing. Repetition is death, not life - it is a simple pattern, a commodity in the eyes of larger systems.

I have a simple policy. I accept "statistically significant" studies that are consistent with my personal experience. I ignore "statistically significant" studies that are not, unless the positive results are so strong that they could not be the result of "file drawer" effects. This pretty much rules out all statistically significant results at the 5% level as irrelevant: data points to be put in the back of my head, but proof of nothing.

There is also the possibility of conscious bias (dishonest) and unconscious bias (ignorant) on the part of the experimenter. This is common in most amateur "scientific" experiments.

Amir's positive results strike me as statistically significant. I don't believe he could have stood the number of times his dog barked if he had done the thousands of tests needed to succeed by file-drawer methods. :)
 

jkeny

Industry Expert, Member Sponsor
Feb 9, 2012
3,374
42
383
Ireland

Thanks, Kal, that was what I was looking for - the Bonferroni correction - which is a way of dealing with the increase in false positives as a result of multiple testing.
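The correction itself is one line: test each p-value against the significance threshold divided by the number of comparisons. A minimal sketch (the per-test p-values here are made up):

```python
# Hypothetical p-values from four separate listening comparisons
p_values = [0.04, 0.20, 0.01, 0.33]
alpha = 0.05

# Bonferroni: compare each p-value against alpha / m, not alpha
m = len(p_values)
print(alpha / m)                               # 0.0125, the corrected threshold
print([p for p in p_values if p < alpha / m])  # only 0.01 survives
```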

Amir's point also applies - Simpson's Paradox arises when all results are combined, diluting any significant results & burying them within the combined stats. That was partly counteracted here, because it was pointed out that 20 people selected the 24bit sample in each of the 3 pairs of samples (B-A-A). There was also a comparable group of 21 that selected the 16bit sample in each pair (A-B-B). Both of these groups needed following up to determine how significant their results were.
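As a rough check on that follow-up: under pure guessing the number of all-correct respondents is Binomial(140, 1/8), so the chance of seeing 20 or more can be computed directly (a sketch using scipy; the answer comes out around 0.3, nowhere near significant):

```python
from scipy.stats import binom

n, p = 140, 1 / 8   # 140 respondents, 1-in-8 chance of all three correct

# P(X >= 20) under random guessing: survival function evaluated at 19
p_value = binom.sf(19, n, p)
print(p_value)      # roughly 0.3 -- 20 of 140 is consistent with luck
```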

Thanks all for the replies!
 

amirm

Banned
Apr 2, 2010
15,813
37
0
Seattle, WA
John, is this a different test than the one I passed earlier? http://www.whatsbestforum.com/showt...unds-different&p=279735&viewfull=1#post279735

Speaking of Archimago, he had put forward his own challenge of 16 vs 24 bit a while ago (keeping the sampling rate constant). I had downloaded his files but up to now, had forgotten to take a listen. This post prompted me to do that. On two of the clips I had no luck finding the difference in the couple of minutes I devoted to them. On the third one though, I managed to find the right segment quickly and tell them apart:

============

foo_abx 1.3.4 report
foobar2000 v1.3.2
2014/08/02 13:52:46

File A: C:\Users\Amir\Music\Archimago\24-bit Audio Test (Hi-Res 24-96, FLAC, 2014)\01 - Sample A - Bozza - La Voie Triomphale.flac
File B: C:\Users\Amir\Music\Archimago\24-bit Audio Test (Hi-Res 24-96, FLAC, 2014)\02 - Sample B - Bozza - La Voie Triomphale.flac

13:52:46 : Test started.
13:54:02 : 01/01 50.0%
13:54:11 : 01/02 75.0%
13:54:57 : 02/03 50.0%
13:55:08 : 03/04 31.3%
13:55:15 : 04/05 18.8%
13:55:24 : 05/06 10.9%
13:55:32 : 06/07 6.3%
13:55:38 : 07/08 3.5%
13:55:48 : 08/09 2.0%
13:56:02 : 09/10 1.1%
13:56:08 : 10/11 0.6%
13:56:28 : 11/12 0.3%
13:56:37 : 12/13 0.2%
13:56:49 : 13/14 0.1%
13:56:58 : 14/15 0.0%
13:57:05 : Test finished.

----------
Total: 14/15 (0.0%)
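(For reference, the percentages foo_abx prints are one-sided binomial p-values: the probability of scoring at least that many correct by guessing at 50%. A short sketch that reproduces the column above:)

```python
from scipy.stats import binom

# (correct, trials) pairs as logged above
scores = [(1, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8),
          (8, 9), (9, 10), (10, 11), (11, 12), (12, 13), (13, 14), (14, 15)]

for k, n in scores:
    # P(X >= k) when guessing: survival function at k - 1
    print(f"{k:02d}/{n:02d} {binom.sf(k - 1, n, 0.5):.1%}")
# prints 50.0%, 75.0%, 50.0%, 31.3%, ... matching the log
```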
 

jkeny

Industry Expert, Member Sponsor
Feb 9, 2012
3,374
42
383
Ireland

Yes, Amir - that's the one. Sorry I didn't see this post of yours before.
It indeed does illustrate your point - there were 20 who correctly identified all 3 24bit samples, but these were then diluted by the random guessers to the point that the overall result became a null result - no better than random guessing.
As you say in your post: ""scientific" statement about audibility of 16 vs 24, is wrong and we run foul of simpson's paradox".
On the other hand, if the objective is to see what a sample of the "audiophile community" can discern (without training), then it does answer that: no, they can't. But is this surprising? No pre-selection of participants was used to check for hearing loss or other defects which might diminish their ability to discern audio impairments of a known level.

The stated objectives were:
1. How "easy" was it for people to detect (or report) a difference?
2. How accurate were the respondents in detecting the 24-bit sample?

It's somewhat difficult to know what question is being addressed & therefore what the null hypothesis is.
 

jkeny

Industry Expert, Member Sponsor
Feb 9, 2012
3,374
42
383
Ireland
Deleted
 
This is the reason you want to have each tester participate enough to get valid statistical results. That way, you can throw out the stats from people who don't have critical listening skills. Otherwise, the worst sin occurs which is to combine the results of people who can't hear small impairments with those who do….

The best approach would be to perform two tests. The first with all individuals, and the purpose is to find those who can hear. The criterion should be at least p < 0.05 null rejection for those people, or even more stringent. But the statistics in the first test can't be counted toward the conclusion. It is only for screening purposes. You must plan that in advance and stick to it. Then, in the second test everyone's result must be counted. Every second test must be reported.

If you perform only one test and simply exclude people due to poor performance, that is cherry-picking, and you can't produce a meaningful result that way. Likewise if you do many second tests, and only report the good ones.

There might be some way to do only one test with some kind of post-hoc rejection. Generally such things are only permitted for a small number of "outliers" who are 3 or more standard deviations away from the mean. I think that would not be possible for these kinds of tests (we are looking for elite pre-trained listeners, not "The Average Listener"), but even if it were, it would raise the possibility that you chose the post-selection method post-hoc as well. I think many would not be happy with this kind of determination. I would not be happy with it. But if you were to do things this way anyway, it would help if your "unqualified listener" determination method was strictly specified in advance as well, as that would mitigate a major argument against it.
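A minimal sketch of that two-stage procedure (names, scores and thresholds are all invented for illustration):

```python
from scipy.stats import binom

ALPHA = 0.05  # pre-registered threshold, fixed before any listening

def p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value for `correct` of `trials` at 50% guessing."""
    return binom.sf(correct - 1, trials, 0.5)

# Stage 1: screening only -- these stats are discarded, never pooled
# into the conclusion.
stage1 = {"alice": (14, 16), "bob": (9, 16), "carol": (13, 16)}
qualified = [name for name, (k, n) in stage1.items() if p_value(k, n) < ALPHA]
print(qualified)  # ['alice', 'carol'] -- bob scored at chance level

# Stage 2: every qualified listener's result is counted and reported,
# good or bad -- no cherry-picking after the fact.
stage2 = {"alice": (12, 16), "carol": (8, 16)}
for name in qualified:
    k, n = stage2[name]
    print(name, round(p_value(k, n), 4))
```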
 

Orb

New Member
Sep 8, 2010
3,010
2
0
Good news everyone.
Our new resident expert (I think that is how he sees himself) has made a grand statement showing that we can ignore everything that happened, its value and its meaning...

I did, as they are all presented in my post. Amir simply did a home test of decidedly non-DSP/perceptual-expert Arny K's online files. That's it.
No more, no less. Not a Bob Stuart/JJ administered type AES submitted paper. You're still not clear on this? Or the significance?

cheers,

AJ

I am sure everyone will breathe a sigh of relief that they no longer have to try to find useful and important information in the middle of the 150 pages :)
Oh wait, it's AJ Soundfield.
Cheers
Orb
 
