Understanding ABX Test Confidence Statistics

amirm

Banned
Apr 2, 2010
Seattle, WA
OK, a mouthful of words for the title, but this is an important topic for which there is next to nothing online. The issue has come up recently because of the double blind test published by Stuart et al.: as I explain below, the threshold of confidence for the results was just 56% right answers. This has caused many to dismiss the results as little better than "chance." That is completely wrong. Below is my explanation from AVS Forum, written in response to a poster making the same mistake. I will turn this into a formal article later, but I thought I would share it now to raise awareness of this important topic.

===========

m.zillch on AVS Forum said:
"56% correct responses don't lie (instead of a random coin flip's 50% results) and it conclusively shows, with statistical significance, that yes, you made the right decision to only buy THE BEST!" - not a real quote :D

I have answered this a few times, but since the confusion seems persistent, let me explain it in more detail.

ABX is a type of "forced choice" testing. At all times, the listener can vote on X being A or B. He has the answers in front of him; he just has to select the right one. Or vote randomly. We want to separate these two outcomes. To do that we use statistical analysis, and pick a threshold that says the probability of the results coming from random voting is less than 5%. Or put inversely, a 95% chance that the results are not due to chance. Everyone more or less knows this part.
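
If it helps to see the forced-choice protocol concretely, here is a tiny Python sketch of the bookkeeping (my illustration only; run_abx and listener_answers are made-up names, not from any real ABX software):

[code]
import random

def run_abx(n_trials, listener_answers):
    """Forced-choice ABX: each trial X is secretly A or B; count right answers."""
    right = 0
    for _ in range(n_trials):
        x = random.choice("AB")       # X is randomly assigned every trial
        if listener_answers() == x:   # the listener must answer "A" or "B"
            right += 1
    return right

# A listener who hears no difference is statistically a coin flipper:
guesser = lambda: random.choice("AB")
print(run_abx(160, guesser), "right out of 160")
[/code]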

What is not known is the math that leads to this and how non-intuitive it is. Before I get into that: zillch is referencing the Stuart et al. peer-reviewed listening test that was published in the AES journal. In it, they mention that the threshold they had to cross was 56%, hence the number zillch is using above. Note that this was NOT the outcome. The actual outcome was better than this. But the threshold for the 95% confidence level was just 56% of the listener answers being right.

As zillch says, this makes no sense, right? I mean, 50% correct answers would be "pure chance" and the listener guessing. How on earth can getting just 6% more right answers get us to 95% confidence? The answer lies in statistics. And the math here is conclusive and not subject to debate. Let me explain a bit of it.

Our ABX test has a statistical distribution that is "binomial." The listener either gets each answer right or wrong (hence the prefix "bi": two outcomes). The probability of getting an answer right by pure guessing is 0.5, or one out of two chances of being right. Given these two values, statistical math instantly tells us how many "right" answers we have to get to achieve the 95% confidence we desire.
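
For those who want the formula behind this (my notation; the post itself skips the math): with n trials and a guessing probability of 1/2, the number of right answers X follows

P(X = k) = \binom{n}{k} \left(\tfrac{1}{2}\right)^{n}

and the threshold we are about to compute is the smallest k* whose cumulative probability reaches the confidence level:

k^{*} = \min\left\{ k : \sum_{i=0}^{k} \binom{n}{i} \left(\tfrac{1}{2}\right)^{n} \ge 0.95 \right\}

This is exactly the quantity Excel's BINOM.INV returns.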

If you want to follow along and repeat the math I am about to show you, and you have Excel, the formula is BINOM.INV; for example, =BINOM.INV(160, 0.5, 0.95) computes the answer for 160 trials. Here are the number of right answers we need for different numbers of trials to achieve 95% confidence, and the percentage right that each represents:

Trials: Number Right, Percent
10: 8, 80%
20: 14, 70%
40: 25, 63%
80: 47, 59%
160: 90, 56%

Bam! :D We get the same answer as in the Stuart paper. It only takes 90 right answers out of the 160 trials they ran, or 56% right, to achieve 95% confidence that the results were not due to chance.
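
If you don't have Excel, here is a short Python sketch (my addition, assuming SciPy is installed) that reproduces the table; scipy.stats.binom.ppf uses the same convention as Excel's BINOM.INV:

[code]
# Smallest number of right answers whose cumulative probability under
# pure guessing (p = 0.5) reaches 95% -- same convention as BINOM.INV.
from scipy.stats import binom

for trials in (10, 20, 40, 80, 160):
    need = int(binom.ppf(0.95, trials, 0.5))
    print(f"{trials:>4} trials: {need} right ({need / trials:.1%})")
[/code]

Running it prints 8, 14, 25, 47 and 90, matching the table above.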

To really blow your mind, we only need 95 right answers out of 160 to achieve 99% confidence that the results are not due to chance! That is only 59% right answers!!!
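
The same call confirms the 99% case (again my addition, same SciPy assumption as above):

[code]
from scipy.stats import binom

# 99% confidence threshold for 160 trials:
print(int(binom.ppf(0.99, 160, 0.5)))   # -> 95, i.e. about 59% right
[/code]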

Again, what I just explained is purely statistical theory and math. It cannot be debated or second-guessed. It says what it says and that is the end of that. The fact that our gut tells us 50% is pure chance, so 59% surely cannot mean 99% confidence, is exactly why we should not use lay intuition to examine these complex topics.

As I said at the outset, the results of the Stuart test were actually better than 56%, as I have shown before. Here are the results again:

[Chart from the Stuart et al. paper: percent correct for each test condition, with the 95% confidence threshold shown as a dashed line]

The dashed line is the 95% confidence line. The vertical bars show the percent right. Notice how, with the exception of one test, the rest easily clear the 95% confidence threshold of 56% right answers. So there is nothing there to make fun of. Here is the paper itself saying the same:

The dotted line shows performance that is significantly different from chance at the p<0.05 level calculated using the binomial distribution (56.25% correct comprising 160 trials combined across listeners for each condition).

So in summary, you cannot, can NOT, use the percentage of right answers as your confidence in the outcome of an ABX test. The magnitude of that percentage is, on its own, meaningless, because there is another critical variable: the number of trials. You need to compute the statistical formula and rely on that. Doing otherwise just leads to wrong conclusions. The proof of this is mathematical; it is not debatable or a matter of opinion.
 

jkeny

Industry Expert, Member Sponsor
Feb 9, 2012
Ireland
Thanks, Amir - a nice clear explanation.
I believe the man in the street (& a lot of ABX testers) feels that if there is a worthwhile, noticeable difference between two things, then the chances of differentiating between them will be near 100%.
They fail to recognise why an ABX test is used & that its results are based on statistical analysis.

BTW, did you write that article for WSR on the Stuart paper yet?
 

jkeny

Industry Expert, Member Sponsor
Feb 9, 2012
Ireland
amirm said:
I did. It should be in print now in WSR magazine. I will put it online in a few weeks.

Great - looking forward to reading it, thanks.
 

astrotoy

VIP/Donor
May 24, 2010
SF Bay Area
Thanks, Amir. Great explanation.

Another example is coin flips. If you do only 5 flips of a "fair" coin, the more common side will always come up at least 60% of the time (at least 3 of 5). However, if you do 1000 flips and 560 come up heads, then you know with very high probability that you have a biased coin. If you had a roulette wheel in Las Vegas that turned up red 56% of the time and black 44% (ignoring the green spots), you could make a lot of money. So with enough trials, 56% is very far from 50%.
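
As a quick sanity check on the 1000-flip case (a SciPy sketch in the same spirit as the ones above; my addition):

[code]
# Probability that a fair coin shows 560 or more heads in 1000 flips.
# binom.sf(559, ...) is P(X >= 560); it comes out well under 1 in 1,000.
from scipy.stats import binom

print(binom.sf(559, 1000, 0.5))
[/code]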

Another example, very close to the n=160 case, is a baseball season, which is 162 games. A team that wins 91 games and loses 71 is a 56% winning team, and usually wins the division pennant or at least makes the playoffs. Contrast this with a .500 team: 81 wins, 81 losses. They are 10 games back of our 91-win team and are back in their off-season homes for the month of October. A big difference between 56% and 50% over a 162-game season.

Larry
 

microstrip

VIP/Donor
May 30, 2010
Portugal
Amir,

Great that you came to this subject in such a concise and applied way, although I fear you are putting too much pressure on most ABX testers ;). Can I suggest you also give us a complementary explanation of what statistical significance / confidence level means?
 
