It does indeed illustrate your point - 20 respondents correctly identified all 3 of the 24-bit samples, but they were then diluted by the random guessers to the point that the overall result became a null result, no better than random guessing.
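To put rough numbers on the dilution, here is a minimal sketch of a one-sided test against guessing. The 20 respondents and 3 samples are from the thread; the number of guessers is a made-up figure purely for illustration, and I'm assuming each sample is a plain 50/50 choice and that the guessers land exactly on chance.

```python
from math import comb

def guess_p_value(correct: int, trials: int) -> float:
    """One-sided P(X >= correct) if every trial is a 50/50 guess."""
    favourable = sum(comb(trials, i) for i in range(correct, trials + 1))
    return favourable / 2**trials  # exact integer counts, one final division

TRIALS_EACH = 3   # three samples per respondent (from the thread)
discerners = 20   # respondents with all three correct (from the thread)
guessers = 1000   # HYPOTHETICAL head count, not the real survey size

# Assumption: the guessers get exactly half of their trials right.
sub_correct = discerners * TRIALS_EACH
guess_trials = guessers * TRIALS_EACH
pooled_correct = sub_correct + guess_trials // 2
pooled_trials = sub_correct + guess_trials

p_subgroup = guess_p_value(sub_correct, sub_correct)   # the 20 on their own
p_pooled = guess_p_value(pooled_correct, pooled_trials)  # everyone lumped together

print(f"subgroup alone: p = {p_subgroup:.2e}")
print(f"pooled:         p = {p_pooled:.3f}")
```

With these assumed figures the pooled p-value ends up well above 0.05 even though the subgroup's trials are all correct. Bear in mind the subgroup figure ignores that the 20 were picked out after the results were in, so it shows the dilution effect rather than proving those 20 can hear a difference.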
As you say in your post: "'scientific' statement about audibility of 16 vs 24, is wrong and we run foul of simpson's paradox".
On the other hand, if the objective is to see what a sample of the "audiophile community" can discern (without training), then it does answer this - no, they can't. But is this surprising? No pre-selection of participants was used to check for hearing loss or other defects which may diminish their ability to discern audio impairments of a known level.
The stated objectives were:
1. How "easy" was it for people to detect (or report) a difference?
2. How accurate were the respondents in detecting the 24-bit sample?
It's somewhat difficult to know what question is being addressed and therefore what the null hypothesis is.
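For objective 2, one candidate null hypothesis is "every respondent is guessing on every sample". A rough sketch of what that null predicts, assuming each of the 3 samples is an independent 50/50 pick (my assumption, not something the test write-up states) and using a hypothetical respondent count:

```python
from math import comb

def score_distribution(trials: int = 3) -> list[float]:
    """P(exactly k correct) for k = 0..trials under pure 50/50 guessing."""
    return [comb(trials, k) / 2**trials for k in range(trials + 1)]

def expected_perfect(respondents: int, trials: int = 3) -> float:
    """Expected number of all-correct respondents if everyone guesses."""
    return respondents * score_distribution(trials)[-1]

dist = score_distribution()  # chances of 0, 1, 2, 3 correct
n_hypothetical = 160         # HYPOTHETICAL respondent count, for illustration only

print(dist)
print(expected_perfect(n_hypothetical))
```

Under this null a guesser scores 3 out of 3 one time in eight, so the count of perfect scorers only means something once it is compared with the total number of respondents. A different null (say, per-respondent rather than per-trial) would need a different calculation, which is why pinning down the question matters.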