I wish this discussion of peak levels would go away, as it does not contribute to any insight. The peak levels on the jingling key part of the test are at an entirely reasonable level. They are where anyone who wanted to investigate the format capability would place the levels, as can be seen from the arguments that would ensue had the peak levels been put lower, e.g. -12 dBFS. If this lower level were used then it would be rightly complained that the comparison was between 44/14 and 96/22. Arny's IM test levels also seem entirely reasonable, as I explain below. I am dismayed that some of the supposedly technical people here don't understand the relationship between peak levels and the spectrum as averaged by an FFT. It is not necessary to have any theoretical understanding of the mathematical issues involved if one has practical experience working with audio editors, sample rate converters, etc. and doing digital post production of recordings. (I have the requisite mathematical understanding and I also have hundreds of hours of experience doing digital post production, which I do on a volunteer basis.)
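To illustrate the point about peak levels versus averaged levels, here is a minimal sketch in Python. It does not use the actual test files. The two signals (a 1 kHz sine and a sparse click train standing in for impulsive material such as jingling keys) and the helper names are my own assumptions for illustration. It shows that two signals can both peak at 0 dBFS while their average (RMS) levels, which is roughly what an averaged spectrum reflects, differ by many dB.

```python
import math

def peak_db(samples):
    """Peak level in dB relative to full scale (1.0)."""
    return 20 * math.log10(max(abs(s) for s in samples))

def rms_db(samples):
    """RMS (average power) level in dB relative to full scale."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square)

def crest_factor_db(samples):
    """Peak-to-average ratio in dB."""
    return peak_db(samples) - rms_db(samples)

RATE = 48000  # hypothetical sample rate, one second of audio

# Steady tone: 1 kHz sine at full scale.
sine = [math.sin(2 * math.pi * 1000 * i / RATE) for i in range(RATE)]

# Impulsive stand-in: one full-scale sample every 100 samples, silence otherwise.
clicks = [1.0 if i % 100 == 0 else 0.0 for i in range(RATE)]

for name, sig in (("sine", sine), ("clicks", clicks)):
    print(f"{name}: peak {peak_db(sig):.2f} dBFS, "
          f"rms {rms_db(sig):.2f} dBFS, crest {crest_factor_db(sig):.2f} dB")
```

Both signals reach full scale, but the sine has a crest factor of about 3 dB while the click train has one of 20 dB. Material with a high peak to average ratio will therefore look low on an averaged display even when its peaks are near 0 dBFS.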
If we look at all of the DSP and analog components in the playback chain downstream of the output of the player (e.g. Foobar playing the result of the PC ABX choice) there are a number of places where there could be IMD. These include any DSP in Foobar (hopefully not used) or DSP in an upsampling DAC. Then there is potential nonlinearity in the DAC itself, and more likely in the IV converter and output drivers. Finally there are the preamp / amp and the speakers/phones. Audible distortion in any of these might invalidate the test. So let's look at what to do about each of these possible sources.
First, if the problem is in the DAC itself then this can be addressed by editing the 24 bit files and putting them at a lower level. This would add 24 bit dither noise to both files, having the effect of reducing the bit depth of the 16 bit material by an insignificant amount but reducing the bit depth of the 24 bit material down to, say, 22 bits. The result is still interesting: 44/16 vs. 44/22. In the case of my DAC, the Mytek stereo 192 DSD, I run this with digital volume control. The input 24 bits get converted to 32 bits and the level is then adjusted digitally. The results then go to the upsampling, SD modulator, IV and output buffer. As it turns out my gain staging is such that I have about 10 dB of headroom at the volume setting that I used to play back the files. So the DAC is unlikely to add significant distortion to either file.
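The bit depth arithmetic here can be sketched as follows. This is a rough estimate under stated assumptions, not a measurement: it uses the textbook ideal-quantizer noise floor of 6.02 N + 1.76 dB for each stage, treats the new 24 bit requantization as plain quantization noise (real dither raises the floor somewhat), and sums the two noise contributions as uncorrelated powers. The function names and the 12 dB attenuation figure are illustrative choices of mine.

```python
import math

def noise_floor_db(bits):
    """Ideal quantization noise relative to a full-scale sine, in dB."""
    return -(6.02 * bits + 1.76)

def effective_bits(source_bits, attenuation_db, container_bits=24):
    """Estimate the effective bit depth after attenuating material that was
    originally quantized to source_bits and re-saving it in a container of
    container_bits."""
    # Noise already in the material moves down with the signal, so relative
    # to the signal it is unchanged by the attenuation.
    source_noise = noise_floor_db(source_bits)
    # The new requantization noise sits at the container's floor. Relative to
    # the attenuated signal it is attenuation_db higher.
    new_noise = noise_floor_db(container_bits) + attenuation_db
    # Uncorrelated noise sources add as powers.
    total_power = 10 ** (source_noise / 10) + 10 ** (new_noise / 10)
    total_db = 10 * math.log10(total_power)
    return (-total_db - 1.76) / 6.02

for bits in (16, 24):
    print(f"{bits} bit material, 12 dB lower: "
          f"about {effective_bits(bits, 12.0):.2f} effective bits")
```

With these assumptions a 12 dB cut leaves the 16 bit material at essentially 16 bits and takes the 24 bit material to just under 22 bits, which is the asymmetry described above.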
What I worried about was my amplifier distortion and my tweeters. So I was reluctant to play the test files at a very high volume level, because I was afraid of burning out my tweeters. In this regard I would have preferred it if the IM test segment had been significantly shorter, with pause time to allow for cool down. However, I am also familiar with the gain staging of my system, and the volume setting that I used to play the test files was one that I occasionally use while listening to music, e.g. Mahler symphonies. So I eventually decided to turn the volume up to this setting, which was still a bit quieter than what actual keys jingling would sound like live (because of the peak to average ratio). My system passed the IM test as expected. (I use Focal Twin 6 BE active monitors, which are tri-amped, and there is a 100 wpc class AB amplifier on the tweeters. These are set to be rolled off -3.5 dB at a 10 kHz shelf, measured at my listening position using a calibrated microphone. This was accomplished using the back panel controls on the Focals, and was done so that the majority of my record library had a natural high frequency balance, with essentially none of the recordings being either too dull or too bright. In the flat setting, some bright recordings were unlistenable.) I did not change any of these settings for the listening tests.
My concern is not with the test files, which seem entirely reasonable. My concern is not with a reasonable playback chain, which should be able to play the recordings undistorted as given. My concern is with the test tool, specifically the Foobar-PC ABX software. By allowing the user to specify the start/stop points of the file, it makes it trivial to "game" the system, either intentionally or unintentionally. So when I read that someone used PC ABX and heard a difference, I cannot reach the conclusion that they actually heard a difference between the two test files. They could simply have been hearing differences due to switching transients. The PC ABX tool does not provide adequate control over false positive results. Controls of this type may not be needed if a single hobbyist is conducting a listening test for their own purposes, but as a means of gathering evidence with a chance of convincing others, the tool simply does not cut it. If the possibility of convincing others is not a requirement, then there is really no need for any "objective" tests in the first place.
There is no need for the PC ABX software to have this fault. One fix would be to change the PC ABX software to fade in and out at the start/stop points. Another would be to remove these buttons from the tool (or operate on the honor system and do all the testing without using these buttons at all). If this made the testing too hard, then shorter segments could be selected as part of the test software and distributed to the group. It would then be possible to vet these segments for artifacts related to start/stop. If I were serious about this test this is precisely what I would do: find a promising short segment, edit it with a fade in / fade out and then test that using PC ABX, playing the complete segment.
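As a minimal sketch of the fade in / fade out edit, here is how a short segment could be prepared so that it starts and ends at zero and cannot produce a click at its boundaries. This is only an illustration: it is not how PC ABX works internally, the function name and the raised-cosine fade shape are my own choices, and in practice one would do this in an audio editor on the real samples.

```python
import math

def apply_fades(samples, fade_len):
    """Return a copy of samples with a raised-cosine fade-in over the first
    fade_len samples and a matching fade-out over the last fade_len samples."""
    n = len(samples)
    if fade_len < 0 or 2 * fade_len > n:
        raise ValueError("fade_len must fit twice within the segment")
    out = list(samples)
    for i in range(fade_len):
        # Gain rises smoothly from 0.0 at the boundary toward 1.0.
        gain = 0.5 * (1.0 - math.cos(math.pi * i / fade_len))
        out[i] *= gain          # fade in from the start
        out[n - 1 - i] *= gain  # fade out toward the end
    return out

# Hypothetical segment: constant full-scale samples, the worst case for an
# abrupt start/stop, with a 100 sample fade at each end.
segment = [1.0] * 1000
faded = apply_fades(segment, 100)
print(faded[0], faded[50], faded[500], faded[-1])
```

The first and last samples come out at exactly zero and the middle of the segment is untouched, so whatever is heard when the complete segment is played comes from the material and not from the switch.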
Also, if these tests were intended to be a serious scientific experiment they would not have mixed apples and oranges. They would have tested a single aspect of PCM formats, e.g. 44/16 vs 44/24 or 44/24 vs. 96/24.