But even that doesn't tell the whole story. The impedance may dip further on large transients, for instance. And impedance is itself just a 'metaphor' for what the feedback amplifier is actually doing, just as the frequency domain is a convenient substitute for studying what the amplifier does with real signals in the time domain - which is, apparently, too hard to do. People test amplifiers with resistive loads and, when challenged about this, say that because they know how the amplifier works, they can be sure that it is valid. (In another forum, I saw that any suggestion of building a more representative dummy load was met with sneers from the real experts.) This presumption that the tester knows more about how an amplifier works than the amplifier designer does is part of what seems like a circularity in some audio testing.
So testing leaves gaps where people can speculate about transient intermodulation distortion, thermal distortion effects, power supply effects, difficult loads, etc., and just shrug their shoulders, saying that amplifiers that measure perfectly can also sound poor with real music. The obvious thing, it seems to me, is to test the amplifier with real music and real loads, and measure the deviation from "the straight wire with gain". There may have been technical challenges that historically drove people to frequency-domain testing, but I think that we now have the capability to do it properly in the time domain. The key, I believe, is that there are certain deviations from "straight wire with gain" that people assert are inevitable but harmless, but which are exactly what prevents you from making meaningful time-domain measurements.
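To make that last point concrete, here is a minimal Python sketch (my own illustration, not a real measurement; the 26 dB gain, 1 ms latency and 48 kHz sample rate are assumed numbers) of why naively subtracting input from output tells you nothing: even a hypothetical perfect amplifier shows a huge raw residual, because the harmless gain and delay swamp everything else.

```python
# Sketch only: a hypothetical "perfect" amplifier with gain 20 (26 dB)
# and ~1 ms of latency. The raw input/output difference is enormous,
# even though the device is, by construction, a straight wire with gain.
import numpy as np

fs = 48_000                        # sample rate, Hz (assumed)
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1_000 * t)  # stand-in for "real music"

gain, delay = 20.0, 48             # 26 dB and 1 ms: benign deviations
y = gain * np.roll(x, delay)       # idealised amplifier output

raw_residual = y - x
print(np.sqrt(np.mean(raw_residual**2)))  # large, but meaningless
```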
My idea would be to drive the device under test with real music and/or 'difficult' synthesised signals into a real load, and simultaneously sample the input and output. I would then apply a series of invariant, predictable processes to the output signal to null out errors between the two signals: starting with a gain adjustment, then a time delay (both harmless), then a frequency-dependent phase shift (arguably harmless) as necessary. It would then start to get interesting, as we might find we have to apply 'pre-distortion' to null out the effects of a clipping-style or 'peaking' characteristic. We might begin to notice dips in gain or other 'blips', e.g. ringing, following sharp transients. We could quantify the deviations, classify them, and perhaps boost them to listen to their audible offensiveness against the desired signal (using headphones and a known reference headphone amplifier, say). This would not be possible using frequency-domain measurements. A rough sketch of the first nulling stages follows below.

Related to this, there is also the possibility of coming up with a pre-distorting system that perfectly corrects the amplifier, in which case simply build it into the amp! But presumably, in the real world, distortions will drift with time and temperature, and repeated tests would reveal the extent of this.
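Assuming the numpy/scipy toolchain and the x and y signals from the earlier snippet, the first nulling stages might look something like this: estimate and remove the bulk delay and scalar gain, then fit a short FIR filter to absorb the 'arguably harmless' frequency-dependent gain and phase. The tap count and the circular-shift approximation are illustrative choices on my part, not a finished measurement rig.

```python
# Sketch of the nulling idea: strip out the benign delay and gain,
# absorb linear frequency-dependent behaviour with a fitted FIR,
# and only then inspect what is left over.
import numpy as np
from scipy.signal import correlate, correlation_lags

def null_gain_and_delay(x, y):
    """Remove bulk delay and scalar gain from y relative to x."""
    # Bulk delay from the peak of the cross-correlation.
    lags = correlation_lags(len(y), len(x))
    delay = lags[np.argmax(correlate(y, x))]
    y_aligned = np.roll(y, -delay)
    # Scalar gain by least squares: minimises |y_aligned - g*x|^2.
    gain = np.dot(y_aligned, x) / np.dot(x, x)
    return y_aligned / gain, delay, gain

def null_linear_response(x, y, taps=65):
    """Fit a short FIR filter h so that (h * x) best matches y,
    soaking up frequency-dependent gain and phase shift."""
    # Circulant (wrap-around) convolution matrix - an approximation
    # that is fine for a sketch, not for a production test rig.
    X = np.column_stack([np.roll(x, k) for k in range(taps)])
    h, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ h, h    # residual after the linear model

# Usage with the earlier synthetic signals:
y_flat, delay, gain = null_gain_and_delay(x, y)
residual, h = null_linear_response(x, y_flat)
# Anything that survives in `residual` is, by construction, what a
# straight wire with gain would not have done: clipping, ringing,
# thermal drift... and could be boosted and auditioned as described.
```

Fitting the FIR by least squares rather than dividing spectra keeps the whole procedure in the time domain, which is the point of the exercise: the linear 'harmless' behaviour is modelled and removed, and the residual is left intact for classification and listening.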
The results would still be open to some interpretation, of course, but the hope would be that the best amplifiers came so close to perfect that they could be pronounced 'good' unambiguously. Amplifier designers would be incentivized to build amps to pass the tests, but the tests would be so realistic and representative of real audio that the result must be a better all-round amp, rather than something designed merely to be good with a few specific signals.