Hi Orb,
Rather than cause more confusion, perhaps the original poster should summarize what he feels his actual protocol will be. Then we will all have a reference point; if not, we may end up creating more listening tests than necessary. For instance, do we know how closely the levels can be matched? Does the experimenter have a preamp that is granular to within 0.5 dB or better? Once we know the constraints, we can proceed with helping to construct a template experimental design.
My concern with the ghost in the machine controlling the volume is: when is the change made? On the fly, with random ups and downs depending on individual listener requests? Perhaps turning it up for one will disturb the focus of another?
My take would be: have an even number of participants and split them into two groups. Reverse the order in which each group is exposed to the stimulus, i.e. AB and BA. Keep all gear blind, including the order; participants may know the devices under test in general, but not which one is playing during the experiment. To keep things simple, do not interchange playback between devices, so that each round is dedicated to a given machine. The others can hang out in another room, outside, in the garage, or simply not show up!
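That split-and-reverse scheme can be sketched in a few lines. This is a minimal sketch, assuming participants are just names in a list; the function name and the seed parameter are mine, not from the thread:

```python
import random

def assign_groups(participants, seed=None):
    """Split an even-sized participant list into two groups, each hearing
    the devices in the reverse order of the other (AB vs BA), so that
    order effects cancel out across the whole panel."""
    if len(participants) % 2 != 0:
        raise ValueError("need an even number of participants")
    rng = random.Random(seed)  # seeded for a reproducible assignment
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "AB": shuffled[:half],  # hears device A first, then B
        "BA": shuffled[half:],  # hears device B first, then A
    }

groups = assign_groups(["P1", "P2", "P3", "P4", "P5", "P6"], seed=1)
print(groups)
```

Shuffling before splitting matters: it keeps friends who arrived together from all landing in the same order group.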
Have a survey prepared with perhaps 10 questions. Items 1-8 cover sonic attributes in tangible terms, using a 7-point Likert scale. Item 9 can ask how much you like the sonic playback of the device, on a scale of 1 to 7. If you are not able to keep the signal level within half a dB or better, a follow-up question can be offered after both devices have been heard. This one would simply ask: based on your recollection, which unit was louder? (7-point scale: A much louder, A somewhat louder, A slightly louder, the same, B slightly louder, B somewhat louder, B much louder.) What we may learn is the ability to perceive the loudness difference (in theory, if it is audible, the results should converge for everyone). We may also discover that the signal matching was close enough (I think you can get the gear matched to 1 dB, which may be audible; the test will tell us anyway).
There can be fields for explaining in greater depth what people heard when they completed the survey. These remarks are the starting point for what people discuss afterwards over wine and cheese (or beer and wings). I nominate a good bottle of wine to make things enjoyable and foment discussion.
You might control for the telephone ringing, the room temperature (will the AC need to be on and noisy, or off with people too hot?), and seat position (the same for each person for each trial, chosen randomly perhaps). You should also control for respondent age and gender (if there are women, ideally one in each group), and perhaps musical training (1: have you had formal musical instruction and training? none, some years ago, some recently, a lot, etc.) and music appreciation (1: how often do you listen to music? 2: how loud do you like to listen to music?). These will capture the respondent's prior experience, possibly trained ear, hearing damage, and loudness preference. After all, as we get older, we hear more poorly. If we blast music all the time, either we have made ourselves a little deaf or we prefer loud music (for the near-visceral impact). Whatever the case, it would be interesting to know the manner in which preferences play out. It is not enough to run a test and find out people like A over B; we want to know why. And we hope that the why is generalizable, so that we can assume it would apply to other gear. This is the point of inferential statistics. Otherwise we might just assume that everything is hunky-dory and to each his own (a very postmodern thing, by the way).
William
PS
I played in my county and high school orchestras (a long time ago) and the thrill of being in the action was intense. If I had continued (talent permitting, of course) I might now be trained, perceptive of minutiae, and deaf. As it is, the mosquito test for high frequencies has me worried that I've already lost 16 kHz+ a decade too early!
Blast, I knew I should have clarified the volume, but it was late and I thought I was off, hehe.
As you rightly mention about volume, where half the tests have A louder and half have B louder, both volume settings would have to be held constant: the quieter one may be -35 dB while the louder one may be -30 dB (appreciate these numbers mean nothing, just an example). If the preamp has no value display, then some Blu-Tack or similar could mark both volume positions.
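For a sense of scale, a throwaway sketch of the standard dB-to-ratio conversion (the -35/-30 dB figures above are just placeholders) shows that a 5 dB gap is a large amplitude difference, while the 0.5 dB matching target mentioned earlier is small:

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB into a linear voltage/amplitude ratio."""
    return 10 ** (db / 20)

# The 5 dB gap between the example -30 dB and -35 dB settings:
print(round(db_to_amplitude_ratio(5), 2))    # 1.78 -> ~78% more amplitude
# The 0.5 dB matching tolerance discussed earlier:
print(round(db_to_amplitude_ratio(0.5), 3))  # 1.059 -> ~6% more amplitude
```

So the two deliberately offset settings would differ far more than the matching tolerance, which is the point: the offset must be obvious relative to any residual mismatch.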
The problem, though, is that if the preamp has no visual display, the two volumes could vary slightly between rounds and possibly add confusion.
This is why I mentioned that the volume is zeroed and can only be controlled by an unseen person who adjusts it at the request of the listeners (so not random). This relies upon a remote control, I guess, as the person needs to be distant and behind the listeners.
In the light of day, that does look rather complicated.
Expanding a bit more on earlier thoughts.
Relating to behaviour and notes: yeah, totally agree, and maybe I was too brief in previous posts in mentioning that the listener needs to note both preference and what they hear.
The risk with noting what is heard, though, is deciding on the vocabulary used. Going too subjective can cause confusion, so I would suggest agreeing on a set of subjective descriptions/comments to be used, and, like mimesis mentions, it is a good idea to have set values for preference.
Using a Harman Kardon study as an example, they had the following preference scale:
1 Strong Dislike, 3 Dislike, 5 Neither Like/Dislike, 7 Like, 9 Strong like
For subjective descriptions this is going to be tough, but they could include some of the ones HK used in a study (furthest right is better): Colored, Harsh, Thin, Muffled, Forward, Bright, Dull, Boomy, Full, Neutral.
And then the behaviour could also be noted, as I briefly touched on, so define a set around or including: attention wandering/not interested; causes fidgeting; wanting to flick through music and finish sooner; the urge to keep listening and play longer/more; relaxed, interested, with focus drawn to the music.
Whatever you use, I feel the key is to ensure that the scale runs from worse to better, and that each term is, to some extent, a reflection of its opposite.
As a quick example, look at the HK subjective descriptions: in their case the worst is Colored while the best is Neutral.
Bear in mind the HK terms shown relate to speaker-room interaction, and you will want to change some of them to reflect a digital source.
Ah, just so you know Flez, the reason Sy mentions 12 tests is that, for statistical meaning, you would require about 9 of the 12 trials to come out the same way.
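For what it's worth, the guessing odds behind that 9-of-12 figure can be checked with a quick binomial tail sum; this sketch assumes a pure-guessing null of p = 0.5 (the function name is mine):

```python
from math import comb

def chance_of_at_least(successes, trials, p=0.5):
    """Probability of getting at least `successes` matching results in
    `trials` attempts if the listener is purely guessing (binomial tail)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

print(round(chance_of_at_least(9, 12), 3))   # 0.073 -> ~7% by luck alone
print(round(chance_of_at_least(10, 12), 3))  # 0.019 -> under 2% by luck
```

So 9 of 12 lands around a 7% chance of pure luck, just above the conventional 5% cutoff, while 10 of 12 gets clearly under it; which threshold you insist on is a judgment call for an informal test like this.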
So for a clear preferred winner that would be the goal, but there is no need to do that, as you're not really trying to make a factual point; e.g. look at those who moan on AVSForum about such tests, lol.
Not sure how Amir keeps his sanity there; glad I just lurk to read the rational posters (a bloody rare few) vs the irrational (amazingly the majority, including those who swear they are objectivists).
So if you post your experience on various forums, just be aware the responses you receive will not be the same.
Will be interested to know how this goes, including what those involved felt about it.
Cheers
Orb