Hi Orb,
Rather than cause more confusion, perhaps the original poster should summarize what he feels his actual protocol will be. Then we will all have a reference point; if not, we may end up creating more listening tests than necessary. For instance, do we know how closely the levels can be matched? Does the experimenter have a preamp that is granular to within 0.5 dB or better? Once we know the constraints, we can proceed with helping to construct a template experimental design.
My concern with the ghost in the machine controlling the volume is: when is the change made? On the fly, with random ups and downs depending on individual listener requests? Perhaps turning it up for one will disturb the focus of another?
My take would be: have an even number of participants and split them into two groups. Reverse the order in which each group is exposed to the stimulus, i.e. AB and BA. Keep all gear blind, including the order; participants may know the devices under test in general, but not which one is playing during the experiment. To keep things simple, do not interchange playback between devices, so that each round is dedicated to a given machine. The others can hang out in another room, outside, in the garage, or simply not show up!
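That split-and-reverse scheme can be sketched in a few lines. This is a minimal sketch, assuming participants are just names in a list; the function name and the seed parameter are mine, not from the thread:

```python
import random

def assign_groups(participants, seed=None):
    """Split an even-sized participant list into two groups, each hearing
    the devices in the reverse order of the other (AB vs BA), so that
    order effects cancel out across the whole panel."""
    if len(participants) % 2 != 0:
        raise ValueError("need an even number of participants")
    rng = random.Random(seed)  # seeded for a reproducible assignment
    shuffled = participants[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {
        "AB": shuffled[:half],  # hears device A first, then B
        "BA": shuffled[half:],  # hears device B first, then A
    }

groups = assign_groups(["P1", "P2", "P3", "P4", "P5", "P6"], seed=1)
print(groups)
```

Shuffling before splitting matters: it keeps friends who arrived together from all landing in the same order group.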
Have a survey prepared with perhaps 10 questions. Items 1-8 cover sonic attributes in tangible terms, using a 7-point Likert scale. Item 9 can ask how much you like the sonic playback of the device, on a scale of 1 to 7. If you are not able to keep the signal level within half a dB or better, a follow-up question can be offered after both devices have been heard. This one would simply ask: based on your recollection, which unit was louder? (7-point scale: A much louder, A somewhat louder, A slightly louder, the same, B slightly louder, B somewhat louder, B much louder.) What we may learn is the ability to perceive the loudness difference (in theory, if it is audible, the results should converge for everyone). We may also discover that the signal matching was close enough (I think you can get the gear matched to 1 dB, which may be audible; the test will tell us anyway).
There can be fields for explaining in greater depth what people heard when they completed the survey. These remarks are the starting point for what people discuss afterwards over wine and cheese (or beer and wings). I nominate a good bottle of wine to make things enjoyable and foment discussion.
You might control for the telephone ringing, the room temperature (will the AC need to be on and noisy, or off with people too hot?), and seat position (the same for each person for each trial, chosen randomly perhaps). You should also control for respondent age and gender (if there are women, ideally one in each group), and perhaps musical training (1: have you had formal musical instruction and training? none, some years ago, some recently, a lot, etc.) and music appreciation (1: how often do you listen to music? 2: how loud do you like to listen to music?). These will capture the respondent's prior experience, possibly trained ear, hearing damage, and loudness preference. After all, as we get older, we hear more poorly. If we blast music all the time, either we have made ourselves a little deaf or we prefer loud music (for the near-visceral impact). Whatever the case, it would be interesting to know the manner in which preferences play out. It is not enough to run a test and find out people like A over B; we want to know why. And we hope that the why is generalizable, so that we can assume it would apply to other gear. This is the point of inferential statistics. Otherwise we might just assume that everything is hunky-dory and to each his own (a very postmodern thing, by the way).
William
PS
I played in my county and high school orchestras (a long time ago) and the thrill of being in the action was intense. If I had continued (talent permitting, of course) I might now be trained, perceptive of minutiae, and deaf. As it is, the mosquito test for high frequencies has me worried that I've already lost 16 kHz+ a decade too early!
Blast, I knew I should have clarified the volume, but it was late and I thought I was off, hehe.
As you rightly mention about volume, where half the tests have A louder and half have B louder, both volume settings would have to be held constant: the quieter one may be -35 dB while the louder one may be -30 dB (appreciate these numbers mean nothing, just an example). If the preamp has no value display, then some Blu-Tack or similar could mark both volume positions.
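For a sense of scale, a throwaway sketch of the standard dB-to-ratio conversion (the -35/-30 dB figures above are just placeholders) shows that a 5 dB gap is a large amplitude difference, while the 0.5 dB matching target mentioned earlier is small:

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB into a linear voltage/amplitude ratio."""
    return 10 ** (db / 20)

# The 5 dB gap between the example -30 dB and -35 dB settings:
print(round(db_to_amplitude_ratio(5), 2))    # 1.78 -> ~78% more amplitude
# The 0.5 dB matching tolerance discussed earlier:
print(round(db_to_amplitude_ratio(0.5), 3))  # 1.059 -> ~6% more amplitude
```

So the two deliberately offset settings would differ far more than the matching tolerance, which is the point: the offset must be obvious relative to any residual mismatch.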
The problem, though, is that if the preamp has no visual display, the two volumes could vary slightly between rounds and possibly add confusion.
This is why I mentioned that the volume is zeroed and can only be controlled by an unseen person who adjusts it at the request of the listeners (so not random). This relies upon a remote control, I guess, as the person needs to be distant and behind the listeners.
In the light of day, that does look rather complicated.
Expanding a bit more on earlier thoughts.
Relating to behaviour and notes: yeah, totally agree, and maybe I was too brief in previous posts in mentioning that the listener needs to note both preference and what they hear.
The risk with noting what is heard, though, is deciding on the vocabulary used. Going too subjective can cause confusion, so I would suggest agreeing on a set of subjective descriptions/comments to be used, and, like mimesis mentions, it is a good idea to have set values for preference.
Using a Harman Kardon study as an example, they had the following preference scale:
1 Strong Dislike, 3 Dislike, 5 Neither Like/Dislike, 7 Like, 9 Strong like
For subjective descriptions this is going to be tough, but they could include some of the ones HK used in a study (furthest right is better): Colored, Harsh, Thin, Muffled, Forward, Bright, Dull, Boomy, Full, Neutral.
And then the behaviour could also be noted, as I briefly touched on, so define a set around or including: attention wandering/not interested; causes fidgeting; wanting to flick through music and finish sooner; the urge to keep listening and play longer/more; relaxed, interested, with focus drawn to the music.
Whatever you use, I feel the key is to ensure that the scale runs from worse to better, and that each term is, to some extent, a reflection of its opposite.
As a quick example, look at the HK subjective descriptions: in their case the worst is Colored while the best is Neutral.
Bear in mind the HK terms shown relate to speaker-room interaction, and you will want to change some of them to reflect a digital source.
Ah, just so you know Flez, the reason Sy mentions 12 tests is that, for statistical meaning, you would require about 9 of the 12 trials to come out the same way.
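For what it's worth, the guessing odds behind that 9-of-12 figure can be checked with a quick binomial tail sum; this sketch assumes a pure-guessing null of p = 0.5 (the function name is mine):

```python
from math import comb

def chance_of_at_least(successes, trials, p=0.5):
    """Probability of getting at least `successes` matching results in
    `trials` attempts if the listener is purely guessing (binomial tail)."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

print(round(chance_of_at_least(9, 12), 3))   # 0.073 -> ~7% by luck alone
print(round(chance_of_at_least(10, 12), 3))  # 0.019 -> under 2% by luck
```

So 9 of 12 lands around a 7% chance of pure luck, just above the conventional 5% cutoff, while 10 of 12 gets clearly under it; which threshold you insist on is a judgment call for an informal test like this.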
So for a clear preferred winner that would be the goal, but there is no need to do that, as you're not really trying to make a factual point; e.g. look at those who moan on AVSForum about such tests, lol.
Not sure how Amir keeps his sanity there; glad I just lurk to read the rational posters (a bloody rare few) vs the irrational (amazingly the majority, including those who swear they are objectivists).
So if you post your experience on various forums, just be aware the responses you receive will not be the same.
Will be interested to know how this goes, including what those involved felt about it.
Cheers
Orb