Figure 1: Singer Frieda Hempel conducting a Tone Test at Edison Studios, NYC in 1918. Note that many of the listeners' ears are covered by the blindfolds, making it a double-blind and double-deaf listening test, since the experimenter, Edison, was deaf himself.
Recently I was asked how I could possibly prove or assert that listeners prefer accurate loudspeakers without having performed a live-versus-recorded listening test. This is a test where the listener compares a live musical performance to a recording of the performance reproduced through loudspeakers. The closer the sound quality of the reproduction is to that of the live performance, the more accurate the loudspeaker is deemed to be - at least in theory. In practice, these tests are usually riddled with so many uncontrolled listening test nuisance variables that the results are essentially meaningless. This article examines why live-versus-recorded listening tests are not suitable for serious scientific investigations of the perceived sound quality of recorded and reproduced sound.
Edison’s Tone Tests: “People will hear what you tell them to hear”
Thomas Edison was one of the first audio engineers to embrace live-versus-recorded demonstrations. In 1910, he invented the Edison Diamond Disc Phonograph, which he claimed had “no tone” of its own. To prove it, a series of road shows was given across the United States where about 4,000 live-versus-recorded demonstrations of his phonograph were conducted in auditoriums. At some point during the live music performance there would be a switch over to the recorded performance, and apparently audience members could not tell the difference between the live and recorded performances.
After a 1916 live-versus-recorded demonstration in Carnegie Hall, the New York Evening Mail stated, “the ear could not tell when it was listening to the phonograph alone, and when to actual voice and reproduction together. Only the eye could discover the truth by noting when the singer’s mouth was open or closed.”
By today’s standards, the fidelity of Edison’s disc phonograph was very poor in terms of its noise, distortion, limited dynamic range, bandwidth and frequency response (some of Edison’s recordings can be heard online). It’s hard to imagine that listeners were fooled into thinking his Diamond Disc recording could not be distinguished from the live performance. In fact, we now know that Edison manipulated the tests to produce the results he wanted. First, he carefully chose the music and musicians to work within the technical limitations of his technology. Edison detested music with extreme dynamics, high tones, vibrato and complex textures because they were a challenge to his deafness and his Tone Tests. He selected and coached musicians to mimic the sound of their recordings to minimize the audible differences between live and recorded performances.
Secondly, Edison was the consummate audio salesman and was known to say, “People will hear what you tell them to hear.” The expectations and perceptions of his listeners were manipulated before the test to produce a more predictable outcome. Audience members were given a concert program before his Tone Tests that clearly told them exactly what they would hear, how amazing it would sound, and what an appropriate response would be:
“Those who hear this test will realize fully for the first time how literally true it is that Mr. Edison has made possible the re-creation of the artist’s voice. No more exacting test could be made to demonstrate that the New Edison actually does re-create the voice of the artist than to play it side by side with the artist who made the records. This is the final proof. Close your eyes. See if you can distinguish the voice of the New Edison from that of the artist. Did you ever believe it possible to re-create a voice? Note that the voice of the artist and the voice of the Edison are indistinguishable” [emphasis is mine] [3].
Figure 2: Another Edison Tone Test where biases related to sight and smell may have compromised the results based on the many listeners covering their noses. Did a bad case of singer's halitosis make it possible to identify the live performance based on smell alone?
Other Live-versus-Recorded Demonstrations
Following Edison’s live-versus-recorded demonstrations, other tests were conducted in the 1950s by Harry Olson at RCA, G.A. Briggs at Wharfedale, and Peter Walker at Quad. A common problem with these demonstrations was double reverberation: the reverberation of the room was heard both in the recording, and again when it was reproduced through loudspeakers in the same room. This made it easier for listeners to tell the difference between the recorded and live performances.
Acoustic Research's Live-Versus-Recorded Demonstrations
During the 1960s, Acoustic Research (AR), an American loudspeaker company, performed over 75 live-versus-recorded concerts in cities around the USA featuring The Fine Arts String Quartet and the AR-3 loudspeaker. To solve the double reverberation problem, the recordings of the quartet were made in an anechoic chamber, or outdoors. Outdoor live-versus-recorded demonstrations had the added benefit that there were no room reflections in either the recording or the live performance. This made the demonstrations less sensitive to off-axis problems in the microphones and loudspeakers. It also eliminated the challenge of capturing and reproducing the complex spatial properties of a reverberant performing space.
The AR demonstrations apparently generated an enormous amount of free publicity in newspapers and audio magazines, where it was reported that the reproduction of the recordings was virtually indistinguishable from the live performance. AR sales increased dramatically, to the point where in 1966 AR apparently held a 32% share of the loudspeaker market in the United States.
A Live-Versus-Recorded Method For Testing Loudspeaker Accuracy
Edgar Villchur, the head of Acoustic Research, to his credit, was a firm believer that loudspeakers should accurately reproduce the art (the recorded music) and not editorialize or enhance it. In a 1962 paper, he described a live-versus-recorded method for evaluating the accuracy of loudspeakers. The method used a reference loudspeaker (the live performance) that was placed in the listening room with the loudspeaker-under-test. The goal of the loudspeaker-under-test was to accurately reproduce a recording of the reference loudspeaker playing white noise in an anechoic chamber. The original white noise was also fed to the reference loudspeaker during the listening test. The more similar the loudspeaker-under-test sounded to the reference speaker, the more accurate it was, at least in theory.
Villchur acknowledged that the sensitivity and validity of the method depended on the quality of the reference loudspeaker, its directivity, and the choice of program material; white noise was more revealing of loudspeaker inaccuracies than music. His reference loudspeaker consisted of a single 2-inch midrange from an AR-3 loudspeaker because he found that using multiple drivers caused acoustical interference that was audible in the anechoic chamber, but not so audible in a reverberant listening room; these differences would produce errors in the listening test. One wonders how a tiny 2-inch driver could have produced adequate high treble and low bass without distortion. These limitations would significantly restrict the accuracy and usefulness of this listening test method.
Another problem with this method was that the anechoic loudspeaker recordings were made at a single point in space, and did not capture the directivity and off-axis characteristics of the reference loudspeaker. Unless the speaker-under-test had exactly the same directivity and off-axis characteristics as the reference loudspeaker, it could never sound exactly the same in a reflective listening room. To compensate for these errors, Villchur used a trial-and-error process to find the best microphone position relative to the reference loudspeaker, where the timbre of the anechoic recording best matched the timbre of the reference loudspeaker when placed in a room. Adjusting the recording to mimic the sound of the live performance is the reverse of what Edison’s musicians would do, but essentially it produced the same bias. (Edison would have been proud!)
Finally, it is not clear how Villchur controlled loudspeaker positional biases when comparing the reference loudspeaker to the loudspeaker-under-test. Loudspeaker positional biases have been shown to produce audible effects that can be larger than the audible differences between different models of loudspeakers. At Harman, these positional biases are eliminated via an automated speaker shuffler that places each loudspeaker in the same position in the room.
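Where a shuffler is not available, one textbook alternative is to counterbalance positions across trial blocks so that every loudspeaker spends equal time in every position. The sketch below is only an illustration of that idea, not a description of Villchur's or Harman's procedure; the function name and the loudspeaker labels are hypothetical. Note that counterbalancing averages positional effects out across blocks rather than removing them within a single trial, as a shuffler does.

```python
def counterbalanced_positions(loudspeakers):
    """Return one row per trial block; row[i] is the loudspeaker at position i.

    Uses a cyclic Latin square, so every loudspeaker occupies every
    position exactly once across the blocks.
    """
    n = len(loudspeakers)
    return [[loudspeakers[(block + pos) % n] for pos in range(n)]
            for block in range(n)]


# Hypothetical labels for three loudspeakers under test.
schedule = counterbalanced_positions(["A", "B", "C"])
for block, row in enumerate(schedule, start=1):
    print(f"block {block}: left={row[0]}, center={row[1]}, right={row[2]}")
```

With three loudspeakers this yields three blocks, and each loudspeaker appears once at the left, center and right positions.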
Summary of Problems with Live-versus-Recorded Tests
By today’s standards, the live-versus-recorded tests performed to date lack the scientific controls and rigor necessary for their results or conclusions to be considered accurate, repeatable and valid. Below are a few of the most significant psychological, physical, methodological or experimental listening variables that plague these types of tests. While it is possible to control some of these variables, others are either impossible, impractical or too expensive to control.
Sighted and Cross-Modality Biases
To date, most of the live-versus-recorded tests have been performed sighted, where non-auditory cues were available to allow the listener to identify whether they were hearing the live or reproduced sound source. These tests could have been easily made blind via an acoustically transparent curtain; however, scientific validity was apparently not the primary purpose of the test. The visual cues from the musicians (bowing, lip syncing) would also enhance the realism and presence of the reproduction, a well-known cognitive effect observed in research on binaural and virtual reality displays.
Listener Expectation, Authority Bias, Group Interaction Bias
In many of the public live-versus-recorded demonstrations, listeners' expectations were manipulated by knowledge given to them by the organizers of the demonstrations. In some cases, listeners were told what the expected response should be before the test began (see Edison's concert programs above). In large group settings, listeners' responses can be easily swayed by the opinions and reactions of other members of the group (a herd mentality), especially when an authority figure is present. In principle, these biases could be removed from live-versus-recorded tests by repeating the test for each individual listener. However, the live and recorded performances would then have to be replicated for every listener, which makes the tests too difficult, expensive, time-consuming and impractical to use.
Qualifications of Listeners
None of the live-versus-recorded tests I've read about have reported the hearing and critical listening qualifications of the listeners who participated in them. These are important variables in the sensitivity and reliability of the test results, and can be easily quantified.
Live and Recorded Performances Must Be Identical
For live-versus-recorded tests to be valid, the live and recorded performances should be identical, having the same notes, intonation, tempo, dynamics, loudness, balance between instruments, and the same location and sense of space of the instruments. Otherwise, there are extraneous cues other than sound quality that allow listeners to readily identify the live and recorded versions. MIDI-controlled instruments (e.g. player pianos) are but one example of how this problem could be resolved.
Positional Biases from Live and Reproduced Sound Sources
Unless the live and reproduced (e.g. loudspeakers) sound sources occupy the same physical locations, the listener can always identify the live versus recorded versions based on the localized positions of the sound sources.
Errors in the Recording
The usefulness of live-versus-recorded methods for perceptual measurements of sound quality in the playback chain is severely limited by errors in the recording. The recording errors are not easily separated from the errors in the playback chain (see circle-of-confusion). Microphones and microphone techniques both contain errors that limit the timbral, spatial and dynamic accuracy of the recordings through which we judge loudspeakers. Apparently the most effective live-versus-recorded demonstrations were conducted outdoors - effectively an anechoic environment - where the off-axis performances of the microphones and loudspeakers, and the complex spatial cues of a reflective room were largely removed as factors from the experiment. However, results from outdoor live-versus-recorded tests cannot be generalized to how the loudspeakers would perform in real rooms, where the off-axis sounds provide a significant contribution towards the listener's impression of the loudspeaker.
Lack of Proper Scientific Protocols, Listener Response Data, Statistical Analysis, Results
The most interesting characteristic of live-versus-recorded tests is that they never seem to provide listener response data, statistical analysis or published results. Eyewitness reports written in newspapers or magazines do not constitute scientific evidence.
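As a rough illustration of the kind of analysis that is missing, suppose each blind trial asks a listener to say which presentation was live, so that pure guessing is correct half the time. A minimal sketch of an exact one-sided binomial test follows; the function name and the trial counts are hypothetical and do not come from any of the demonstrations discussed here.

```python
from math import comb


def binomial_p_value(correct: int, trials: int, chance: float = 0.5) -> float:
    """Probability of getting at least `correct` answers right by guessing alone."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical result: 14 correct identifications in 20 blind trials.
p = binomial_p_value(correct=14, trials=20)
print(f"p = {p:.4f}")
```

For this made-up result the p-value is about 0.058, so 14 out of 20 would not quite reach the conventional 0.05 threshold for concluding that the listener could tell live from recorded. Reporting numbers like these, along with the raw responses, is what separates an experiment from an anecdote.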
Accuracy is Not Applicable to Most Recordings Made Today
Most recordings made today are not intended to sound like the live performance. Anyone who heard Taylor Swift's live performance with Stevie Nicks at the 2010 Grammy Awards understands why. About 90% of commercial recordings are studio creations consisting of a series of overdubs, processed with auto-tuning, equalization, dynamic compression, and reverb sampled from an alien nation. For these recordings, there is no equivalent live performance to which the recording/reproduction can be compared for accuracy. The only reference is what the artist heard in the recording control room. If the important performance aspects of the playback system through which the art (the music and recording) was created can be reproduced in the home, then the consumer will hear an accurate reproduction of the music, as the artist intended. It is possible to achieve this if we adopt a science in the service of art philosophy towards audio recording and reproduction.
In reviewing the history of live-versus-recorded tests, most have been performed as elaborate sales and marketing demonstrations designed to fool listeners into believing that a product sounded much better and more accurate than it actually was. While live-versus-recorded tests have proven their merit as an effective marketing and sales tool, they have not yet proven themselves as a serious method for scientific experiments intended to advance our psychoacoustic understanding of music recording and reproduction.
The reason for this, I believe, is that live-versus-recorded tests do not adequately control important listening test nuisance variables, a prerequisite for accurate, reliable and scientifically valid results. It is not entirely coincidental that (to my knowledge) none of the live-versus-recorded tests to date have produced a single scientific publication or new psychoacoustic knowledge.
Hopefully, you now understand why I don’t conduct live-versus-recorded loudspeaker listening tests.
[1] Harvith, J., and Harvith, S., Edison, Musicians and the Phonograph: A Century in Retrospect, Greenwood Press, N.Y. (1987).
[2] Andre Millard, “Edison’s Tone Tests and the Ideal of Perfect Sound Reproduction,” from Lost and Found Sounds, NPR.
[3] Program for Edison Demonstration: http://www.nipperhead.com/old/tonetest04.htm
[4] Wharfedale History: http://www.wharfedale.co.uk/About/History/tabid/66/Default.aspx
[5] Acoustic Research: http://en.wikipedia.org/wiki/Acoustic_Research
[6] Edgar Villchur: http://edgarvillchur.com/
[7] Villchur, Edgar, “A Method of Testing Loudspeakers with Random Noise,” J. Audio Eng. Society, Vol. 10, Issue 4, pp. 306-309 (October 1962).
[8] Kissinger, John R., “The Development of the Simulated Live-vs-Recorded Test into a Design Tool,” presented at the 35th AES Convention, preprint 609 (October 1968).
[9] Olive, Sean E.; Schuck, Peter L.; Sally, Sharon L.; Bonneville, Marc E., “The Effects of Loudspeaker Placement on Listeners' Preference Ratings,” J. Audio Eng. Society, Vol. 42, Issue 9, pp. 651-669 (September 1994).