SACD on music's highest frequencies. Here too the problem is made worse if the very high frequency musical note is a single transient that is non-repetitive, as opposed to a continuing and repeating high frequency sound, since a single non-repeating transient affords even less of a time window to gather data samples for the averaging process needed to reduce noise and distortion.
     What are some examples of single non-repetitive treble transients in real music?
     First, consider transient attack sounds, such as the instant that a triangle is first struck. This only occurs once, so a digital system must have good enough handling to track this single sharp curve in the road. DVD-A reproduces this triangle attack transient better than any other digital system, correctly and cleanly capturing the "t" sound at the beginning of the triangle's "tinggg". But DSD misses this "t" attack altogether (there's no repetition to average), and it thereby makes the triangle ting sound too soft, like a "dinggg" instead of the correct "tinggg". All DSD does capture well is the continuing and repeating after-ringing "inggg" sound of the triangle, after the attack transient has passed. Thus, from DVD-A, a triangle properly sounds like a "tinggg", but DSD-SACD softens this sound to a "dinggg".
     Second, consider the individual transient spikes of noise that constitute a vocal sibilant. Each individual spike is separated from and is different from its neighbors. That's what makes a vocal sibilant sound real, like a jet of escaping steam (as discussed in the article above). Because each spike is different, it is not an identical repetition of its predecessor, so each is a non-repetitive transient that must be tracked individually, just as each curve in a mountain road is different from the previous curve and must be tracked individually. DVD-A does this superbly, and thus reproduces very realistic sounding vocals. But DSD-SACD, relying on averaging, treats all these individual spikes as if they were the same repeating sound, thereby making the vocal sibilant less real, more homogenized, more generic (everyone's sibilant sounds different, depending for example on their tooth pattern, but DSD-SACD erases these individual differences). Then, to make matters even worse, DSD-SACD fails to track the sharp curves of each separated spike in the musical waveform of the vocal sibilant. The averaging function, on which DSD-SACD relies, instead negotiates an averaged path, which fails to reach the apexes of each sharp curve, the peaks of each transient spike, and which, yet worse, fills in the valleys of relative intertransient silence between peak spikes. This relative intertransient silence is all that keeps the individual spikes separated from each other, so DSD-SACD blends the spikes together into a different mushy waveform, which sounds like "shshsh" instead of the original "ssssss".
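     The peak-lowering and valley-filling effect described above is a general property of averaging, and it can be seen in a minimal sketch. The code below applies a plain moving average to a few isolated spikes. It is not a model of any DSD-SACD converter; the spike positions, spike heights, and window length are hypothetical values chosen only for illustration.

```python
# Illustrative only: a generic moving average applied to isolated spikes.
# This is NOT a model of any DSD/SACD converter; all values are assumptions.

def moving_average(samples, window):
    """Centered moving average; edges use only the samples that exist."""
    half = window // 2
    out = []
    for i in range(len(samples)):
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        out.append(sum(samples[lo:hi]) / (hi - lo))
    return out

# Spikes of differing heights separated by silence (hypothetical values).
spikes = [0.0] * 40
for position, height in [(5, 1.0), (13, 0.6), (21, 0.9), (29, 0.7)]:
    spikes[position] = height

smoothed = moving_average(spikes, window=9)

peak_before = max(spikes)
peak_after = max(smoothed)
valley_before = spikes[9]    # midpoint between the first two spikes
valley_after = smoothed[9]

print(f"peak:   {peak_before:.3f} -> {peak_after:.3f}")
print(f"valley: {valley_before:.3f} -> {valley_after:.3f}")
```

     In this toy case the averaged path never reaches the original peaks, and the silent gap between the first two spikes is no longer silent.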
     Third, consider the individual transient spikes of a cymbal's sounds. Any musician can tell you that cymbals (both orchestral and jazz) have very complex sounds, so much so that each brand of cymbal and indeed each individual cymbal has a very distinct sonic personality, as distinct as human voices. Scientists since Lord Rayleigh have known that metal discs have very complex vibrational modes, thus producing very complex sounds. And each cymbal can produce a wide range of complex sounds, whether shimmering or crashing, or being struck (by various objects with various shapes and materials, from wire brushes to wooden drumsticks). The complex sounds of a cymbal arithmetically add at every instant, and even interact at every instant, to produce a musical waveform that is constantly changing and so complex that statistically it would never repeat itself. This is the musical opposite of a sustained clarinet note, whose waveform is comparatively simple and essentially repeats itself cycle after cycle. Thus, a digital system that relies on averaging a repeating musical waveform cannot use this technique to capture the sound of cymbals accurately. That's why DVD-A, a PCM system, can capture the sound of cymbals so superbly, while DSD-SACD, relying on averaging, turns a cymbal sound into a mushy porridge of sound.
     Both wide bandwidth and deep bit resolution are needed to track and capture the complex sounds of cymbals. Today's 16/44 PCM digital gets close on the best recordings, but still can't quite cut the mustard. But DVD-A's 24/96 PCM has roughly twice the bandwidth (48 kHz versus about 22 kHz) and 256 times finer amplitude resolution (24 bits versus 16 bits), which is why it sounds superb in reproducing cymbals, from the most delicately subtle shimmerings to the biggest crashes.
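     The arithmetic behind these two figures can be checked with a short sketch. It assumes the usual textbook relations (theoretical bandwidth is half the sampling rate, and an n-bit word has 2^n quantization levels) and takes the CD sampling rate as 44,100 Hz; the function and variable names are ours, for illustration only.

```python
# Back-of-envelope comparison of 16/44.1 and 24/96 PCM.
# Assumes ideal Nyquist bandwidth and 2**bits quantization levels.

def nyquist_bandwidth_hz(sample_rate_hz):
    """Theoretical upper bandwidth limit for a given sampling rate."""
    return sample_rate_hz / 2

def quantization_levels(bits):
    """Number of distinct amplitude values an n-bit word can represent."""
    return 2 ** bits

cd_bandwidth = nyquist_bandwidth_hz(44_100)    # 22,050 Hz
dvd_a_bandwidth = nyquist_bandwidth_hz(96_000) # 48,000 Hz

bandwidth_ratio = dvd_a_bandwidth / cd_bandwidth
level_ratio = quantization_levels(24) // quantization_levels(16)

print(f"bandwidth: {cd_bandwidth:.0f} Hz vs {dvd_a_bandwidth:.0f} Hz "
      f"(ratio {bandwidth_ratio:.2f})")
print(f"resolution ratio: {level_ratio}")
```

     These are theoretical ceilings; real converters and their filters give somewhat less usable bandwidth.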
     The DSD-SACD system not only averages cymbal sounds into a mushy porridge, but also distorts them, evincing very audible signs of distress, including frazzled fragmentation and trashy smearing. We believe this happens because DSD-SACD cannot handle high slew rate music, and literally falls apart if asked to do so, crashing off the curvy road it can't track (see discussion in 1998 Master Guide). Cymbals impose a very high slew rate on the music waveform, since their sound has a lot of energy content at high frequencies.
     Just how much energy content, at how high a frequency, does a cymbal have? The answer will tell us what is required of a digital system, indeed the whole audio chain, if we are to reproduce the sound of a cymbal accurately.
     To find the answer, we did some research and made some measurements. First, we wanted to measure the bandwidth required for a delicate cymbal sound. So we gently kissed two cymbals together, producing a delicate shimmering (similar to the delicate cymbal kisses in the quiet section just before the coda of the last movement of Rachmaninoff's second piano concerto). We measured the transient of the initial gentle kiss and shimmer. Every musical transient has a spectral content that includes a spread of frequencies, not just a single frequency nor a single dominant frequency with discrete overtones. But we can look at the musical transient's spread of frequencies and note how the energy is spectrally distributed. This will tell us where a digital system has to have strong handling capability, in order to accurately track and reproduce a cymbal's sound. It's the equivalent of going out and measuring the sharpness of mountain road curves, to tell you how good you have to make a car's handling if you want that car to track that road without crashing off the edge.
     The measured spectral distribution of this gentle cymbal kiss looked like a mountain, with a definite single peak in the center showing where most of the energy resides, and a gradual falling off of energy on both sides, at lower and higher frequencies. Now comes the key question. What was the frequency of the center mountain peak, where most of the cymbal's energy resided from this gentle kiss? Any guesses?
     Wrong! Even we were flabbergasted by what we saw on the FFT analyzer. The sound of this gentlest kiss from a live cymbal has its energy peak at 40,000 Hz!! This means that most of the musical information describing and portraying the sound of a cymbal, even a gentle kiss, is located near 40,000 Hz. Therefore, a digital system (indeed the whole audio chain) must have a bandwidth capability beyond 40,000 Hz, if it is to have a hope of capturing just the main hunk of musical information portraying the sound of the cymbal. Note that 24/96 PCM digital, with a bandwidth of 48 kHz, does meet this requirement, but just barely.
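     For readers who want to see how a spectral peak is read off a transient, here is a minimal sketch. It does not reproduce our measurement: the signal is a synthetic 40 kHz burst with a bell-shaped envelope, and the 192 kHz sampling rate, 256-sample length, and envelope width are all assumptions chosen for illustration. A slow direct DFT stands in for the FFT analyzer.

```python
import cmath
import math

# All values below are assumptions for a synthetic test signal,
# not the parameters of the measurement described in the article.
SAMPLE_RATE_HZ = 192_000
N = 256
TONE_HZ = 40_000
SIGMA = 20.0  # envelope width, in samples

def synthetic_burst():
    """A short 40 kHz burst with a Gaussian (bell-shaped) envelope."""
    center = N / 2
    return [
        math.exp(-((n - center) ** 2) / (2 * SIGMA ** 2))
        * math.cos(2 * math.pi * TONE_HZ * n / SAMPLE_RATE_HZ)
        for n in range(N)
    ]

def dft_magnitudes(samples):
    """Magnitudes of bins 0..N/2 computed with a direct (slow) DFT."""
    count = len(samples)
    magnitudes = []
    for k in range(count // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * n / count)
            for n, x in enumerate(samples)
        )
        magnitudes.append(abs(acc))
    return magnitudes

mags = dft_magnitudes(synthetic_burst())
peak_bin = max(range(len(mags)), key=mags.__getitem__)
bin_width_hz = SAMPLE_RATE_HZ / N
peak_hz = peak_bin * bin_width_hz

print(f"bin width: {bin_width_hz:.0f} Hz, spectral peak near {peak_hz:.0f} Hz")
```

     The resulting spectrum has the same mountain shape described above: one central peak with skirts falling away on both sides. The peak can only be located to within one bin width, which is why a short transient limits frequency precision.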
     Our measuring microphone and FFT analyzer both had a bandwidth of 100 kHz, and they showed that the skirts of the mountain, representing the cymbal kiss' spectral content, still had significant energy at 100 kHz and surely beyond. Thus, if we wanted a recording system to capture some of the natural overtones of a cymbal sound (not just the fundamental at the spectral peak), it would have to have a bandwidth capability to at least 100 kHz. Note that only 24/192 PCM digital, with a true sampling rate (not oversampling) of 192 kHz and a consequent bandwidth of 96 kHz, comes close to meeting this requirement.
     Since most of the musical information describing the individual transient shimmering noises resides at 40,000 Hz, it is clear why 16/44 PCM (today's CD standard, having a mere 20,000 Hz bandwidth) cannot portray cymbals with great accuracy and the proper delicacy. It is clear why PCM digital systems sampling at 96 kHz and 192 kHz can capture the true sound of cymbals so much better than 16/44 PCM. It is also obvious that a digital system such as DSD-SACD, which relies heavily on averaging to re-constitute an averaged porridge version of music above 8000 Hz, cannot hope to accurately capture the individual, non-repetitive and hence non-averageable, cymbal shimmerings that reside primarily so far beyond 8000 Hz, indeed primarily at 40,000 Hz and even beyond.
     Furthermore, the cymbal kiss' spectral peak at 40,000 Hz not only means that most of the musical information required to accurately portray this sound resides in that vicinity, but also that most of the musical energy resides there. A large, indeed dominant, amount of musical energy at such a high frequency means that the musical waveform will have very steep sections, requiring a very high slew rate to keep up. A high slew rate capability is the strength of a PCM digital system, but is the worst Achilles' heel of a 1 bit system such as DSD-SACD. If a 1 bit system cannot keep up with a fast, steeply slewing music waveform, it will grossly distort, crashing off the path of the music waveform roadway with a grotesque crashing noise of frazzled distortion -- which is precisely what we hear from the Sony-Philips DSD-SACD on cymbal sounds.
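     The link between frequency and steepness can be made concrete with a small sketch. For a pure sine wave of amplitude A and frequency f, the steepest slope is 2πfA, so at equal amplitude the required slew rate rises in direct proportion to frequency. The code below compares a 2000 Hz tone with a 40,000 Hz tone; the unit amplitude is an assumption, and a real cymbal waveform is of course far more complex than a single sine.

```python
import math

def peak_slew_rate(amplitude, frequency_hz):
    """Maximum slope of amplitude*sin(2*pi*f*t), in amplitude units per second."""
    return 2 * math.pi * frequency_hz * amplitude

# Equal-amplitude tones (amplitude of 1.0 is an illustrative assumption).
slew_2k = peak_slew_rate(1.0, 2_000)
slew_40k = peak_slew_rate(1.0, 40_000)
ratio = slew_40k / slew_2k

print(f"2 kHz:  {slew_2k:,.0f} units/s")
print(f"40 kHz: {slew_40k:,.0f} units/s")
print(f"ratio:  {ratio:.1f}x")
```

     At the same level, the 40,000 Hz component demands twenty times the slew rate of the 2000 Hz component.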
     What about loud cymbal sounds? We wanted to learn why a digital system that could do reasonably well reproducing most of symphonic music would nevertheless fall apart with distortion when a cymbal crash comes along, as DSD-SACD does. So we measured the spectral energy profile of a loud cymbal crash, and compared it to the spectral energy of the other instruments in a symphony orchestra. The results of this measurement were equally shocking. So shocking that we wound up measuring the spectral energy of a cymbal crash compared not just to various other individual instruments of a symphony orchestra taken singly, but rather to the whole orchestra taken together, and the whole orchestra in full cry at that.
     We measured a full orchestra playing Stravinsky's Le Sacre du Printemps. To be conservative, we decided to assess the duller spectral balance you would typically hear live, so we measured from a typical distant concert hall seat (not from an up close or overhead miking position, which would have emphasized the higher frequencies yet more). There's a section of Le Sacre where the orchestra is blazing away in full cry, including various brass instruments that have prodigious energy output, including steep high frequency spikes. The spectral energy curve of the full orchestra blazing away here was flat up to about 2000 Hz, and then above that it sloped downward, gradually and evenly, up to the 20,000 Hz limit we were using on this particular measurement (so as to include a profile of lower frequencies too). Incidentally, this downward slope was expected, from this distant seat on the auditorium floor; it helps explain why orchestral recordings made with mikes overhead and/or up close often sound too bright and unlike the live concert-going sound, and why people often prefer speakers with downward sloping power responses for reproducing the typical close overhead miking of orchestral recordings.
     So here we had a spectral picture of the whole orchestra in full cry: a flat plateau up to 2000 Hz, followed by a valley for the giant region above 2000 Hz all the way up to 20,000 Hz.
     Then, in this section of Le Sacre, a cymbal crash joins in the orchestra melee (a few cymbal crashes actually, but it was enough to measure just the first one). What did this single cymbal crash do to the spectral energy curve of the full orchestra? It completely changed it. The single cymbal crash had so much energy that it overwhelmed the whole rest of the orchestra taken together. More significantly, the huge power output of the cymbal had a completely different spectral energy profile. The cymbal's sonic energy literally filled in the entire giant valley that the full orchestra had left between 2000 Hz and 20,000 Hz. The spectral energy content of a full orchestra with cymbal crash was now flat all the way to the 20,000 Hz limit of our measurement (and probably well beyond, since even a gentle cymbal kiss has its predominant energy centered around 40,000 Hz).
     This means that the crash of a single set of cymbals has more energy above 2000 Hz than the whole rest of the orchestra put together. It also means that, with flat energy content to 20,000 Hz and beyond, a cymbal crash has a steep slew rate far beyond that of the rest of the orchestra. Thus, the task of handling a full orchestra is easy for a digital system (or other audio component), compared to the task of handling a single cymbal crash. It's the same story again, as the smooth circular track vs. the real world mountain road, for proving the merit of a car's handling capabilities. The fact that some digital system might be capable of pretty well handling a full orchestra without cymbals, as DSD-SACD advocates might claim, doesn't begin to approach relevance to demonstrating the true overall merit of this digital system for music as a whole. That's because the full orchestra is such an easy piece of cake for a digital system, compared to the crash of a single cymbal set. This contrast is especially keen for a 1 bit system (e.g. DSD-SACD), since the difference in task difficulty here is primarily one of slew rate, and it is precisely slew rate which is the principal weakness of 1 bit digital systems.
     PCM digital systems would generally be relatively immune to this dramatic difference in spectral energy, because they can leap tall buildings in a single bound. And PCM systems with higher sampling rates and greater bit depths will naturally do a better job of handling the high sharp peaks implied by the cymbal's huge energy content at high frequencies, while still preserving the complex interplay of the myriad other subtle sounds being simultaneously emitted by the rest of the orchestra (not to mention the complex subtle sounds of the cymbals themselves).
     That's why the DVD-A 24/96 PCM system did such a superb job reproducing the cymbals in the orchestral recording they played. That's why, in contrast, all three Sony-Philips DSD-SACD demos butchered the sound of cymbals. Of course, if a recording engineer is dead set on using DSD, he could simply tell the orchestra (or the jazz combo) to play without cymbals. And then, if he's recording any singers, he could tell them to pretend they had a lisp, so their singing would sound like thinging, and then DSD could handle it.
     The new DVD-A digital standard has capabilities far beyond 24/96 audio. In fact, the principal idea of the DVD-A standard is to provide a flexible digital medium as an arena that allows for many different formats.
     This flexible concept is in sharp contrast to all other media standards where there is only one format, and everyone is locked into that format, from the record producer to you the customer. If you buy a CD player it can only play CDs conforming to essentially one format. The same limitation has been true for virtually all media, from VHS to video DVD, and even the Sony-Philips DSD-SACD is limited to one format on one medium. The record producer can issue software only in one limited format per medium, and you the consumer must buy a different player for each format, since each player is pretty much limited to playing only one format.
     But the engineers for DVD-A had a brilliant idea, looking forward to the future. Why not create a standard in which a number of different formats could be supported by one medium and one player? This way, each record company, each studio, each producer could create audio and/or video software in a format that best suited the program content and the studio's hardware mastering capabilities. And, this way, you the consumer would only have to buy one playback machine, which could bring to you a whole world of different formats. For example, on one machine you could play super quality (24/96) six channel surround sound, or other audio formats (including 24/192 two channel), or movies, or music videos, or instructional A/V programs.
     But how could the DVD-A engineers get a medium to support such a wide range of different formats? Actually, the tools are already pretty much in place. Video DVD already has the required bandwidth and storage capacity as a medium. So all that would be required would be to make existing video DVD players intelligent, so they could recognize and process different formats.
     And the tools for adding this intelligence already exist, indeed they already existed even in the humble early CD players. Firstly, a CD player is able to automatically read identifying information off each CD you insert, before it begins playing the program. So, using this same technology, it would be a piece of cake for a DVD-A player to identify the type of format contained on each DVD type disc you insert. Secondly, a CD player's internal computers already read the digital data off a CD in blocks, then take apart this data and rebuild it in various ways, to eventually produce the stream of digital data that is fed to the DAC to become analog music. So, using this same technology, it would be a piece of cake to make this internal computer capable of taking apart data and rebuilding it from a DVD disc not just in one fixed way, but rather in a choice of different ways, each corresponding to a different format.
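     The two steps just described (read an identifier off the disc, then rebuild the data in the way that identifier calls for) amount to a simple lookup and dispatch. The sketch below shows the idea only. The format tags, the disc layout as a dictionary, and the decoder functions are all hypothetical names invented for illustration; none of them comes from the DVD-A standard itself.

```python
# Hypothetical sketch of format identification and dispatch.
# Format tags, disc layout, and decoder names are invented for illustration
# and are NOT taken from the DVD-A specification.

def decode_pcm_24_96_surround(payload):
    """Stand-in for a six-channel 24/96 PCM decoding path."""
    return f"PCM 24/96 six-channel: {len(payload)} bytes"

def decode_pcm_24_192_stereo(payload):
    """Stand-in for a two-channel 24/192 PCM decoding path."""
    return f"PCM 24/192 two-channel: {len(payload)} bytes"

# One player, several formats: each tag maps to its own rebuilding routine.
DECODERS = {
    "PCM_24_96_6CH": decode_pcm_24_96_surround,
    "PCM_24_192_2CH": decode_pcm_24_192_stereo,
}

def play(disc):
    """Step 1: read the identifier. Step 2: hand the data to the matching decoder."""
    format_tag = disc["format_tag"]
    decoder = DECODERS.get(format_tag)
    if decoder is None:
        raise ValueError(f"unsupported format: {format_tag}")
    return decoder(disc["payload"])

print(play({"format_tag": "PCM_24_96_6CH", "payload": b"\x00" * 4}))
```

     Supporting another format in such a scheme would mean adding one more entry to the table, which is the flexibility the standard is aiming for.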
     Each different format supported by the DVD-A standard might be best suited to a different type of software. These various formats can accommodate masters from a wide variety of recording studios and mastering/mixing facilities, so these studios and facilities can use their existing equipment and still work within the DVD-A standard. The DVD-A standard does not force a studio or facility to buy expensive new equipment, nor to revamp their whole mixing and editing chain, nor to restrict their output to one format (although DSD-SACD does force all these requirements). With DVD-A, a single studio might choose to release software in a number of different formats, to suit different program material and/or different target markets for the same program material. All this flexibility and power, with minimal retooling investment, means that DVD-A should be very popular with studios and producers, the people and companies who create and package our home entertainment.
     You the consumer have similar flexibility with DVD-A. With one investment in one playback machine you have access to a wide variety of software in a wide variety of formats. All this flexibility and power, for just one investment, means that DVD-A should be very popular with consumers. In contrast, DSD-SACD also forces you to buy a new playback machine, but the only thing your investment will be good for is playing one format: audio SACDs in the sonically crippled DSD format (as well as your existing 16/44 CDs, but you already have a machine for that, so this aspect is a wasted investment).
     DVD as a medium has large enough storage capacity and wide enough bandwidth so that the DVD-A standard can afford to be generous, and accommodate all these different formats for encoding audio, multichannel, still video, moving video, etc. Indeed, DVD-A is so generous and flexible that it can even accommodate DSD encoded audio, should anyone besides Sony actually want to use this sonically inferior encoding scheme.
     The DVD-A system also has the advantage of considerable engineering and marketing clout. The system comes from Working Group 4 (dedicated to audio) of the DVD Consortium, which includes
