Digital System Wars, Part 3

      In Hi-Fi News (HFN), our colleague Keith Howard has been running a series of articles inquiring into the mysteries of upsampling. His keen intellect and probing queries make this series fascinating and thought-provoking, even if we sometimes happen to disagree with his thoughts or conclusions.
      In the latest (April 2002) installment of this series, however, there is a statement which could mislead many readers. More importantly, it promulgates an insidious half-truth, fostering a misbelief that has already infected the minds of many digital engineers, including those at Sony and Philips. So it's important to shed some light on this issue.
      The HFN article indicates that the actual resolution of a 16 bit system is not limited to 16 bits, but in fact is virtually infinite. It goes on to say that merely adding dither noise with a magnitude of 1 LSB will enable a 16 bit system to resolve a signal of interest down to any finer (higher bit) resolution, and it demonstrates this with a measurement of a sine wave signal successfully recovered by a 16 bit system, even though the sine wave signal is way down at the 24 bit level and thus is buried in the noise.
      These statements contain just enough truth for them to be attractive bait for digital engineers to swallow. But they are tragically misleading, because of what's missing. These statements leave out two crucial facts.
      First, the resolution enhancement of a 16 bit system to some better (higher bit) level of resolution is not achieved merely by adding appropriately scaled dither noise. The signal must also be averaged over many, many samples, for a long, long time, in order for this so-called resolution enhancement to work at all.
      Second, this long term averaging only works for a simple sine wave. It does not work for music. But the whole context of our work and hobby and writings involves music reproduction systems, not sine wave reproduction systems. So it is tragically misleading in this context to be espousing a technique that does not even work for music signals, but only works for a sine wave signal.
      A sine wave signal and a typical music signal are very different from each other. Indeed, they are virtually complete opposites. A sine wave is the simplest possible signal, and it stays the same forever, without ever changing. A typical music signal is very complex, and it is constantly changing, without ever staying the same for long.
      You might hear from some people that a music signal is essentially just a bunch of sine waves added together. This statement is actually patently false (although it is admittedly true that the concept of adding sine waves together to form a more complex signal does have limited utility as a crudely approximate engineering rule of thumb, provided that we keep foremost in mind that it is just a crude engineering approximation). The actual truth is that any transient or changing music signal in fact contains an infinitely dense packing of an infinite number of frequencies, and therefore cannot be truly represented by any finite grouping of sine waves. The ever changing nature of a typical music signal needs to be accurately represented in the time domain, moment by moment, so it can be presented to the human ear/brain, and this certainly cannot be done by any finite grouping of sine waves. Furthermore, the actual truth is that any true sine wave goes on unchanged forever, and therefore it cannot possibly change (or be turned on and off, or made louder and softer) in order to make a changing contribution to a music signal which is constantly changing, and whose constituent components must therefore also be constantly changing.
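      To make this point concrete, here is a minimal numerical sketch (our illustration, not from the HFN article), written in Python with numpy, using an assumed 48 kHz sample rate. It takes a 1 kHz sine wave that is switched on and then off again, and examines its spectrum: the mere act of turning the sine on and off smears energy across frequencies far away from 1 kHz.

    import numpy as np

    fs = 48000                            # assumed sample rate, in Hz
    t = np.arange(fs) / fs                # one second of time
    sine = np.sin(2 * np.pi * 1000 * t)   # an eternal-looking 1 kHz sine

    burst = sine.copy()
    burst[: fs // 4] = 0.0                # silent for the first quarter second
    burst[fs // 2:] = 0.0                 # switched off again at the half-second mark

    spectrum = np.abs(np.fft.rfft(burst))
    freqs = np.fft.rfftfreq(fs, d=1 / fs)

    # Measure how much spectral energy lies far away from the 1 kHz tone.
    far = np.abs(freqs - 1000) > 500
    fraction = (spectrum[far] ** 2).sum() / (spectrum ** 2).sum()
    print(f"fraction of energy more than 500 Hz away from the tone: {fraction:.6f}")

      The printed fraction is small but distinctly nonzero; make the gating more abrupt or the burst shorter (as musical transients are), and the smearing grows.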
      Let's start our exploration of this whole issue with the germ of truth contained in the HFN article's claim. Let's first study how a 16 bit system can indeed achieve higher resolution than 16 bits, when reproducing just a simple sine wave. You might encounter naysayers (at the BAS and elsewhere) who don't believe that resolution enhancement can work at all, and who pooh-pooh it as pure hype. But, as we have explained in previous IAR articles, resolution enhancement does indeed work, and can achieve genuine benefits, both measurably and audibly. However, we need to get a firm understanding and analytic grasp of how it works and what its limitations are, and of what goes wrong if its limitations are violated. That's what this article is all about, especially as it relates to today's debate regarding PCM vs. DSD/SACD.

Summing to Quiet Noise

      Imagine that we want to reproduce just one cycle of a sine wave. The Nyquist theorem says that we can achieve this by sampling just two data points per cycle (hence, in this limiting case, the sampling frequency can be just twice the sine wave frequency). To keep our example free of needless complications, suppose that the two sampling points occur at the positive and negative peaks of the sine wave. In accordance with the Nyquist theorem, the remainder of the sine wave shape can later be fleshed out automatically by the appropriate playback reconstruction filter (a brickwall lowpass filter, whose time domain impulse response has the familiar sin(x)/x shape). Now let's focus our attention on just the positive half cycle, so we only have to pay attention to one sample point (whatever we discover will then apply equally well to the negative half cycle and its sample point at the negative peak).
      Focusing our attention like this makes clear that our job is supremely simple. All we have to do, in order to correctly reproduce the entire positive half cycle of the sine wave, is to discover and determine one simple number. This one simple number will represent the amplitude of the sample point at the positive sine wave peak. Of course, that's what digital encoding is all about: determining and encoding the amplitude of the input waveform at the moment of the sampling point. So here we have a sine wave input, and all we have to do in order to accurately reproduce it is to determine the one number representing its amplitude at its positive peak.
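      Incidentally, that reconstruction step can itself be sketched numerically. What follows is a minimal Python illustration (ours, with assumed values; a real playback filter must use a finite, windowed approximation of this ideal): samples taken at alternating peaks are fleshed out by summing sin(x)/x (sinc) pulses, recovering the sine wave's value even between the samples.

    import numpy as np

    N = 200                               # assumed number of samples in our run
    samples = (-1.0) ** np.arange(N)      # alternating peaks: +1, -1, +1, ...

    def reconstruct(t):
        """Ideal reconstruction: each sample contributes one sinc pulse.
        Time t is measured in sample periods."""
        n = np.arange(N)
        return np.sum(samples * np.sinc(t - n))

    # Evaluate between the samples, near the middle of the run, and compare
    # against the true sine wave value cos(pi * t) that an infinitely long
    # ideal filter would deliver exactly.
    for t in (100.0, 100.25, 100.5):
        print(f"t={t}: reconstructed {reconstruct(t):+.4f}, true {np.cos(np.pi * t):+.4f}")

      With only a finite run of samples the match is approximate near the run's edges, which is why real reconstruction filters must trade off length against accuracy.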
      Next, let's suppose that this sine wave input has some noise added to it. The amount of noise is not important here; it could be a little noise, or so much noise that the noise amplitude is higher (louder) than the sine wave amplitude, whereby the sine wave is actually buried by the noise. Also, the source of the noise is not important for our discussion here; it could be random input noise, purposely added dither noise, quantization error noise, etc.
      The effect of this noise, added to the sine wave, would be to make the sample point at the positive peak move unpredictably up or down from its correct value that it should have to accurately represent just the sine wave without the added noise. Thus, we could no longer discover and determine the correct amplitude of this sample point for reproducing just the sine wave (without the added noise). And, since the whole sine wave is reconstructed from this one sample point, the whole sine wave could no longer be accurately reproduced.
      Clearly, this added noise is undesirable. Is there any way we can quiet this noise, so that we can accurately determine the correct value of that sample point at the positive peak (its value without the added noise), and thereby accurately reproduce the whole sine wave itself, even though it is partially obscured by noise? Yes.
      How? In a word, averaging. How and why can averaging quiet noise?
      Random noise varies randomly in amplitude and in polarity. If you look at noise at a given instant, it might be positive or negative in polarity, and it might be positive or negative by a small amount or by a larger amount. Thus, if random noise is added to the sine wave signal we seek, then whatever value the noise randomly happens to have, at the instant of the sampling point at the sine wave's positive peak, will be added to the correct amplitude of that sampling point. When we evaluate the amplitude of that sampling point, the value we perceive will be higher or lower than the correct value for accurately reproducing the sine wave itself, and it will be higher or lower by whatever the random noise amplitude happens to be at that instant (if the noise amplitude happens to be negative at that instant, then the value we perceive will be lower than it should be).
      Now, that input sine wave, which we are trying to decipher and reproduce, keeps repeating itself over and over without change (that's what sine waves do, and indeed do by definition, otherwise they wouldn't even be sine waves in the first place). Imagine then that we look not at just one of its positive peaks, but instead at many of its positive peaks. Or to put it another way, imagine that we look at the same positive peak over and over and over and over, as it keeps uniformly repeating itself through time.
      Since random noise has been added to the sine wave, some of the times we look, at a sample taken at a positive peak, that sample will be higher in amplitude than it should be, and some of the times lower. Sometimes it will be off (higher or lower) by just a little bit, and sometimes it will be off by a lot. All this variation occurs unpredictably and randomly.
      If we were to look at a movie of the sample points taken at the sine wave's positive peaks, we would see their amplitude jiggling up and down, in a random manner, as determined by the random noise that is being added to the correct value of the positive peak of just the true sine wave itself.
      What does the waveform of noise look like? Simply speaking, there are lots of small spikes or peaks (both positive and negative in direction and polarity), and generally fewer large spikes or peaks, with zero crossings occurring at irregular intervals.
      Since the noise itself is random, statistics tells us that over the long run of time it will spend as much time and energy in positive territory as in negative territory. In other words, for every small positive peak of noise there will in the long run be an equal and opposite small negative peak of noise, and for every large positive peak of noise there will be an equal and opposite large negative peak of noise (or at least an equal accumulation of somewhat smaller negative peaks).
      Now, if we could somehow magically get positive noise energy to cancel out negative noise energy over the long run, we'd be left with zero noise energy, since there's an equal amount of positive and negative random noise energy in the long run. In other words, we'd succeed in canceling out and quieting the random noise added to the sine wave. Which means we'd be left with only the sine wave itself, and that's precisely the signal we want to discern and recover, without the added noise. By canceling out and quieting the added noise, we can accurately reproduce just the sine wave signal itself, even if it was buried in that added noise.
      It's almost magic, the way that the unwanted noise just cancels itself out over the long run, and leaves only the desired sine wave signal behind, as though the noise had never even been there.
      So, what could be the technique for somehow magically getting positive noise energy to cancel negative noise energy over the long run? Simple summation or addition -- which is also the first step you follow for doing averaging.
      Let's forget about the sine wave signal for a minute, and concentrate on how we get summation to cancel out the noise itself. Suppose we gather in a million samples of a random noise signal. Some of those sample amplitudes will be positive by a little, some positive by a lot, some negative by a little, some negative by a lot. All we need to do is add up the amplitude values of those million samples. Statistically, we know that their sum will be tiny compared to the scale of a million samples, since random noise has equal positive and negative energy over the long run, and a million samples is a close approximation of a long run. (The sum itself still wanders somewhat, but that wandering grows only as the square root of the number of samples, so when we later divide by the number of samples to form an average, the leftover noise shrinks toward zero.)
      We do have to sum many, many samples, to look at the random noise in the long run, if we want to quiet the noise significantly. Why? Because our noise canceling summation technique relies on the fact that random noise has equal positive and negative energy in the long run. If we want to take advantage of the fact that random noise tends to do something in the long run, then we have to keep our end of the bargain and look at it over the long run, which means many, many samples. If we only gather in and sum a short run of random noise, then we won't get much quieting of noise, because in the short run random noise tends NOT to have equal positive and negative energy.
      Consider a simplified example. Suppose the noise varies randomly from an amplitude of +5 to an amplitude of -5, in simple integer values. Consider the limiting case in which we only look at one sample of this noise. There is very little chance that we will catch this noise at the precise moment it is at zero, so our sum will likely be some magnitude value of noise higher than zero. Consider next the case in which we look at two samples. One sample might be +4 and the other -2. These numbers cancel each other somewhat when we sum them, but not totally, so we do achieve some quieting of the noise, but not much.
      You might wonder why the second sample is -2 rather than -4, since the first sample was +4. That's precisely because the noise is random, rather than a regular and uniform and predictable waveform. Random noise tends in the short run NOT to be equally positive and negative precisely because, in its randomness, a positive amplitude of one given value is usually followed by a negative amplitude of some different value (if the negative value were always the same as the previous positive value, then the waveform would be a uniform, regular or repeating signal, not noise). Indeed, with random noise, if we only took two samples they might well both be positive values (or both negative), whereby our sum would actually generate worse noise rather than canceling noise.
      Next, consider a case in which we look at ten samples rather than two. In our example, that's like picking a random number between +5 and -5, and doing it ten times. Statistically, it's much more likely that this sequence of ten random numbers will be pretty equally balanced between positive and negative numbers, both in polarity and magnitude, than was the sequence of just two random numbers when we just looked at two samples. Then, if we look at 100 random samples instead of ten, it's still more likely that that balancing will be more exact. When that balancing is more exact, then the numbers cancel each other out better when summed, and so their sum gets closer to zero.
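      Here is a minimal Python sketch (our illustration, with a fixed random seed so the run is repeatable) of exactly this example: random integer noise between -5 and +5, summed over runs of different lengths. A run of one or two samples cancels poorly; longer runs cancel better and better, and the average heads toward zero.

    import numpy as np

    rng = np.random.default_rng(0)        # fixed seed so the sketch is repeatable

    for n in (1, 2, 10, 100, 1_000_000):
        noise = rng.integers(-5, 6, size=n)   # random integers from -5 to +5
        print(f"{n:>9} samples: sum = {noise.sum():>6}, average = {noise.sum() / n:+.5f}")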
      One way of getting an intuitive feel for this is to imagine that one of our early samples is a big spike of noise, which happens only rarely. Say this early big spike is in a positive polarity. We will then have to continue gathering noise samples for a long, long time before we finally see that rare big spike happen again, this time hopefully in a negative polarity, to effectively cancel out the positive polarity energy previously contributed by that early big positive spike, and thus to finally bring our sum in the long run closer to the zero noise level we seek.
      Indeed, the more random noise samples we include in our summation, the closer their average will likely converge to zero. The random noise numbers tend to cancel each other out in the long run, and so the longer the run, the more exactly they cancel each other out and the closer to zero is our resulting average. Of course, the closer we can get the average of our noise samples to converge to zero, the more effectively we have quieted the noise. The quieting follows a square root law: averaging a run of samples quiets random noise in proportion to the square root of the run's length, so every fourfold increase in samples buys only a halving (6 dB, or one bit's worth) of the noise. If you want to get truly dramatic quieting of noise, you might therefore need to look at many thousands of samples (perhaps even a million) and sum them.
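      A short Python sketch (ours, with assumed run lengths) makes this square root law visible: for each run length we repeat the averaging experiment many times, and measure the typical size of the noise that survives the averaging.

    import numpy as np

    rng = np.random.default_rng(1)
    trials = 2000                          # repeat each experiment many times

    for n in (100, 400, 1600, 6400):
        # Average n noise samples per trial, then measure the typical (rms)
        # size of the residual noise left over after the averaging.
        residual = rng.uniform(-5, 5, size=(trials, n)).mean(axis=1)
        print(f"n={n:>5}: rms residual noise = {residual.std():.4f}")

    # Each fourfold increase in n halves the residual. At 6 dB per bit, going
    # from 16 bit to 24 bit resolution this way would cost roughly 4**8, or
    # about 65,536, samples of pure averaging.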
      The moral or lesson is clear. We can indeed perform magic and get rid of random noise -- but only by gathering in many, many samples of the noise and adding them up to give us their sum. This lesson will come back to haunt us in short order.

Averaging to Discern Input Signal

      Now let's return to looking at our original situation, which contained the desired sine wave signal plus unwanted added noise. Again, to keep things uncomplicated, we'll discuss just our attempt to accurately discern the correct value for a sample at the positive peak of the sine wave. Imagine that the correct value for the positive peak of the sine wave itself is +10 (of course, we don't know this yet, because the correct value is hidden by the added noise). Suppose again that the added random noise fluctuates between +5 and -5. This means that the positive peak of the sine wave signal with the added noise will randomly vary in value between +5 and +15. In other words, the signal plus noise that we see will fluctuate randomly up to 5 units on either side of the correct (but so far invisible to us) value of 10 that represents the positive peak of the desired sine wave itself without the unwanted added noise. For example, when we look at the signal plus noise the first time, its value might be +13 (representing a momentary random noise value of +3 added to the correct sine wave peak value of +10), while the next time we look the value might be +8 (representing a momentary random noise value of -2 added to the same correct sine wave peak value of +10).
      How can we quiet the added noise, and get to see just the sine wave signal itself? How can we look at a series of signal plus noise values like +13 and +8, and from them figure out that the correct value for just the sine wave peak itself is +10? We've already learned that, to quiet the noise, all we have to do is look at many, many samples and add them together, so that the random noise can cancel itself over the long run. So that's how we can quiet the noise here as well. Then we only need to make one further minor adjustment, to accurately discover the correct value of the sine wave's positive peak that was previously hidden among the noise that we have now quieted by summation.
      First, we look at many, many samples of the sine wave signal plus noise (for the sake of our discussion here, all these samples will be timed to be taken at the instants when the sine wave's positive peak is supposed to occur). Gathering these many, many samples together and simply summing them gives the random noise its chance to cancel itself out to almost zero, as it tends to do in the long run. The random noise might be adding a value of +3 (to the correct but hidden sine wave peak) at one sample, and -2 at the next sample, giving us values of +13 and +8, respectively, to add to our overall sum. In the long run, the summed contribution of the random noise tends to converge toward zero. Thus, in the long run, the sum we obtain by simply adding all the samples of sine wave plus noise tends to converge toward the sum we would obtain by adding together only the samples of sine wave positive peaks without noise, as though the noise had never even been there in the first place. In other words, the sum of +13 plus +8, and so on, converges toward the sum of +10 plus +10 plus +10, and so on. It becomes obvious that the true correct value of the positive sine wave peak, previously hidden by the added noise, is +10. To obtain this +10 value directly by calculation, all we must do is one simple further adjustment, namely dividing our sum by the number of samples taken. This turns our sum into an average.
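      The whole recovery can be sketched in a few lines of Python (our illustration, reusing the example numbers above: a true peak value of +10, hidden by random noise between -5 and +5):

    import numpy as np

    rng = np.random.default_rng(2)        # fixed seed so the sketch is repeatable
    true_peak = 10.0                      # the correct value, hidden from us by noise

    for n in (1, 10, 1_000, 100_000):
        # Each look at the positive peak sees the true value plus random noise.
        looks = true_peak + rng.uniform(-5, 5, size=n)
        estimate = looks.sum() / n        # sum the looks, then divide: an average
        print(f"{n:>7} looks at the peak: estimated value = {estimate:.4f}")

      With one look the estimate can be off by as much as 5; with a hundred thousand looks it homes in on +10 to within roughly a hundredth. That is resolution enhancement, and also its price: many, many looks at a signal that never changes.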
