
Maintaining Quality in Digital Audio Chains

Tips and truisms about a subject that is often misunderstood

The following was excerpted from “Maintaining Audio Quality in the Broadcast and Netcast Facility.” In this segment, the authors deal with the many-faceted and often misunderstood subject of quality in digital audio chains.

In digital signal processing devices, the lowest number of bits per word necessary to achieve professional quality is 24 bits. There are several reasons for this.

Digital audio workstations need headroom to accommodate gain adjustments and mixing of several sources. Moreover, there are a number of common DSP operations (like infinite-impulse-response filtering) that substantially increase the digital noise floor, and 24 bits allows enough headroom to accommodate this without audibly losing quality. (This assumes that the designer is sophisticated enough to use appropriate measures to control noise when particularly difficult filters are used.) If floating-point arithmetic is used, the lowest acceptable word length for professional quality is 32 bits (24-bit mantissa and 8-bit exponent; sometimes called “single-precision”).
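The "24-bit mantissa" figure for single-precision floating point can be checked directly. A small sketch (using NumPy, our assumption, not anything the authors reference): IEEE-754 single precision stores 23 fraction bits, and the implicit leading 1 brings the effective significand to 24 bits, matching 24-bit fixed point.

```python
import numpy as np

# IEEE-754 single precision: 23 stored fraction bits plus the implicit
# leading 1 give a 24-bit significand -- the same resolution per sample
# as 24-bit fixed point, with an 8-bit exponent providing the headroom.
print(np.finfo(np.float32).nmant + 1)   # 24
print(np.finfo(np.float32).eps)         # 2**-23, the spacing at 1.0
```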

In digital distribution systems, 20-bit words (120 dB dynamic range) are usually adequate to represent the signal accurately. Twenty bits can retain the full quality of a 16-bit source even after as much as 24 dB attenuation by a mixer. There are almost no A/D converters that can achieve more than 20 bits of real accuracy, and many “24-bit” converters have accuracy considerably below the 20-bit level. “Marketing bits” in A/D converters are outrageously abused to deceive customers, and, if these A/D converters were consumer products, these bogus claims would be actionable by the Federal Trade Commission.
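The dynamic-range figures above follow from the standard formula for an ideal N-bit PCM channel, roughly 6 dB per bit (more precisely 6.02·N + 1.76 dB for a full-scale sine); the round 120 dB figure for 20 bits uses the 6 dB/bit shorthand. A small sketch:

```python
def pcm_dynamic_range_db(bits: int) -> float:
    # Ideal SNR of an N-bit quantizer driven by a full-scale sine wave.
    return 6.02 * bits + 1.76

for bits in (16, 20, 24):
    print(f"{bits} bits: {pcm_dynamic_range_db(bits):.1f} dB")
# 16 bits: 98.1 dB, 20 bits: 122.2 dB, 24 bits: 146.2 dB

# 24 dB of mixer attenuation costs about 24 / 6.02 = 4 bits of resolution,
# which is why 20-bit words can preserve a 16-bit source after a 24 dB cut.
```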

Sample rate controversy

There is considerable disagreement about the audible benefits (if any) of raising the sample rate above 44.1 kHz.

An extensive double-blind test using 554 trials showed that inserting a CD-quality A/D/A loop into the output of a high-resolution (SACD) player was undetectable at normal-to-loud listening levels by any of the subjects, on any of four playback systems. The noise of the CD-quality loop was audible only at very elevated levels.

[Photo caption] The tech area at KTBI(AM) 810 in Ephrata, Wash., one of three AM stations owned by American Christian Network. Running 50 kW daytime, it covers Spokane, more than 100 miles away.

Moreover, there has been at least one rigorous test comparing 48 kHz and 96 kHz sample rates. This test concluded that there is no audible difference between these two sample rates if the 48 kHz rate’s anti-aliasing filter is designed appropriately.

However, in 2016, a controversial “meta-analysis” of existing tests comparing high-resolution and CD-quality audio was published in the AES Journal.

According to the author, “Eighteen published experiments for which sufficient data could be obtained were included, providing a meta-analysis that combined over 400 participants in more than 12,500 trials.

“Results showed a small but statistically significant ability of test subjects to discriminate high resolution content, and this effect increased dramatically when test subjects received extensive training. This result was verified by a sensitivity analysis exploring different choices for the chosen studies and different analysis approaches.

“Potential biases in studies, effect of test methodology, experimental design, and choice of stimuli were also investigated. The overall conclusion is that the perceived fidelity of an audio recording and playback chain can be affected by operating beyond conventional resolution.”

Assuming perfect hardware, it can be shown that this debate comes down entirely to the audibility of a given anti-aliasing filter design, as is discussed below.

Well before the 2016 meta-analysis was published, the record industry made a marketing-driven attempt to move the consumer standard from 44.1 kHz to a higher sampling frequency via DVD-A and SACD, neither of which succeeded in the mass marketplace. The industry is trying again with Blu-ray audio, and it remains to be seen whether it will be more successful than it was with DVD-A or SACD.

FM stereo

Regardless of whether scientifically rigorous testing eventually proves an audible benefit, sampling rates higher than 44.1 kHz offer no advantage in FM stereo: the effective sampling rate of FM stereo is 38 kHz, so the signal must eventually be lowpass-filtered to 17 kHz or less to prevent aliasing. Higher rates can be beneficial in DAB, which typically has 20 kHz audio bandwidth, but offer no benefit at all in AM, whose audio bandwidth is no greater than 10 kHz in any country and is often 4.5 kHz.

Some A/D converters have built-in soft clippers that start to act when the input signal is 3–6 dB below full scale. While these can be useful in mastering work, they have no place in transferring previously mastered recordings (like commercial CDs). If the soft clipper in an A/D converter cannot be defeated, that A/D should not be used for transfer work.

Dither

Dither is random noise that is added to the signal at approximately the level of the least significant bit. It should be added to the analog signal before the A/D converter, and to any digital signal before its word length is shortened. Its purpose is to linearize the digital system by changing what is, in essence, “crossover distortion” into audibly innocuous random noise.

Without dither, any signal falling below the level of the least significant bit will disappear altogether. Dither will randomly move this signal through the threshold of the LSB, rendering it audible (though noisy). Whenever any DSP operation is performed on the signal (particularly decreasing gain), the resulting signal must be re-dithered before the word length is truncated back to the length of the input words.
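This behavior is easy to demonstrate. In the sketch below (Python with NumPy, our assumption, not the authors' code), a sine tone whose peak sits below one 16-bit LSB vanishes entirely under plain rounding, but survives, noisy yet recoverable, when triangular-probability-function dither is added first:

```python
import numpy as np

rng = np.random.default_rng(1)
n = np.arange(48000)
tone = np.sin(2 * np.pi * 440 * n / 48000)
x = (0.4 / 32768) * tone               # peak is 0.4 LSB in a 16-bit system

def quantize16(sig, dither=False):
    s = sig * 32768
    if dither:
        # TPF dither: the sum of two independent uniform noises, +/-1 LSB peak
        s = s + rng.uniform(-0.5, 0.5, len(s)) + rng.uniform(-0.5, 0.5, len(s))
    return np.round(s) / 32768

hard = quantize16(x)                   # every sample rounds to exactly zero
soft = quantize16(x, dither=True)      # the tone is preserved under the noise

# Correlating against the tone recovers its amplitude from the dithered signal.
print(np.max(np.abs(hard)))            # 0.0 -- the tone has disappeared
print(2 * np.mean(soft * tone) * 32768)  # ~0.4 LSB -- still present
```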

Ordinarily, correct dither is added in the A/D stage of any competent commercial product performing the conversion. However, some products allow the user to turn the dither on or off when truncating the length of a word in the digital domain. If the user chooses to omit adding dither, this should be because the signal in question already contained enough dither noise to make it unnecessary to add more.

Many computer software volume controls do not add dither when they attenuate the signal, thereby introducing low-level truncation distortion. It is wise to bypass computer volume controls wherever possible, and if this is not possible, to maintain unity gain through the volume control. Microsoft Windows Media Player and Adobe Flash Player should be operated at 100% (0 dBFS) at all times, with level control done either at the amplifier volume control or the console fader.

In the absence of “noise shaping,” the spectrum of the usual “triangular-probability-function (TPF)” dither is white (that is, each equal-width frequency increment contains the same energy). However, noise shaping can change this noise spectrum to concentrate most of the dither energy into the frequency range where the ear is least sensitive. In practice, this means reducing the energy around 4 kHz and raising it above 9 kHz. Doing this can increase the effective resolution of a 16-bit system to almost 19 bits in the crucial midrange area, and is standard in CD mastering. There are many proprietary curves used by various manufacturers for noise shaping, and each has a slightly different sound.

It has been shown that passing noise-shaped dither through most classes of signal processing and/or a D/A converter with non-monotonic behavior will destroy the advantages of the noise shaping by “filling in” the frequency regions where the original noise-shaped signal had little energy. The result is usually poorer than if no noise shaping had been used.

For this reason, Orban has adopted a conservative approach to noise shaping, recommending so-called “first-order highpass” noise shaping and implementing this in Orban products that allow dither to be added to their digital output streams. First-order highpass noise shaping provides a substantial improvement in resolution over simple white TPF dither, but its total noise power is only 3 dB higher than white TPF dither. Therefore, if it is passed through additional signal processing and/or an imperfect D/A converter, there will be little noise penalty by comparison to more aggressive noise shaping schemes.
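A minimal sketch of first-order error-feedback noise shaping (assumptions: Python/NumPy and a 16-bit target; this is the generic textbook structure, not Orban's implementation): the previous quantization error is subtracted from the next sample before rounding, which gives the total added noise a first-difference (1 − z⁻¹) highpass shape.

```python
import numpy as np

rng = np.random.default_rng(2)

def shape_to_16bit(x):
    # First-order error feedback: subtract the previous quantization error,
    # add TPF dither, then round. Total output noise = first difference of
    # the error sequence, i.e. highpass-shaped by (1 - z^-1).
    scale = 2.0 ** 15
    y = np.empty_like(x)
    e = 0.0
    for i, s in enumerate(x):
        v = s * scale - e
        d = rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)  # TPF dither
        q = np.round(v + d)
        e = q - v                    # error fed back into the next sample
        y[i] = q / scale
    return y

# Quantize digital silence: the added noise is pushed toward high frequencies.
noise = shape_to_16bit(np.zeros(8192))
spec = np.abs(np.fft.rfft(noise)) ** 2
print(spec[:1024].sum() < spec[-1024:].sum())   # True: less energy at LF
```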

One of the great benefits of digitizing the signal path in broadcasting is this: Once in digital form, the signal is far less subject to subtle degradation than it would be in analog form, although in fixed-point form it is still subject to clipping. Short of being clipped or becoming entirely un-decodable, the worst that can happen to the signal is deterioration of noise-shaped dither and/or added jitter.

Jitter

Jitter is a time-base error. The only jitter that cannot be removed from the signal is jitter added in the original analog-to-digital conversion process. All subsequent jitter can be completely removed in a sort of “time-base correction” operation, accurately recovering the original signal. The only limitation is the performance of the “time-base correction” circuitry, which requires sophisticated design to reduce added jitter below audibility. This “time-base correction” usually occurs in the digital input receiver, although further stages can be used downstream.

Sample rate converters can introduce jitter in the digital domain because they resample the signal, much like A/D converters. Maintaining lowest jitter in a system requires synchronizing all devices in the audio chain to a common word clock or AES11 signal. This eliminates the need to perform cascaded sample rate conversions on the signals flowing through the facility. Good word clock generators have very low jitter (also known as “phase noise”) and allow the cascaded devices to perform at their best.

Busting the myths

There are several pervasive myths regarding digital audio.

One myth is that long reconstruction filters smear the transient response of digital audio, and that there is thus an advantage to using a reconstruction filter with a short impulse response, even if this means rolling off frequencies above 10 kHz. Several commercial high-end D-to-A converters operate on exactly this mistaken assumption. This is one area of digital audio where intuition is particularly deceptive.

The sole purpose of a reconstruction filter is to fill in the missing pieces between the digital samples. These days, symmetrical finite-impulse-response filters are typically used for this task because they have no phase distortion. The output of such a filter is a weighted sum of the digital samples symmetrically surrounding the point being reconstructed. The more samples that are used, the better and more accurate the result, even if this means that the filter is very long.

It’s easiest to justify this assertion in the frequency domain. Provided that the frequencies in the passband and the transition region of the original anti-aliasing filter are entirely within the passband of the reconstruction filter, then the reconstruction filter will act only as a delay line and will pass the audio without distortion. Of course, all practical reconstruction filters have slight frequency response ripples in their passbands, and these can affect the sound by making the amplitude response (but not the phase response) of the “delay line” slightly imperfect. But typically, these ripples are on the order of a few thousandths of a dB in high-quality equipment and are very unlikely to be audible.

The authors have proved this experimentally by simulating such a system and subtracting the output of the reconstruction filter from its input to determine what errors the reconstruction filter introduces. Of course, you have to add a time delay to the input to compensate for the reconstruction filter’s delay. The source signal was white noise, applied to a very sharp filter that band-limited it so that its energy was entirely within the passband of the reconstruction filter. We used a very high-quality linear-phase FIR reconstruction filter and ran the simulation in double-precision floating-point arithmetic. The resulting error signal was a minimum of 125 dB below full scale on a sample-by-sample basis, which was comparable to the stopband depth in the experimental reconstruction filter.
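The experiment can be sketched along these lines (our reconstruction using NumPy/SciPy, with shorter filters than the authors' −125 dB setup, so the residual here is merely "far below audibility" rather than matching their figure): band-limit noise to well inside the lowpass filter's passband, pass it through, and compare against the delay-compensated input.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)

# White noise band-limited to well inside the reconstruction filter's passband.
bl = signal.firwin(2001, 0.30, window=("kaiser", 14))
x = signal.lfilter(bl, 1.0, rng.standard_normal(100000))[4000:]

# Long linear-phase FIR lowpass standing in for the reconstruction filter.
h = signal.firwin(801, 0.45, window=("kaiser", 12))
y = signal.lfilter(h, 1.0, x)

delay = (len(h) - 1) // 2            # exact group delay of a linear-phase FIR
err = (y[delay:] - x[:len(x) - delay])[2000:-2000]   # ignore edge transients
err_db = 20 * np.log10(np.max(np.abs(err)) / np.max(np.abs(x)))
print(err_db)                        # far below -80 dB: the filter is a delay line
```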

We therefore have the paradoxical result that, in a properly designed digital audio system, the frequency response of the system and its sound is determined by the anti-aliasing filter and not by the reconstruction filter. Provided that they are realized with high-precision arithmetic, longer reconstruction filters are always better.

This means that a rigorous way to test the assumption that high sample rates sound better than low sample rates is to set up a high-sample rate system. Then, without changing any other variable, introduce a filter in the digital domain with the same frequency response as a high-quality anti-aliasing filter that would be required for the lower sample rate. If you cannot detect the presence of this filter in a double-blind test, then you have just proved that the higher sample rate has no intrinsic audible advantage, because you can always make the reconstruction filter audibly transparent.

[Photo caption] The tech rack at KTWO(AM) 1030 in Casper, Wyo., a Townsquare Media station. With 50 kW omnidirectional daytime and 50 kW directional at night, it covers 75% of the state of Wyoming.

Another myth is that digital audio cannot resolve time differences smaller than one sample period and therefore damages the stereo image. People who believe this like to imagine an analog step moving in time between two sample points. They argue that there will be no change in the output of the A/D converter until the step crosses one sample point and therefore the time resolution is limited to one sample.

The problem with this argument is that there is no such thing as an infinite-risetime step function in the digital domain. To be properly represented, such a function has to first be applied to an anti-aliasing filter. This filter turns the step into a band-limited ramp, which typically has equal pre- and post-ringing. This ramp can be moved far less than one sample period in time and still cause the sample points to change value.
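This is easy to demonstrate numerically. In the sketch below, SciPy's sine integral (`sici`) supplies exact samples of an ideal-lowpass-filtered step (an idealization standing in for a practical anti-aliasing filter, our assumption): shifting the edge by one hundredth of a sample visibly changes the sample values.

```python
import numpy as np
from scipy.special import sici

def bandlimited_step(n_samples, shift):
    # Integer-time samples of a unit step passed through an ideal lowpass
    # filter, with the edge sitting `shift` samples after the midpoint.
    # The result is the sine-integral form Si(pi*t)/pi + 0.5.
    t = (np.arange(n_samples) - n_samples // 2 - shift) * np.pi
    return 0.5 + sici(t)[0] / np.pi

a = bandlimited_step(64, 0.0)
b = bandlimited_step(64, 0.01)    # edge moved by 1/100 of a sample period
print(np.max(np.abs(a - b)))      # small but clearly nonzero
```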

In fact, assuming no jitter and correct dithering, the time resolution of a digital system is the same as an analog system having the same bandwidth and noise floor. Ultimately, the time resolution is determined by the sampling frequency and by the noise floor of the system. As you try to get finer and finer resolution, the measurements will become more and more uncertain due to dither noise. Finally, you will get to the point where noise obscures the signal and your measurement cannot get any finer. However, this point is orders of magnitude smaller in time than one sample period and is the same as in an analog system with the same bandwidth.

A final myth is that upsampling digital audio to a higher sample frequency will increase audio quality or resolution. In fact, the original recording at the original sample rate contains all of the information obtainable from that recording. The only thing that raising the sample frequency does is to add ultrasonic images of the original audio around the new sample frequency. In any correctly designed sample rate converter, these are reduced (but never entirely eliminated) by a filter following the upsampler. People who claim to hear differences between “upsampled” audio and the original are either imagining things or hearing coloration caused by the added image frequencies or the frequency response of the upsampler’s filter. They are not hearing a more accurate reproduction of the original recording.

This also applies to the sample rate conversion that often occurs in a digital facility. It is quite possible to create a sample rate converter whose filters are poor enough to make images audible. One should test any sample rate converter, hardware or software, intended for professional audio use by converting the highest-frequency sinewave within the passband of the audio being converted, which is typically about 0.45 times the sample frequency.

Observe the output of the SRC on a spectrum analyzer or with software containing an FFT analyzer (like Adobe Audition). In a professional-quality SRC, images will be at least 90 dB below the desired signal, and in SRCs designed to accommodate long word lengths (like 24-bit), images will often be –120 dB or lower, assuming a 24-bit path (which can represent low-level energy down to –144 dBFS). Taking full advantage of high-performance sample rate conversion is another reason to use 24-bit audio for production and to reduce the bit depth (if necessary for applications like burning audio CDs) only as the final step, using appropriate dither.
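Here is a sketch of that test in software (assuming Python with SciPy's polyphase resampler rather than the analyzer tools mentioned above; `resample_poly`'s default anti-imaging filter is far less steep than a mastering-grade SRC, so treat any image levels it yields as illustrative only):

```python
import numpy as np
from scipy import signal

fs_in, fs_out = 44100, 48000
f_test = 0.45 * fs_in                       # 19,845 Hz: top of the passband
t = np.arange(fs_in) / fs_in                # one second of audio
x = np.sin(2 * np.pi * f_test * t)

y = signal.resample_poly(x, fs_out, fs_in)  # 160/147 polyphase conversion

# Windowed FFT of the converted output: the tone should dominate, with any
# conversion images well below it.
spec = np.abs(np.fft.rfft(y * np.hanning(len(y))))
spec_db = 20 * np.log10(spec / spec.max() + 1e-12)
peak_hz = np.argmax(spec) * fs_out / len(y)
print(peak_hz)                              # ~19845 Hz: the tone survived
```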

A good reference on sample rate conversion performance can be found at http://src.infinitewave.ca/.

Less is more!

And finally, some truisms regarding loudness and quality: Every radio is equipped with a volume control, and every listener knows how to use it. If the listener has access to the volume control, he or she will adjust it to his or her preferred loudness. After said listener does this, the only thing left distinguishing the “sound” of the radio station is its texture, which will be either clean or degraded, depending on the source quality and the audio processing.

Any program director who boasts of his station’s $20,000 worth of “enhancement” equipment should be first taken to a physician who can clean the wax from his ears, then forced to swear that he is not under the influence of any suspicious substances, and finally placed gently but firmly in front of a high-quality monitor system for a demonstration of the degradation that $20,000 worth of “enhancement” causes! Always remember that less is more.

