How to Process Audio for Streaming, Properly

There are several fundamental differences in configuring audio processing for streaming applications as opposed to those for over-the-air applications

Those who started in our field fewer than ten years ago probably don’t remember a time when there wasn’t a streamed version of the local radio station available online.

When streaming media first became of interest, most of us involved had no prior experience in that endeavor. Today, we have the benefit of about 15 years of pioneering work done by other engineers and, of course, processing manufacturers. So what have they learned, and what is now considered the appropriate way to process audio for streaming media?

There are several fundamental differences in configuring audio processing for streaming applications as opposed to those for over-the-air applications. For FM over-the-air processors, we need to consider:

  • The FM system is completely analog and linear (at least in the sense that there are no lossy codecs used).
  • The FM system uses emphasis (and thus pre-emphasis limiting).
  • The 15 kHz audio bandwidth for FM means that a 32 kHz sample rate is adequate for A/D conversions.
  • The noise floor obtainable in most FM receivers limits the overall system to a resolution equivalent to about a 12-bit word length in the digital world.

Contrasted with streaming media (a quick dynamic-range calculation follows this list):

  • Audio bandwidth is not limited to 15 kHz, and so sample rates may be far higher.
  • 16-bit word length is common.
  • There is no need for emphasis limiting, as there is in FM.
  • Lossy codecs are used to limit the overall data rate.
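
As a quick illustration of those word-length figures, the ideal quantization signal-to-noise ratio of an n-bit linear PCM system follows the familiar rule of thumb SNR ≈ 6.02n + 1.76 dB. A minimal Python sketch:

    def dynamic_range_db(bits):
        # Ideal quantization SNR for an n-bit linear PCM system (rule of thumb).
        return 6.02 * bits + 1.76

    print(round(dynamic_range_db(12)))  # ~74 dB, roughly the FM receiver noise floor
    print(round(dynamic_range_db(16)))  # ~98 dB, the common streaming word length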

For this article, some of the best audio processing engineers in the field — Jeff Keith, Frank Foti, Bob Orban and Greg Ogonowski — agreed to answer two fundamental questions: “What are the design differences between OTA processors and those designed for streaming media?” and “How should one’s approach to processing audio for streaming media or podcasts differ from that of on-air processing?”

Jeff Keith is a senior product development engineer at Wheatstone; Frank Foti is CEO of the Telos Alliance; Bob Orban is a consultant to Orban Labs Inc., owned by DaySequerra; and Greg Ogonowski is president of StreamS/Modulation Index LLC.

PRIMARY DIFFERENCES BETWEEN OTA AND STREAMING

Keith: The primary and most important difference is that the audio peak limiting schemes are completely different in on-air and streaming processors. Further, FM on-air processors also utilize a very aggressive boost of high frequencies [pre-emphasis], which is both unnecessary and undesirable in streaming applications.

Foti explained that the difference is not only in the functionality of the processing gear, but in the goals themselves.

Foti: Employ processing for consistent source-to-source level and tonal balance. While on-air processing also accomplishes this, there is the competitive loudness quest that broadcasters desire. This is not as prevalent in the streaming world, due to the buffering delay associated with connecting to another stream. Best to say that processing for sonic consistency and vocal intelligibility is most important.

Another important factor regarding the coded system is headroom. Digital systems have an absolute maximum ceiling of 0 dBFS. Theoretically, audio levels for transmission should be able to be set right up to this level. But, depending upon the encode/decode implementation, overshoots may occur.

This is not consistent from codec to codec, but is due more to how each manufacturer implements the codec. Additional input low-pass filters in the encoder may cause headroom difficulties. A well-designed encoder will ensure that any added input filters possess the same headroom as the rest of the system, without generating overshoots that reduce headroom. Most filter overshoots are of the 2-3 dB magnitude, but can exceed this amount depending upon filter characteristics.

It would be wise to test any codecs within a specified infrastructure to make sure that 0 dBFS is attainable without system overload or clipping. For this reason, setting the absolute peak level 2-3 dB below 0 dBFS offers insurance against clipping.
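
One way to run the codec test Foti describes is to push a peak-normalized signal through an encode/decode round trip and measure how far the output peaks rise above the input peaks. A minimal sketch in Python, in which encode and decode are placeholders for whatever codec implementation is under test (for example, a subprocess wrapping a command-line AAC or MP3 tool), not real library calls:

    import numpy as np

    def roundtrip_overshoot_db(x, encode, decode):
        # How far does a codec round trip push peaks past the input peak?
        out = decode(encode(x))
        in_peak = np.max(np.abs(x)) + 1e-12
        out_peak = np.max(np.abs(out)) + 1e-12
        return 20.0 * np.log10(out_peak / in_peak)

A result in the 2-3 dB range corroborates the headroom figures quoted above.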

Orban: The analog [FM] channel requires state-of-the-art pre-emphasis limiting to achieve competitive loudness and minimize pre-emphasis-induced high frequency loss. This usually implies use of sophisticated distortion-canceled clipping. The streaming channel, on the other hand, has no pre-emphasis but is typically heavily bit-reduced via a perceptual codec. This creates an entirely different set of requirements: The peak limiting must not use clipping because there is no bit budget available to encode clipping-induced distortion products.

However, pre-emphasis limiting is unnecessary. The best technology for peak limiting the streaming channel is therefore look-ahead limiting, which can perform very clean peak reduction on flat channels, but which is unsuitable for pre-emphasized channels.
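
To make the distinction concrete, below is a deliberately simplified look-ahead limiter sketch in Python. A production design would ramp the gain smoothly across the look-ahead window rather than attacking instantly, and the ceiling, window length and release constant here are arbitrary assumptions:

    import numpy as np

    def lookahead_limit(x, ceiling=0.708, lookahead=64, release=0.9995):
        # x: float samples on a +/-1.0 scale; ceiling 0.708 is roughly -3 dBFS.
        # Gain that would pin each individual sample exactly at the ceiling:
        g = np.minimum(1.0, ceiling / (np.abs(x) + 1e-12))
        out = np.empty_like(x)
        env = 1.0
        for n in range(len(x)):
            target = g[n:n + lookahead].min()  # worst case in the near future
            if target < env:
                env = target  # attack before the peak arrives
            else:
                env = env * release + target * (1.0 - release)  # recover slowly
            out[n] = x[n] * env
        return out

Because the gain envelope begins its reduction before the peak it was computed from, no sample is ever clipped.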

CONSIDERATIONS FOR PROCESSING STREAMING MEDIA

As the importance of streamed versions of our radio stations continues to grow, it’s important to consider just what is involved in effectively processing audio for that delivery method. In the early days of streaming, the bandwidth available to radio stations, as well as to end-users, was far more limited than now; lossy codecs were the order of the day.

Today there’s more and more talk of high-resolution streaming, but even so, very few stations offer completely “lossless” audio streams.

In the last several years, Ogonowski has turned his focus to streaming media, and in answering my second question, he considered both linear and lossy codecs.

Ogonowski: Audio processing considerations differ for linear PCM and coded digital audio. Linear PCM doesn’t have perceptual audio encoders and decoders in the signal path that need special attention.

Both linear PCM and coded audio systems should use over-sampled limiters to prevent any 0 dBFS+ or true-peak buildup after A/D conversion. Any time energy is removed from a peak-controlled audio signal, or its group delay is disturbed, it runs the risk of peak overshoot, and hence system overload. Peak-controlled signals in linear PCM systems need attention only to the low- and high-frequency responses of the systems through which the signal is passed, in order to maintain proper peak levels accurately.
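
As an aside, a rough approximation of such an over-sampled true-peak check, in the spirit of the 4x-oversampling measurement in ITU-R BS.1770, might look like this sketch (the oversampling factor and SciPy’s default filtering are assumptions):

    import numpy as np
    from scipy.signal import resample_poly

    def true_peak_dbfs(x, oversample=4):
        # Upsample to approximate the waveform a DAC will reconstruct;
        # inter-sample peaks can exceed the highest digital sample value.
        up = resample_poly(x, oversample, 1)
        return 20.0 * np.log10(np.max(np.abs(up)) + 1e-12)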

In coded audio systems, such as AAC or MP3, there is another consideration that must be taken into account. Perceptual audio encoder/decoder signal paths remove energy within the audio passband, and hence disturb peak levels there as well. The more bit reduction, the more the overshoot. The overshoot happens at the output of the encoder, where it cannot be touched by additional peak limiting. Hence, the output of the decoder will also contain the peak overshoot.

Audio codecs using SBR [Spectral Band Replication], such as HE-AACv1/v2 and the HD Radio codec, need even more headroom, since the SBR causes additional overshoot. So the easiest way to prevent these systems from overloading and clipping is to reduce peak audio levels into the encoders to at least -3 dBFS, allowing overshoot headroom. If adequate overshoot headroom is not provided, the results are unpredictable; exactly what happens depends upon the exact system.
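
Applying that headroom ahead of the encoder can be as simple as a static gain scale, as in the sketch below; in a real chain the -3 dBFS ceiling would be enforced by the final look-ahead limiter rather than by a one-shot normalization:

    import numpy as np

    def apply_encoder_headroom(x, ceiling_dbfs=-3.0):
        # Scale so the highest sample sits at the ceiling, leaving room
        # for codec- and SBR-induced overshoot downstream.
        ceiling = 10.0 ** (ceiling_dbfs / 20.0)
        peak = np.max(np.abs(x)) + 1e-12
        return x * min(1.0, ceiling / peak)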

Foti: It is possible for lower bitrate channels to offer high quality and clear intelligibility through the use of a dedicated processor that employs the means to understand and handle the challenges of the coded audio path. For those who wish to tweak on their own with existing processing equipment, the following should be observed: Avoid dense processing that contains fast limiting time constants. Try to reduce the attack time on functions when 5 dB or more depth-of-compression is desired. This will reduce processor-induced IMD in the upper frequencies.

Make sure that the coding system provides full headroom. If the system clips on its own before 0 dBFS, then reset the maximum input level to avoid system headroom problems.

Low bitrates benefit from bandwidth control. A static low-pass filter will reduce artifacts. The tradeoff is perceived high-frequency content vs. quality. A specialized processor for coded audio will offer some dynamic method to accomplish this.
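
A static low-pass of the kind described here is straightforward to sketch with SciPy; the cutoff frequency and filter order below are assumptions to be tuned per bitrate:

    from scipy.signal import butter, sosfilt

    def static_lowpass(x, rate, cutoff_hz=12000.0, order=8):
        # Fixed low-pass ahead of a low-bitrate encoder: trades perceived
        # high-frequency content for fewer coding artifacts.
        sos = butter(order, cutoff_hz, btype="low", fs=rate, output="sos")
        return sosfilt(sos, x)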

Do not use any final limiter that contains a clipper. The THD generated by the clipping function will cause more trouble than it’s worth. Precision peak control is needed in the coded system. A specialized processing system for this medium will provide a look-ahead limiter to accomplish this task.

Be mindful of system headroom. Set the processing system to operate with an output level no greater than -3 dBFS. Allowing 3 dB of headroom will prevent distortion caused by less-than-adequate digital-to-analog converters downstream.

If the above items are followed, improved coded audio will result.

Orban: As for the differences in approaching processing, this depends on your goals. If you want the stream to sound like radio, then except for the peak limiting, you can use the same processing chain, including elements like AGC, stereo enhancement, EQ and multiband compression.

If you want to sound like the original recording, then the processing can be as simple as static normalization of the source file to a target BS.1770 integrated loudness. However, static loudness normalization can cause inconsistencies at program boundaries, so I prefer adding some sort of online audio processing, such as a simple AGC that normally does perhaps 3-4 dB of gain reduction. This is enough to smooth out most transitions if the source files are already loudness-normalized. In either case, the program can often benefit from left/right phase-skew correction, applied before the other processing blocks, which makes the audio easier to encode.

In all cases, it is important in streaming to allow headroom for codec overshoots, which can either cause clipping in player devices or trigger a peak limiter of uncertain quality in the player. With typical low-bitrate streaming [32 or 48 kbps HE-AACv2], I recommend allowing 3 dB of headroom. It is also important for the peak limiter to be “true-peak” aware, so that it anticipates the peak level that will appear after the player’s DAC, which can be several dB higher than the highest digital sample.

The AES document AES TD1004.1.15-10 [“Recommendation for Loudness of Audio Streaming and Network File Playback”] recommends a BS.1770 Integrated target loudness of -16 to -20 LUFS. This is low enough to produce little peak limiting, thereby allowing a simple look-ahead limiter to produce good results, while being high enough to achieve satisfying listening levels on typical player devices like iPhones.
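
For illustration, a static BS.1770 normalization to a target inside that window might look like the sketch below, which assumes the third-party pyloudnorm library; the -18 LUFS target is simply one point within the recommended range:

    import pyloudnorm as pyln  # third-party BS.1770 loudness meter

    def normalize_to_target(x, rate, target_lufs=-18.0):
        # Measure integrated loudness, then apply a static gain offset.
        meter = pyln.Meter(rate)
        loudness = meter.integrated_loudness(x)
        return pyln.normalize.loudness(x, loudness, target_lufs)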

Many streaming providers choose higher target loudness because of the usual loudness wars concerns of sounding wimpy and getting lost on the dial, or the streaming equivalent thereof. But if you allow 3 dB of peak headroom, then going more than a few dB above the AES recommendation is likely to degrade audio quality because of peak limiter artifacts.

Keith: One way to think of the difference is to compare the usual goals in both cases. On-air processing is typically quite aggressive, mainly because stations generally want to be louder than their competition. The loudness race is further fueled by the ability to instantly flip back and forth between stations to compare loudness.

While radio people find loudness to be a critically important criterion, most listeners couldn’t care less about it.

In streaming applications, achieving maximum loudness isn’t as important as creating a stream that can be listened to for long periods of time. Comparing loudness is also much more difficult in the streaming case, because buffering within the streaming technology and interconnecting networks makes instant loudness comparisons impossible.

GARBAGE IN VS. GARBAGE OUT

Any radio engineer who has dealt with audio processing knows the “garbage in versus garbage out” concept: If the audio going into a processor sounds bad, the audio coming out of the processor will sound bad. The obvious implication is that you should do all you can to ensure the source material is as clean as possible.

Ogonowski: Good processed audio results are completely dependent upon the quality of the source audio. There is only so much that can be done to fix poor sources in audio processing, especially if sources are coded audio.

Storage is cheap today, and computer systems are more than fast enough to use linear PCM formats, such as .wav or .aiff. MP3 should never be used.

If coded audio must be used for whatever reason, AAC at 256 kbps should be used, such as that from the iTunes Music Store. It should be remembered that these sources will then be coded by the streaming or digital radio encoders, so encode-decode cycles should be kept to a minimum to deliver the best audio quality to the listener, which is what counts.

Many canned libraries available to broadcasters have varying levels of quality, ranging from OK to poor. If you want this done right, do it yourself, and get your own sources from known-quality CDs or record company files.

The media through which we reach our listeners has evolved over time, but certain fundamentals of audio processing remain the same. The final principle — and perhaps most important — is that you need to care about the end result.
