A Look at the Proposed MPEG Standard for Sending Surround and Stereo Over the Same Channel
Much of the discussion regarding surround sound on digital radio worldwide these days has involved a developing specification in ISO/MPEG called Spatial Coding, or SC.
On its surface, it is a mechanism that allows a stereo or mono audio signal to be sent in its usual form, but accompanied by a small auxiliary data stream that describes how a surround mix of the current signal would be created.
Legacy receivers just ignore this aux data stream and play out the stereo or mono audio as usual, while SC-enabled devices interpret the aux data and apply it to the same audio signal to recreate the surround mix. The system is codec-agnostic, so it could conceivably be applied to any transmission or storage scheme. It also is scalable over a wide range of input and output channels (meaning that it is not fixed at encoding 5.1 audio into stereo, but could also be used to extract 10.2-channel audio from a mono signal, for example).
Conceptually this seems simple enough, and also sounds like a great solution for managing digital radio transmission that addresses a variety of emerging content and listening environments – just as the stereo multiplex provided backward compatibility to existing FM mono transmission in the 1960s.
But for those not intimately familiar with the technologies involved, how the system pulls this off can seem hard to fathom once you actually start to think about it. For those accustomed to matrix surround, it’s hard to understand how Spatial Coding can faithfully recreate a surround signal using a very low bit rate data channel (~5 kbps), even from a monaural audio feed. (Matrix surround always requires at least a stereo transmission channel, hence its “4-2-4” nomenclature.)
Perhaps hardest of all to grasp, however, even for those comfortable with other 5.1 coding systems (like AC-3), is how the system can carry either a downmixed surround signal or a wholly separate “artistic stereo” mix as its audio, and still recreate an acceptable multichannel presentation at the receiver. (This implies that the audio signal seen by the decoder may be different from the one the encoder used in generating the spatial data signal.)
So to sort this all out, let’s dig in a bit to the system’s interesting design.
Do it with frequency
Like most perceptual audio coding systems, MPEG Spatial Coding does most of its work in the frequency domain. This means that multichannel source audio is first converted from the time domain to the frequency domain, and analysis of each audio channel is then done in so-called critical bands, which are based on how human hearing perceives sound. (The bandwidths of critical bands are set to the minimum frequency resolutions of human hearing at various frequencies – some bands are wider than others – and they are the basis for the spectral masking algorithms used by all perceptual audio coding systems.)
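The banding idea can be sketched in a few lines of Python. This is an illustration only, not the MPEG algorithm: it transforms one frame of audio with an FFT and groups the bins into bands that widen with frequency, loosely mimicking critical-band behavior. The band edges here are hypothetical round numbers, not the standard's actual filter bank.

```python
import numpy as np

def band_energies(frame, sample_rate, band_edges_hz):
    """Group FFT bins into bands and return the energy in each band."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    energies = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        energies.append(np.sum(np.abs(spectrum[mask]) ** 2))
    return np.array(energies)

# Hypothetical band edges: narrow at low frequencies, wide at high,
# loosely echoing how critical bands scale with frequency.
edges = [0, 100, 200, 400, 800, 1600, 3200, 6400, 12800, 22050]

fs = 44100
t = np.arange(1024) / fs
tone = np.sin(2 * np.pi * 1000 * t)   # 1 kHz test tone
e = band_energies(tone, fs, edges)
# The 800-1600 Hz band (index 4) dominates for a 1 kHz tone.
assert np.argmax(e) == 4
```

Per-band analysis like this is what lets the encoder treat each spectral region independently in the steps that follow.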
Instead of using this analysis to reduce digital audio coding bit rates, however, Spatial Coding uses it to extract spatial cues from each band. Such cues are derived by comparing the channels against one another for level and phase differences within each spectral band. These deltas between pairs of channels can then be robustly encoded using a relatively small amount of data, which is sent to the receiver via a data side-chain transmission. (This technique is adapted in part from the older joint-stereo coding technique used by some perceptual coders.)
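To make the cue extraction concrete, here is a hedged sketch (not the standardized MPEG syntax) of the two classic spatial cues for one band of a channel pair: the inter-channel level difference in dB and a normalized inter-channel correlation. Coarsely quantizing such cues per band, per frame, is what keeps the side-chain down to a few kbps; the 1.5 dB step size below is an arbitrary illustration.

```python
import numpy as np

def spatial_cues(left_band, right_band):
    """Level difference (dB) and normalized correlation for one band."""
    e_l = np.sum(np.abs(left_band) ** 2)
    e_r = np.sum(np.abs(right_band) ** 2)
    ild_db = 10 * np.log10((e_l + 1e-12) / (e_r + 1e-12))
    corr = np.real(np.vdot(left_band, right_band)) / np.sqrt(e_l * e_r + 1e-12)
    return ild_db, corr

# A band whose source sits mostly in the left channel:
rng = np.random.default_rng(0)
src = rng.standard_normal(64)
left, right = src * 1.0, src * 0.5          # right is ~6 dB down
ild, corr = spatial_cues(left, right)

# Quantizing the level cue to (say) 1.5 dB steps takes only a few
# bits per band, which is why the side data can stay so small.
ild_index = int(round(ild / 1.5))
assert 5.9 < ild < 6.1      # ~6 dB inter-channel level difference
assert corr > 0.99          # the band is fully correlated
```

At the decoder, cues like these steer how energy from the transmitted signal is redistributed across the output channels.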
Also included in this data signal are prediction parameters that govern how audio elements spread across groups of channels are mapped, which is conceptually similar to the steering signals used in advanced matrix systems to aid image stability. A final component of the data describes the actual audio signal’s dynamic deviations from those fixed prediction models – a kind of steering “servo” signal.
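The prediction-plus-deviation idea can be sketched as follows. This is a hypothetical illustration, not the MPEG prediction syntax: one original channel is predicted from the downmix with a single per-band gain (the fixed model), and the residual measures how far the real signal strays from that prediction (the "servo" component).

```python
import numpy as np

# A band of the downmix, and an original channel that is mostly
# (but not perfectly) predictable from it.
rng = np.random.default_rng(1)
downmix = rng.standard_normal(64)
channel = 0.8 * downmix + 0.05 * rng.standard_normal(64)

# Least-squares prediction gain for this band (the fixed model).
gain = np.dot(channel, downmix) / np.dot(downmix, downmix)

# The residual is the dynamic deviation from that model.
residual = channel - gain * downmix

# The residual carries far less energy than the channel itself,
# so describing it costs very little side data.
assert np.sum(residual**2) < 0.05 * np.sum(channel**2)
```

Because most of each channel is explained by the prediction, only the small residual term needs dynamic description.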
Manual or automatic transmission
As this spatial data signal is sent to an aux data output, the multichannel audio signal is meanwhile downmixed to stereo (or mono, if necessary), then reconverted to the time domain for presentation to the transmission or storage system’s coding and modulation components.
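A minimal sketch of such an automatic 5.1-to-stereo downmix, using the conventional ITU-R BS.775 coefficients (the encoder's actual downmix matrix is configurable; these -3 dB values are merely the customary defaults, and the LFE channel is conventionally omitted):

```python
import numpy as np

def downmix_51_to_stereo(L, R, C, LFE, Ls, Rs):
    """ITU-style automatic downmix; LFE is conventionally discarded."""
    a = 1 / np.sqrt(2)            # -3 dB for center and surrounds
    left = L + a * C + a * Ls
    right = R + a * C + a * Rs
    return left, right

n = 8
zeros, ones = np.zeros(n), np.ones(n)
# A center-only signal lands equally (at -3 dB) in both channels:
lt, rt = downmix_51_to_stereo(zeros, zeros, ones, zeros, zeros, zeros)
assert np.allclose(lt, rt)
assert np.allclose(lt, 1 / np.sqrt(2))
```

The resulting stereo pair is what the legacy audience hears, and what the spatial data will later act upon.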
Alternatively, a wholly separate “artistic” mix (or “handmade downmix”) can be substituted at this point, such that the content transmitted or stored will be this alternate signal rather than an automatic downmix of the multichannel audio. In either case, legacy decoders will encounter only the stereo (or mono) signal, while new systems will apply the data channel’s spatial coding to the same audio signal and derive a multichannel output.
As noted earlier, the spatial data is robust enough for the decoder to extract a multichannel mix even when an artistic audio input is transmitted instead of the original multichannel audio’s downmix. Nevertheless, a relatively new feature of the system lets the SC encoder compare its input and output audio signals. If it detects a substantial difference – as it might in some cases where the artistic mix option is selected – it can adjust its spatial coding parameters so they are optimized for the decoder to reconstruct the multichannel audio signal from the artistic stereo mix rather than from the encoder’s own automatic downmix.
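One way to picture this comparison step is the following hypothetical sketch (the function, threshold, and compensation rule are all illustrative assumptions, not the standard's mechanism): if the artistic mix differs substantially from the automatic downmix in some band, the encoder re-scales the level parameter it transmits so the decoder's upmix is matched to the signal it will actually receive.

```python
import numpy as np

def adjust_band_gain(auto_band, artistic_band, threshold_db=1.0):
    """Return a compensating gain if the two mixes' band energies differ."""
    e_auto = np.sum(auto_band**2) + 1e-12
    e_art = np.sum(artistic_band**2) + 1e-12
    diff_db = 10 * np.log10(e_art / e_auto)
    if abs(diff_db) > threshold_db:
        # Compensate the spatial parameter for the energy mismatch.
        return np.sqrt(e_auto / e_art)
    return 1.0

band = np.ones(32)
# Artistic mix is 6 dB hotter in this band than the auto downmix,
# so the transmitted parameter is scaled down to match.
g = adjust_band_gain(band, 2.0 * band)
assert abs(g - 0.5) < 1e-6
```

When the two mixes agree within the threshold, the parameters pass through unchanged.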
There are several other clever techniques used in the MPEG-SC system that improve its performance and efficiency. The system also offers quite a bit of encoding adjustment and scalability, along with the ability to remain transport-agnostic, allowing it to be used across a variety of applications besides digital radio broadcasting. (The spatial data channel includes a metadata block that communicates these settings to the decoder for optimum performance and extensibility.)
To learn more about this system’s inner workings, see AES Convention Paper 6447, “The Reference Model Architecture for MPEG Spatial Audio Coding,” presented at the 118th Convention, May 2005, Barcelona, Spain.