Perceptual audio coding
Feb 1, 2001 12:00 PM, By Kevin Nosé
Chances are very good that you have been exposed to the artifacts of perceptual audio coding without even knowing it.
Perceptual audio coding, in a general sense, is a method for reducing the amount of data required to represent a digital audio signal. The method is inherently lossy: the reconstructed signal carries a certain amount of noise that would be plainly audible if heard on its own. However, perceptual audio coding specifically shapes that noise so that it falls below the threshold of human perception when heard in the context of the original signal. This distinction is very important, and it sets perceptual audio coding schemes apart from other schemes, such as μ-law or ADPCM, that don’t take advantage of human hearing limitations.
The idea behind perceptual audio coding is that the presence of certain auditory stimuli can influence the brain’s ability to perceive certain other stimuli. Put simply, some sounds can drown out, or mask, others. A coding process can take advantage of this by not encoding those aspects of the audio signal that would be masked from the listener. Several coding schemes today make use of this premise, including MPEG Audio Layers 1 through 3 and AAC, Microsoft’s Windows Media Audio, Lucent’s PAC, ATRAC (used for MiniDisc), and some of RealNetworks’ RealAudio codecs.
The figure shows the key components of a single channel perceptual audio coding chain. In the encoder, the incoming audio signal is broken down into multiple bands across the frequency spectrum by a transform process. Once transformed, the data contained within each band can be treated independently, allowing individual bands to be represented with varying degrees of resolution. When resolution is reduced in a particular band, the amount of quantization noise will increase around that band’s corresponding part of the audio spectrum. An inverse transform is performed in the decoder to combine the multiple bands and restore the audio signal. In the case where no resolution is removed from any band, the transform/inverse transform process is ideally lossless.
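The transform, per-band resolution and inverse transform steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a small orthonormal DCT as a stand-in for whatever transform a real codec uses, treats each coefficient as its own "band," and the block size, test signal and step sizes are invented for the example.

```python
import math

def dct(x):
    """Orthonormal DCT-II: time-domain block -> frequency coefficients."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) *
                 math.cos(math.pi * (t + 0.5) * k / n) for k in range(1, n))
        out.append(s)
    return out

def quantize(coeffs, steps):
    """Round each coefficient to a multiple of its step size.
    A step of 0 means "leave this band at full resolution"."""
    return [c if q == 0 else round(c / q) * q for c, q in zip(coeffs, steps)]

def max_error(a, b):
    """Largest sample-by-sample difference between two blocks."""
    return max(abs(p - q) for p, q in zip(a, b))

# Hypothetical 16-sample test block: a strong low tone plus a weak high tone.
N = 16
block = [math.sin(2 * math.pi * 1.5 * t / N) +
         0.25 * math.sin(2 * math.pi * 5.5 * t / N) for t in range(N)]

coeffs = dct(block)

# No resolution removed: the round trip is lossless up to float rounding.
lossless = idct(coeffs)

# Coarse steps in the upper half of the spectrum only (assumed values).
steps = [0] * 8 + [0.5] * 8
coded = idct(quantize(coeffs, steps))

print("round-trip error, full resolution:", max_error(block, lossless))
print("round-trip error, coarse upper bands:", max_error(block, coded))
```

Because the transform here is orthonormal, the error introduced by rounding the upper coefficients stays confined to that part of the spectrum, which is the property the encoder relies on when it decides where noise can be hidden.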
After the transform, each band in the encoder is restricted to an appropriate amount of resolution to satisfy the target coded bit-rate while maintaining as much detail in key bands as possible. The encoder is constantly analyzing the incoming audio signal and making decisions on where in the spectrum and to what extent noise can be masked from the listener.
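One common way to picture this resolution decision is a greedy bit-allocation loop: hand out bits one at a time to whichever band's quantization noise is currently furthest above its masking threshold. The sketch below assumes the psychoacoustic analysis has already produced a signal-to-mask ratio (SMR) in dB for each band, and uses the usual rule of thumb that each extra bit lowers quantization noise by about 6 dB. The function name, SMR values and bit budgets are hypothetical, and this is not the allocation routine of any particular codec.

```python
def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
    """Greedy allocation sketch.

    smr_db: assumed per-band signal-to-mask ratios in dB.
    total_bits: bit budget for this block, set by the target bit-rate.
    Returns the number of bits assigned to each band.
    """
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        # Noise-to-mask ratio: how far the noise sits above the masking
        # threshold, assuming roughly 6.02 dB less noise per bit spent.
        nmr = [s - 6.02 * b for s, b in zip(smr_db, bits)]
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits_per_band]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: nmr[i])
        if nmr[worst] <= 0:
            break  # noise is already masked in every band; keep spare bits
        bits[worst] += 1
    return bits

# Hypothetical SMR values for four bands, from most to least exposed.
smr = [20.0, 12.0, 3.0, -5.0]
print(allocate_bits(smr, total_bits=8))
print(allocate_bits(smr, total_bits=3))
```

With a generous budget the loop stops early once the noise is masked everywhere. With a tight budget the bits go to the most exposed bands first, and the remaining bands are left with audible noise, which is where coding artifacts come from.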
Finally, the bitstream format stage takes the transformed, minimized data and assembles it into a bitstream that the decoder can understand. Additional information is included with the audio data at this stage that identifies the bitstream as a specific type with specific operating parameters, such as sample rate and bit-rate. The bitstream formatting and decoding stages can also make use of various error detection and correction techniques if necessary.
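The framing step can be illustrated with a made-up frame layout: a sync word, the sample rate and bit-rate, a payload length, the coded payload, and a CRC-32 for error detection. Every field, size and constant below is an assumption chosen for the example; real formats such as MPEG audio define their own headers, and this sketch does not reproduce any of them.

```python
import struct
import zlib

SYNC = 0xA5C3                      # hypothetical sync word
HEADER = struct.Struct(">HIHH")    # sync, sample rate (Hz), bit-rate (kbps), payload length

def pack_frame(sample_rate_hz, bitrate_kbps, payload):
    """Assemble header + payload + CRC-32 into one frame of bytes."""
    header = HEADER.pack(SYNC, sample_rate_hz, bitrate_kbps, len(payload))
    crc = zlib.crc32(header + payload)
    return header + payload + struct.pack(">I", crc)

def unpack_frame(frame):
    """Parse one frame; raise ValueError on a bad sync word or CRC."""
    sync, sample_rate_hz, bitrate_kbps, length = HEADER.unpack_from(frame, 0)
    if sync != SYNC:
        raise ValueError("bad sync word")
    body_end = HEADER.size + length
    payload = frame[HEADER.size:body_end]
    (crc,) = struct.unpack_from(">I", frame, body_end)
    if zlib.crc32(frame[:body_end]) != crc:
        raise ValueError("CRC mismatch")
    return sample_rate_hz, bitrate_kbps, payload

# Round trip with placeholder payload bytes standing in for coded audio.
frame = pack_frame(44100, 128, b"\x01\x02\x03\x04")
print(unpack_frame(frame))

# Flip one payload bit to show the decoder detecting the damage.
damaged = bytearray(frame)
damaged[HEADER.size] ^= 0x01
try:
    unpack_frame(bytes(damaged))
except ValueError as err:
    print("decoder rejected frame:", err)
```

A CRC only detects errors. Correcting them, as the paragraph above mentions, would need additional redundancy such as a forward error correction code.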
The effectiveness of a perceptual coding scheme depends on how accurately it can match the perceptual limitations of human hearing, but it will also depend on having enough transmission bandwidth to support all the detail that the human sense of hearing is capable of perceiving. Thanks to the promise of ever-improving bandwidths, this will be less of an issue in the future, and perceptual audio codecs can continue to be the highest quality lossy transmission scheme you’ve never noticed.
Kevin Nosé is president and director of engineering of NeoSonic Industries, Cleveland.