Radio Applications of MPEG-7

Practical Uses Are Emerging for This Metadata Standard

Yes, Virginia, there really is an MPEG-7 – and an MPEG-21, too, since you asked.

Many readers may wonder how the numbers got that high so quickly. Wasn’t it just a few short years ago that MPEG-2 was issued? And hasn’t there been more recent discussion about MPEG-4? Did we miss something important?

Perhaps some review is in order.

Let’s start at the beginning. In 1988, the Moving Picture Experts Group (MPEG) was formed under the International Organization for Standardization (ISO), and it subsequently issued a specification for the perceptual coding of digital audio and video signals.

Retrospectively, this specification was named MPEG-1, after the next set of specs was issued by the group as “MPEG-2” a few years later (in a process similar to our application of “WW-I” to what had been called “the Great War” prior to WW-II).

The audio coding components in MPEG-1 and MPEG-2 were called Audio Layers I, II and III, which gave rise to audio professionals’ shortcuts of “MPEG Layer II” and “Layer III” (or the occasional misnomer of “MPEG-2” and “MPEG-3”), and the now well-known consumer audio file format “mp3.”

MPEG-2 video coding has since become the world standard for DTV and DVD, and for a while it was envisioned that an MPEG-3 family of codecs would occupy the same functionality for HDTV. But as it turned out, MPEG-2 had a range of scalability – specifically, a matrix of profiles and levels – that allowed it to accommodate the requirements of HDTV, so the MPEG-3 work was dropped.

Jumping ahead

Meanwhile, streaming media on the Internet was developing. The MPEG-4 initiative was started in the late 1990s, at first to address the specific needs of low bit-rate media representation in this environment.

Yet unlike its predecessors, which were targeted at fairly specific applications, the MPEG-4 process grew to encompass a wide range of different elements. About the only thing they all shared was their application to areas not especially well served by MPEG-1 or -2 (such as online media, gaming, datacasting, wireless telecom, graphics, object-oriented representation, etc.).

So today, MPEG-4 includes a number of widely divergent “parts,” each of which specifies a different technology for a targeted application. It also includes the ability to integrate digital rights management (DRM) to media content.

So now, on to MPEG-5, right? Well, no.

There are numerous theories why the MPEG organization decided to skip directly from MPEG-4 to MPEG-7, ranging from the sensible – e.g., it looked like the MPEG-4 work would be so broad that it might be broken into several different initiatives, so like the assignment of street-address numbers by the post office, room was left for adjacent expansion – to the quirky – e.g., MPEG’s idiosyncratic leader, Leonardo Chiariglione, simply wanted to leave people guessing.

The truth probably lies somewhere in between. Standards people have a unique world view, and often an offbeat sense of humor. You have to be a little wacky to sit in a room all day and discuss this stuff.

Even more questions of sequence were raised when the organization named its next initiative “MPEG-21,” and the range of reasons behind its derivation became even more inventive, to the point where it’s best just not to ask anymore.

A major shift

The discontinuities in MPEG’s numbering scheme do point out the substantial changes in focus among the group’s processes, however.

While all of MPEG’s work remains associated with digital audio and video media, the MPEG-1 through MPEG-4 efforts essentially are essence (i.e., program content) coding or representation systems, primarily intended to provide high-quality, efficient transmission or storage of digital media programs.

In contrast, MPEG-7 is a metadata specification, which provides a language for describing media content, and MPEG-21 intends to be an interoperable multimedia framework or a complete “content-delivery platform” specification, incorporating comprehensive content identification, usage-environment description and rights expression schemes. Some have proposed, only half-facetiously, that MPEG’s next efforts be called MPEG-33, MPEG-45 and MPEG-78.

Actual products are beginning to emerge from the MPEG-7 work. As you might expect, “traditional” metadata typically is the realm of program librarians or search engines; and indeed MPEG-7 provides a powerful and interoperable XML-based language and environment for sophisticated management, searching and filtering of the content it describes.

But MPEG-7 goes beyond this conventional approach to metadata and its “tagging” methodology to use what it calls low-level descriptors, which enable rich and direct descriptions of minute details about the content’s structure. Recently, a few applications have surfaced that use the audio low-level descriptors of MPEG-7 for some interesting and practical functions extending well beyond the world of library science.

Metadata defined

Most professionals today think of metadata in two basic categories:

Descriptive: Data about the content in terms of genre, the talent involved, materials included, the source pedigree/history of each content element, number of channels or tracks, rights conveyed, etc.

Compositional: Data that describes how a program is put together, in terms of in/out edit points, language selection, multichannel mapping, etc., such that different versions of the show might be derived from a single batch of audio elements based on how the metadata is written.

In MPEG-7 terms, these metadata types are considered “high-level descriptors” or Description Schemes (DS), which are compiled from collections of low-level descriptors that are the result of detailed analysis of the content’s actual data samples and signal waveforms. MPEG-7 expresses these descriptions in XML, thereby providing a method of describing the audio or video samples of the content in textual form.

In this way, a program can be specifically described in a standardized form that can be parsed by a text processor and stored in a database. Among other things, this provides a reliable, convenient and computationally efficient way to identify a particular piece of content without actually decoding its essence data or analyzing its waveform, but rather by simply scanning its textual description.
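As a rough illustration of that idea, the Python sketch below parses a small textual description and looks up a program from it without touching any audio samples. The element names (`Description`, `Title`, `Genre`, `Channels`, `Signature`) are hypothetical and greatly simplified; they are assumptions made for this example and do not follow the actual MPEG-7 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified description for illustration only.
# These element names are NOT taken from the real MPEG-7 schema.
DESCRIPTION_XML = """
<Description>
  <Title>Station Jingle A</Title>
  <Genre>Jingle</Genre>
  <Channels>2</Channels>
  <Signature>0.12 0.48 0.31 0.09</Signature>
</Description>
"""


def parse_description(xml_text):
    """Turn the textual description into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "title": root.findtext("Title"),
        "genre": root.findtext("Genre"),
        "channels": int(root.findtext("Channels")),
        # Low-level values are stored as text and converted to numbers here.
        "signature": [float(v) for v in root.findtext("Signature").split()],
    }


record = parse_description(DESCRIPTION_XML)

# A tiny in-memory "database" keyed by title; only text was read, no audio.
database = {record["title"]: record}
print(database["Station Jingle A"]["genre"])
```

Because the description is ordinary text, the same record could just as easily be written to a relational database or indexed by a search engine.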


An application of this technique with potential interest to radio broadcasters has emerged in a system called AudioID, produced by the Fraunhofer Institute for Integrated Circuits, the originators of Layer III coding.

Using Part 4 of the MPEG-7 standard (officially, ISO/IEC International Standard 15938-4), which covers audio description, AudioID provides a robust and scalable method of recognizing and identifying audio programs, in the following manner:

Sound clips (e.g., songs, radio spots, voice tracks, etc.) are played into the AudioID system, which applies some preprocessing, then extracts certain spectral features from each sound based on specialized psychoacoustic principles, ultimately generating a representation of those features using MPEG-7 low-level descriptors.

This XML-based representation, called a fingerprint, is stored in a database under a given file identity (e.g., the name of the song, advertiser, voice talent, etc.), completing what is called the Training Phase of the system. Later, during the Recognition Phase, an audio program can be played into the AudioID system, which will apply the same processing and feature-extraction process, comparing the results to the fingerprint descriptions stored in its database.

If it finds a match with sufficient confidence, it will declare the identity of the audio being auditioned, in real time.
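The two phases can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Fraunhofer's algorithm: the four-number feature vectors stand in for whatever the real system extracts, the `FingerprintDB` class and its `max_distance` threshold are hypothetical, and plain Euclidean distance is swapped in for the actual matching method, which the article does not describe.

```python
import math


class FingerprintDB:
    """Toy fingerprint store; a stand-in for the real system's database."""

    def __init__(self, max_distance=0.25):
        # Hypothetical confidence threshold: larger distances are rejected.
        self.max_distance = max_distance
        self.entries = {}

    def train(self, identity, features):
        """Training Phase: store the features under a known identity."""
        self.entries[identity] = list(features)

    def recognize(self, features):
        """Recognition Phase: return the closest identity, or None."""
        best_id, best_dist = None, float("inf")
        for identity, stored in self.entries.items():
            dist = math.dist(stored, features)  # Euclidean distance
            if dist < best_dist:
                best_id, best_dist = identity, dist
        return best_id if best_dist <= self.max_distance else None


db = FingerprintDB()
db.train("Song A", [0.9, 0.1, 0.4, 0.2])
db.train("Spot B", [0.2, 0.8, 0.1, 0.7])

# A slightly altered copy of "Song A" still falls within the threshold.
degraded = [0.85, 0.15, 0.35, 0.25]
# Features that resemble nothing in the database are rejected.
unknown = [0.5, 0.5, 0.9, 0.9]

print(db.recognize(degraded))
print(db.recognize(unknown))
```

The threshold is what turns "closest entry" into "a match with sufficient confidence": without it, every input would be assigned to some stored clip, however poor the fit.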

The system is designed to allow for significant distortions between the training and recognition phases, such that a sound clip that was loaded into the system with perfect fidelity directly from its original source might still be accurately identified if it is compared with a version that has been through multiple generations of recording, data compression, transmission or band-limiting (such as being played over a telephone line).

The psychoacoustics mentioned earlier play an important role here, making the system operate in a similar fashion to human hearing, which is quite tolerant of severe distortions when identifying a known sound. Moreover, the entire duration of the sound need not be played; reasonable confidence of recognition can often be obtained after playing only a few seconds of a song into the system, for example.

Name that tune

There are several practical (and potentially monetizable) uses for this system, which Fraunhofer has been impressively demonstrating around the world in recent months.

One is the automatic and reliable verification of a spot’s broadcast schedule, for use in as-aired reports and affidavits to advertisers. The flip side of this is the ability to automatically monitor and track down broadcasts or Webcasts of unauthorized content, such as the airing of music or a program for which the broadcaster has not obtained proper clearances and rights.

This concept could even be extended to content protection schemes, in which such identification could invoke decryption of encrypted content. Remember that the system is not looking for metadata tags but actually examining a description of the audio content itself, in a manner that is largely insensitive to the degradation typically encountered in content distribution chains. It is therefore difficult to spoof. It also does not require the audio to be specially processed ahead of time, such as through the addition of a watermarking signal.

Perhaps most interesting is the system’s ability to identify sound over the phone. A proposed application envisions a radio listener using his or her cell phone to call a radio station’s “ID line” and, when prompted, holding up the phone to the radio for a few seconds during a song that the listener would like to know more about (e.g., title, artist, album information for prospective purchase, etc.). The listener then hangs up and waits for an SMS message on the cell phone that provides the identification data within a few seconds.

Alternatively, the listener could call in on a land line and request that the system e-mail the information. Thinking commercially, those messages could include one-click links to purchase or download a copy of the audio, or the identification service itself could be offered via subscription. The same process could be applied to radio spots, allowing a listener to obtain further information about the advertiser.

The MPEG-7 standard likely will be adopted more widely and integrated into more products that apply its powerful content description capabilities, many of which may find valuable application within the radio broadcast industry.

For further information on MPEG-7, see