Applying Steganography to Music Captioning
Lukasz Grzegorz Maciak
Micheal Alexis Ponniah
Renu Sharma
Non-Steganographic Methods Summary
Flaws of Encode Time MP3 Steganography
Post-Encoding MP3 Steganography
DESIGNING STEGANOGRAPHIC SOFTWARE FOR MP3 FILES
Steganographic Module - Implementation Notes
Steganographic Module - Padding Byte Stuffing
Steganographic Module - Implementation Issues
IMPLEMENTING GRAPHICAL USER INTERFACE FRONT END
MP3 Player Implementation Details
The goal of this project is to embed textual information into popular media using steganography. It can be assumed that the text is relatively short compared to the media file. A good example is the relationship between a recorded song and its lyrics: the audio file containing the recording is much larger than the lyrics stored as a plain ASCII file. It is therefore probably safe to assume that the smaller file can be steganographically embedded into the larger one without impacting the quality. A similar argument could be made about video data and closed captioning information.
This project concentrates on the song/lyrics relationship in order to create a steganographically driven karaoke machine. The song lyrics will be seamlessly embedded into an audio file and then displayed on the screen when the file is played. This research will include an implementation of a steganographic algorithm for encoding data inside audio files, as well as a technique to dynamically extract that data and play it back.
The MP3 format is a very good target for this research because it is currently one of the most popular music encodings. Potential users of karaoke software are most likely to pick the mp3 format above any other audio encoding on the market. Because the end goal of this project is to create a usable piece of software, catering to the tastes and needs of the end users seems to be a good idea.
Furthermore, mp3 is an open standard, which means that it is well documented and accessible. Thus uncovering the inner workings of this format does not pose any legal threat to the researchers. Choosing a proprietary, closed format such as Windows Media Audio (WMA), on the other hand, could put the researchers in legal jeopardy.
Any steganographic research will undoubtedly rely heavily on exploiting certain properties of the data format chosen as the information carrier, and this project is no different. However, the actual research process, data examination and implementation steps could be replicated for other media to create analogous solutions.
The implementation proposed in this paper relies heavily on the specific properties of the mp3 data format. Therefore it is only logical to start the discussion by reviewing the structure of these media files.
The mp3 format is designed to store audio data, which is different from the visual information stored in images. Therefore image steganography techniques may not always work with audio data. Furthermore, unlike some image data formats, mp3 files are compressed and encoded in a very storage-conscious way. Thus they are not the best host files for steganographic data.
MP3 is a lossy data format which aims to preserve the sound quality while minimizing storage space. The encoding process takes into account the properties of the human auditory system. For example, humans cannot hear frequencies below 20 Hz or above 20 kHz. Furthermore, the human ear is often unable to distinguish between two or more notes with specific frequencies when they are played together. Thus an mp3 file can safely discard any sounds with frequencies outside the audible range, and needs to store only a single copy out of a group of similar-sounding notes. This is of course not a trivial process. MP3 encoders employ complex psychoacoustic modeling to perceptually optimize the data. [1]
The encoding process is very complex and involves both perceptual optimization and more conventional data compression methods. Figure 1 shows a conceptual model of an mp3 algorithm:

Figure 1
Explaining the actual encoding procedure is out of the scope of this paper. However, there are several tasks performed by the encoder that may be of some interest to a steganography researcher.
The mp3 encoder breaks the audio data into small fragments called frames. Each frame represents a fraction of a second. The size of the frame depends on the audio resolution, or bit rate. The most convenient way (algorithmically) to do this is to assume a constant bit rate throughout the recording, thus forcing the same size onto all frames. However, music is not structured this way. Very often, highly dynamic sequences with vocals and many instruments playing at the same time are interwoven with very simple melodic tracks. Therefore using a constant bit rate (CBR) is not always economical. The MP3 specification allows the data to be stored in a variable bit rate (VBR) format, which means that the audio frames are not all the same size. [1]
Each frame is perceptually analyzed using the psychoacoustic model. The frequencies that are not audible are discarded, or allocated a minimal number of bits. The exact inner workings of this procedure are complex and beyond the scope of this paper.
Once perceptual optimization is done, the data is compressed using Huffman coding. This is a lossless algorithm, so the audio information is preserved while the storage space decreases. [1] This is an important fact for a steganographer.
Because of the nature of the compression algorithm, Huffman coded data cannot be easily modified. Huffman coded data is stored using variable-length bit strings that are matched against a lookup table. The most frequently used characters are encoded with the shortest possible strings, while the rare ones are coded with longer strings. Thus it is possible that certain values have two- or three-bit codes. [4] Inverting a single bit can therefore completely change the value of the coded data.
Furthermore, the data cannot be easily divided into bytes, words, etc., so the least significant bit of a given byte may actually be the most significant bit of a Huffman coded character. Therefore least significant bit substitution cannot be easily done on Huffman coded data.
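The fragility described above can be illustrated with a toy prefix code. The code table below is hypothetical and far smaller than the Huffman tables an mp3 encoder uses; it is only meant to show how inverting one bit can change both the decoded values and the number of symbols recovered.

```python
# Hypothetical prefix code, for illustration only (not an MP3 Huffman table).
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}
DECODE = {bits: sym for sym, bits in CODE.items()}


def encode(text: str) -> str:
    """Concatenate the variable-length code of every symbol."""
    return "".join(CODE[ch] for ch in text)


def decode(bits: str) -> str:
    """Read bits one at a time and emit a symbol whenever a code matches."""
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in DECODE:
            out.append(DECODE[buf])
            buf = ""
    return "".join(out)


def flip(bits: str, index: int) -> str:
    """Invert the bit at the given position."""
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]


original = encode("abacad")   # 11 bits for 6 symbols
damaged = flip(original, 1)   # invert a single bit

print(decode(original))  # abacad
print(decode(damaged))   # aaaacad: one symbol split into two
```

Because code boundaries are implied by the bit values themselves, the flipped bit does not merely alter one symbol; it shifts where the decoder believes the next symbol begins.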
The compressed audio data is then reassembled. Each frame is prepended with a header which stores information about the bit rate, sample rate and other meta-data. [1]
MP3 files are therefore composed of short data frames, each preceded by a header. An MP3 file can also contain meta-data tags, of which there are two types. ID3v1 is the older format, which is appended to the end of the file. This tag is always 128 bytes long and contains seven fields which specify the artist name, song title, album, genre, etc. Because of its static size and lack of flexibility, this tag type is slowly being replaced by the more advanced ID3v2 standard. [6]
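As a minimal sketch of how a program might detect and read such a tag, the function below checks the last 128 bytes of a file for the `TAG` marker and unpacks the fixed-width fields. The field widths follow the published ID3v1 layout (marker, title, artist, album, year, comment, genre); the function name and the returned dictionary keys are illustrative choices, not part of the project described here.

```python
import struct

ID3V1_SIZE = 128
# 3-byte marker, three 30-byte strings, 4-byte year, 30-byte comment, 1-byte genre
ID3V1_LAYOUT = "<3s30s30s30s4s30sB"


def parse_id3v1(data: bytes):
    """Return the ID3v1 fields found at the end of data, or None if absent."""
    if len(data) < ID3V1_SIZE:
        return None
    tag = data[-ID3V1_SIZE:]
    if tag[:3] != b"TAG":
        return None
    _, title, artist, album, year, comment, genre = struct.unpack(ID3V1_LAYOUT, tag)

    def text(raw: bytes) -> str:
        # Fields are padded with NUL bytes (or spaces) up to their fixed width.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "comment": text(comment),
        "genre": genre,
    }
```

A tool that modifies the audio frames can use a check like this to know whether the final 128 bytes of the file must be left untouched.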
The newer, more flexible ID3v2 tags are prepended to the file. Their structure is almost as flexible as the structure of the mp3 file itself. ID3v2 tags are composed of their own frames, which store various pieces of information. This might be standard character strings such as the artist name and song title, or more advanced information about the way the file was encoded. ID3v2 tags can be used to provide useful hints to the decoder. As an example, equalization curves are often stored in ID3v2 tags. ID3v2 tags have no fixed size, so they can grow very large. [5]
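Since an ID3v2 tag sits in front of the audio data and has no fixed length, a program has to read the tag header to learn where the first frame can start. The sketch below assumes the 10-byte header defined by the ID3v2 specification: the marker `ID3`, two version bytes, one flags byte, and a 4-byte "synchsafe" size in which only the low 7 bits of each byte are used. The function name is hypothetical.

```python
def id3v2_tag_length(data: bytes) -> int:
    """Return the total length in bytes of a leading ID3v2 tag, or 0 if none."""
    if len(data) < 10 or data[:3] != b"ID3":
        return 0
    size_bytes = data[6:10]
    # In a synchsafe integer the top bit of every byte must be clear.
    if any(b & 0x80 for b in size_bytes):
        return 0
    size = 0
    for b in size_bytes:
        size = (size << 7) | b
    # ID3v2.4 may append a 10-byte footer, signalled by bit 4 of the flags byte.
    footer = 10 if data[5] & 0x10 else 0
    return 10 + size + footer


# Example: a header declaring 257 bytes of tag content.
header = b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1])
print(id3v2_tag_length(header + b"\x00" * 257))  # 267
```

The returned value is the offset at which a search for the first audio frame can begin.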
MP3 files in circulation can include either tag type. There is no clear preference, so a steganographer has to be prepared to deal with information tags present either before or after the audio data stream. However, it is logical to assume that ID3v1 tags will become increasingly rare in the future. Figure 2 below shows a conceptual model of an mp3 file:

Figure 2
Due to their extensibility, the ID3v2 tags would be an interesting target for embedding information; however, they are not guaranteed to be present in every mp3 file. Thus the best approach is to embed the data into the data frames. Before discussing steganographic methodology, however, it is best to take a closer look at the data frame.
As mentioned before, MP3 files can be encoded with a variable bit rate (VBR), which makes the frames vary in size. Since the frame sizes are not obvious, it is necessary to be able to identify where a frame starts and where it ends. This is not as difficult as it would first appear. Each frame is prepended with a frame header. All headers are very similar in structure and content; in fact, they will often be identical. Thus, identifying an mp3 header is just a matter of pattern matching.
Each header starts with a 12-bit block called the Sync block (see Figure 3). The Sync is a string of ones which is supposed to help the decoder home in on a header. Therefore, to find a frame one simply needs to detect 12 consecutive bits set to 1.

Figure 3
However, this pattern is not necessarily unique to a header. In fact, it can easily be found in any longer data block. There are a few other checks that can be performed to identify a 4-byte data block as a header:
A 4-byte block which starts with the Sync and does not violate the conditions listed above is probably a header. [7]
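The first step of this search, locating the Sync, might be sketched as follows. The function assumes that headers are byte-aligned and tests only the 12-bit Sync described above; it deliberately applies none of the additional checks, so every offset it yields is a candidate that still needs to be validated. The function name is hypothetical.

```python
def find_sync_candidates(data: bytes):
    """Yield offsets of 4-byte blocks that begin with 12 consecutive one bits."""
    # Stop 3 bytes early so that a full 4-byte block is always available.
    for i in range(len(data) - 3):
        if data[i] == 0xFF and (data[i + 1] & 0xF0) == 0xF0:
            yield i


# The first block is a plausible header; the second is a run of ones
# inside ordinary data that happens to match the Sync pattern as well.
sample = b"\xff\xfb\x90\x00" + b"\x01\xff\xff\xff\x02"
print(list(find_sync_candidates(sample)))  # [0, 5]
```

The second match in the sample shows why the Sync alone is not sufficient to identify a header.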
Figure 4 shows an alternative view of the mp3 header in which the fields are marked with characters. Table 1 provides brief explanations of each field.

Figure 4
Table 1

| Position | Purpose | Length (bits) |
|---|---|---|
| A | Frame sync | 11 |
| B | MPEG audio version (MPEG-1, 2, etc.) | 2 |
| C | MPEG layer (Layer I, II, III, etc.) | 2 |
| D | Protection (if on, then checksum follows header) | 1 |
| E | Bitrate index (lookup table used to specify bitrate for this MPEG version and layer) | 4 |
| F | Sampling rate frequency (44.1kHz, etc., determined by lookup table) | 2 |
| G | Padding bit (on or off, compensates for unfilled frames) | 1 |
| H | Private bit (on or off, allows for application-specific triggers) | 1 |
| I | Channel mode (stereo, joint stereo, dual channel, single channel) | 2 |
| J | Mode extension (used only with joint stereo, to conjoin channel data) | 2 |
| K | Copyright (on or off) | 1 |
| L | Original (off if copy of original, on if original) | 1 |
| M | Emphasis (respects emphasis bit in the original recording; now largely obsolete) | 2 |
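The field widths in Table 1 add up to 32 bits, so a 4-byte header can be split into its fields with a few shifts and masks. The following is a minimal sketch that takes the field order and widths directly from Table 1 and returns the raw integer value of each field; it does not translate the bitrate or sampling rate indices through their lookup tables. The field names and the function name are illustrative.

```python
# (field name, width in bits), in the order given by Table 1 (positions A to M)
HEADER_FIELDS = [
    ("sync", 11), ("version", 2), ("layer", 2), ("protection", 1),
    ("bitrate_index", 4), ("sampling_rate_index", 2), ("padding", 1),
    ("private", 1), ("channel_mode", 2), ("mode_extension", 2),
    ("copyright", 1), ("original", 1), ("emphasis", 2),
]


def parse_header(block: bytes) -> dict:
    """Split a 4-byte header into the raw field values listed in Table 1."""
    if len(block) != 4:
        raise ValueError("an mp3 frame header is exactly 4 bytes long")
    value = int.from_bytes(block, "big")
    remaining = 32
    fields = {}
    for name, width in HEADER_FIELDS:
        remaining -= width  # number of bits still to the right of this field
        fields[name] = (value >> remaining) & ((1 << width) - 1)
    return fields


header = parse_header(b"\xff\xfb\x90\x00")
print(header["bitrate_index"], header["padding"])  # 9 0
```

Working from a list of widths keeps the code in step with the table: changing a width in one place changes every shift that depends on it.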
Frame size is a function of bit-rate and sampling frequency. The size of a given frame in bytes can be obtained using the following equation:
Equation 1
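As a worked example of this relationship, the sketch below uses the commonly cited form for MPEG-1 Layer III frames: the integer part of 144 × bit rate / sampling rate, plus one byte when the padding bit is set. Treat this form as an assumption; it is the usual textbook expression and may differ in presentation from Equation 1. The function and parameter names are hypothetical.

```python
def frame_size_bytes(bitrate_bps: int, sample_rate_hz: int, padding: int = 0) -> int:
    """Approximate size in bytes of one MPEG-1 Layer III frame.

    Assumes the commonly cited formula 144 * bitrate / sample_rate + padding,
    with the bit rate in bits per second and the sampling rate in Hz.
    """
    if sample_rate_hz <= 0:
        raise ValueError("sample rate must be positive")
    return 144 * bitrate_bps // sample_rate_hz + padding


# A 128 kbit/s stream sampled at 44.1 kHz:
print(frame_size_bytes(128_000, 44_100))             # 417
print(frame_size_bytes(128_000, 44_100, padding=1))  # 418
```

Under this assumption, a constant bit rate file at 44.1 kHz alternates between two frame sizes that differ by the single padding byte, while in a VBR file the size has to be recomputed from every header.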
