A Digital Audio Primer
So it’s 1995, and the Digital Audio Revolution™ is all anyone can talk about. CDs, wave files, MP3s, sound cards, sampling – what the hell does it all mean?! And now it is all back in the spotlight, thanks to the WaveFile Ruby gem. Fortunately, your hip, tech-savvy friend Joel is here to help.
Audio 101: Sounds Are Waves
Let’s start at the beginning. Sound occurs when something moves. This movement causes the air around the object to compress and decompress in a wave. Your ears can detect these compression waves, and your brain interprets them as sound. Examples of things that can move and cause sound: vocal cords, guitar strings, exploding objects, or the speakers on your Hi-Fi system.
You can plot sound waves on a graph. Here is what the opening shout of Help! by The Beatles looks like:
The most basic sound is a sine wave:
A sine wave sounds like a beep. You might recognize it if you ever took a hearing test as a kid, or if you have a filthy mouth.
Waves have several important properties. The frequency is how often a wave pattern repeats. The higher the frequency, the higher the pitch.
A repeating wave pattern is called an oscillation or a cycle. Frequency is expressed in hertz, or cycles per second. If a cycle repeats 50 times in one second, it has a frequency of 50Hz. People can generally hear frequencies between about 20Hz and 20,000Hz. (As you get older, you gradually lose the ability to hear higher frequencies.)
The period of a wave is the amount of time it takes for one cycle to complete. If a wave has a frequency of 50Hz, then it has a period of 1/50th of a second.
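Frequency and period are reciprocals of each other. As a small sketch in Ruby (the method name `period_in_seconds` is made up for illustration):

```ruby
# Hypothetical helper: returns the period (seconds per cycle) of a wave
# with the given frequency (cycles per second, i.e. hertz).
def period_in_seconds(frequency_hz)
  1.0 / frequency_hz
end

puts period_in_seconds(50)    # 0.02, i.e. 1/50th of a second
puts period_in_seconds(440)   # a higher frequency gives a shorter period
```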
Try moving the slider to change the frequency of a sine wave. Notice how as the frequency goes up, each cycle happens more quickly, and the period goes down.
The amplitude of a wave is the distance on a graph between its maximum height and 0. Amplitude determines how loud a sound is perceived to be. The higher the amplitude, the louder the sound.
Try moving the slider to change the amplitude of a sine wave. Notice how as the amplitude goes up, the sound becomes louder, and the height of the wave increases.
All of the graphs above are normalized, so that the maximum possible amplitudes of sound from a source (such as a pair of speakers) are labeled 1.0 and -1.0. This convention is used in the audio world to simplify things. Of course, in the real world the exact loudness represented by an amplitude of 1.0 or -1.0 is relative.
Analog Waves vs. Digital Waves
So our humble sine wave is wafting through the air, and we want our computer to capture it and store it for later. To do this, we need to convert the sound wave from an analog form to a digital form. This process is called sampling, and is necessary because computers can only store data digitally.
Analog waves are continuous. Notice how the sine waves in the previous section are smooth. There are no gaps anywhere – if you were to zoom in, and keep zooming in into infinity, the line would always be smooth. If you have two points in the wave, there are an infinite number of points between them.
In contrast, a wave in digital form consists of a series of instantaneous “snapshots” of the amplitude of the wave over time. Each snapshot is called a sample. What’s cool is that a digital wave is just a list of numbers, so it can be easily stored and used on a computer, unlike an analog wave. What’s even cooler is that if you take the samples fast enough, the digital collection of samples is mathematically equivalent to the analog wave, and you can convert back and forth between the two. This means you can store your samples digitally on a computer for later use, and then convert them into an analog form when you want to play them on your speakers.
A Sample of Sampling
“Sampling a signal” means to record instantaneous amplitudes (i.e. y-values) at a regular time interval, and put them in a list.
This animation shows an example of sampling a sine wave. The amplitude of the wave is measured at a regular interval, and each measurement is placed in the list below. (The number of measurements taken per second is called the sample rate.)
Because the same amount of time elapses between each sample, the samples are all the same distance apart on the x-axis. (The x-axis denotes some arbitrary amount of time.)
This list of samples can be saved into a file format such as *.aiff (glossing over some technical details). When you play it back, your computer or stereo can convert the list of digital samples back into an analog wave so that it can be played on your speakers. Note that in real life, you would need many more samples to create a sound that lasts long enough to be heard.
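The sampling process described above can be sketched in a few lines of Ruby. This is only an illustration: the method name `sample_sine_wave` and its parameters are made up, and the "analog" wave is simulated with `Math.sin` rather than measured from a real signal.

```ruby
# Hypothetical helper: samples a sine wave at a regular time interval.
# frequency_hz   - frequency of the sine wave being sampled
# sample_rate_hz - number of samples taken per second
# sample_count   - how many samples to collect
def sample_sine_wave(frequency_hz, sample_rate_hz, sample_count)
  (0...sample_count).map do |i|
    time = i.to_f / sample_rate_hz   # seconds elapsed since the first sample
    Math.sin(2 * Math::PI * frequency_hz * time)
  end
end

# One cycle of a 1Hz sine wave, sampled 8 times per second.
samples = sample_sine_wave(1, 8, 8)
puts samples.map { |sample| sample.round(3) }.inspect
# prints [0.0, 0.707, 1.0, 0.707, 0.0, -0.707, -1.0, -0.707]
```

The result is just a list of numbers, which is the whole point: once the wave is in this form, it can be stored, copied, or modified like any other data.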
Here’s an example of how this all can work in the real world. Suppose you want to record yourself playing guitar, using a microphone plugged into your computer. As you strum, the microphone will detect the analog sound wave and convert it into an equivalent analog electrical signal (also a wave). This signal will be sent to your computer’s sound card, which will sample it many times a second to convert it to digital form. You can then save this digital audio as a Wave file, MP3, etc. When you use a program to play it back, the samples will be sent back to the sound card. It will convert the digital samples back into an analog electric signal and send it to your speakers, causing the speaker cones to move. The speaker movement will cause a sound wave to move through the air, which your ears will detect. You will frown at realizing you flubbed that chord.
An important property of digital audio is the sample rate. This represents how many times a wave is sampled per second, and is expressed in hertz. A common sample rate is 44,100Hz, which means a sample is taken every 1/44,100th of a second. Or put differently, it means that 44,100 samples are used to create 1 second of sound.
The sample rate determines how accurately the frequency of the underlying signal being sampled is captured. The higher the sample rate, the wider the range of frequencies that can be accurately represented.
Specifically, to accurately sample a frequency of x, you need to sample at a rate higher than 2x. If you want to sample a sound where the highest frequency is 5,000Hz, sampling at a rate above 10,000Hz will allow you to capture it accurately. Or, to say the same thing differently, if you sample at a rate of x hertz, you can accurately capture frequencies below 0.5x hertz. This is known as the Nyquist–Shannon sampling theorem. The frequency of half the sample rate is called the Nyquist frequency.
However, an important consideration is that the source audio should not contain frequencies above the Nyquist frequency. This is because a high frequency signal can end up corresponding to the exact same samples as a lower frequency signal. This is called aliasing. It causes frequencies above the Nyquist frequency to end up as artifacts in the final sampled audio. In effect, it causes the artificial addition of lower frequency signals not present in the original audio. To prevent this, you can either increase the sample rate to more than twice the highest frequency (which may or may not be practical), or filter these high frequencies out before sampling.
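To see aliasing in action, here is a small sketch (the helper name `sample_cosine_wave` is made up for this example). It samples a 3,000Hz wave and a 7,000Hz wave at 10,000Hz, where the Nyquist frequency is 5,000Hz. A cosine is used instead of a sine only because it makes the two sample lists come out identical, rather than inverted copies of each other.

```ruby
# Hypothetical helper: samples a cosine wave of the given frequency.
def sample_cosine_wave(frequency_hz, sample_rate_hz, sample_count)
  (0...sample_count).map do |i|
    Math.cos(2 * Math::PI * frequency_hz * i / sample_rate_hz)
  end
end

sample_rate = 10_000   # Nyquist frequency is 5,000Hz

below_nyquist = sample_cosine_wave(3_000, sample_rate, 20)
above_nyquist = sample_cosine_wave(7_000, sample_rate, 20)

# Compare the two lists sample by sample. They match, apart from tiny
# floating point rounding errors.
largest_difference = below_nyquist.zip(above_nyquist).map { |a, b| (a - b).abs }.max
puts largest_difference < 1e-9   # prints true
```

Once sampled, nothing in the list of numbers can tell you whether the original wave was 3,000Hz or 7,000Hz. On playback, it will come out as 3,000Hz.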
Since CD audio is sampled at 44,100Hz, CDs can accurately reproduce frequencies up to 22,050Hz. Humans can hear frequencies up to around 20,000Hz, so this means CDs can essentially capture the range of human hearing. (As long as frequencies higher than 22,050Hz are filtered out before sampling to prevent aliasing distortion). Historically, telephone signals used an 8,000Hz sample rate. This allowed capturing most, but not all, of the frequencies of human speech. This meant you could understand someone talking, but it sounded a little muffled.
Bits Per Sample / Sample Depth
While the sample rate determines how accurately we can represent analog frequencies with digital audio, the sample depth determines how accurately we can capture amplitudes. Or put differently, whereas the sample rate determines resolution on the x-axis, the sample depth determines resolution on the y-axis. The sample depth determines the difference between the quietest sound we can capture, and the loudest sound. This difference is called the dynamic range.
Digital audio commonly represents each sample as an 8-, 16-, or 24-bit integer. 8-bit numbers can encode 256 different amplitudes, 16-bit numbers can encode 65,536, and 24-bit numbers can encode 16,777,216. Since an analog signal is continuous and has an infinite number of possible amplitudes, but 8/16/24-bit integers only allow a certain number of possible values, this means that the sampled signal in real world digital audio can’t 100% accurately capture the amplitude of a signal. The process of converting a sampled analog amplitude to one of the possible integer values is called quantization. The number of bits used to encode each sample is called the bits per sample.
Try choosing different values for the bits per sample. Listen to what these different sample depths sound like, and notice how closely (or not closely) the digital samples match the original analog wave.
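Here is a rough sketch of quantization in Ruby. The `quantize` method is hypothetical, and it simplifies things by assuming signed integers scaled symmetrically around 0. Real formats differ in the details (8-bit Wave files, for example, store unsigned values).

```ruby
# Hypothetical helper: maps an amplitude between -1.0 and 1.0 to the
# nearest signed integer that fits in the given number of bits.
def quantize(amplitude, bits_per_sample)
  max_value = (2**(bits_per_sample - 1)) - 1   # e.g. 32,767 for 16 bits
  (amplitude.clamp(-1.0, 1.0) * max_value).round
end

amplitude = 0.123456789

[4, 8, 16].each do |bits|
  max_value = (2**(bits - 1)) - 1
  quantized = quantize(amplitude, bits)
  recovered = quantized.to_f / max_value   # what you get back on playback
  puts "#{bits} bits: #{2**bits} possible values, stored as #{quantized}, plays back as #{recovered.round(6)}"
end
```

The fewer bits available, the further the played back amplitude drifts from the original 0.123456789. That gap is the quantization error.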
Notice how the original analog signal changes in amplitude over time. The higher the bits per sample, the higher the dynamic range, and thus the more accurately the fade in/out is captured in the digital signal. At 16 bits per sample, you can hear the fade in/out clearly. At 8 bits per sample, if you listen closely with headphones the quieter part of the signal is a little distorted. At 4 bits per sample the sound is very distorted due to there not being many “buckets” available for the amplitude to be mapped to. At 1 bit per sample, the fade in/out disappears completely!
However, notice that the tone still sounds like the same pitch at each sample depth. This is because the sample depth doesn’t affect how accurately frequencies are captured (it only affects amplitudes). If you record a piano playing the A above middle C (440Hz) with a high sample rate but low bits per sample, it might sound noisy and distorted, but it will still sound like an A note.
Although digital audio commonly uses integer samples, it’s also possible to use floating point numbers (normally in the range of -1.0 to 1.0). Floating point numbers can provide an even larger number of amplitude value “buckets” than 24-bit integers, and thus capture a wider dynamic range.
Number of Channels
When audio consists of a single sound wave, it is said to have 1 channel, or to be monophonic. Audio that has 2 channels is called stereo, and consists of two separate sound waves that are played at the same time. One sound wave is sent to a left speaker, and the other to a right speaker. This allows an immersive effect, most noticeable when listening with headphones. For example, one instrument can be played in your left ear and another in your right ear. Or, a drum roll can sweep from left to right and back again. Audio CDs are stereo.
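In code, stereo audio is often represented by pairing up the left and right samples that should be played at the same moment. The sketch below uses made-up sample values and plain Ruby arrays. Formats such as Wave typically store stereo samples in this interleaved left, right, left, right order, though how a given library expects them to be arranged is something to check in its documentation.

```ruby
# Two hypothetical lists of samples, one per speaker.
left_channel  = [0.1, 0.2, 0.3, 0.4]
right_channel = [-0.1, -0.2, -0.3, -0.4]

# Pair up the samples that are played at the same instant.
stereo_frames = left_channel.zip(right_channel)
puts stereo_frames.inspect
# prints [[0.1, -0.1], [0.2, -0.2], [0.3, -0.3], [0.4, -0.4]]

# Flattening the pairs gives the interleaved order: left, right, left, right...
interleaved = stereo_frames.flatten
puts interleaved.inspect
# prints [0.1, -0.1, 0.2, -0.2, 0.3, -0.3, 0.4, -0.4]
```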
Audio with 3 or more channels is less common, but allows for an even more immersive experience. Surround sound, which generally uses 6 channels, allows for sound to come from in front of and behind you, in addition to just left and right.
Creating Your Own Digital Audio
You probably want to write programs to create your own sounds. (Why wouldn’t you?) To do so, your program needs to generate a list of samples, and then send it to your sound card for playback. Sounds simple, but the details can be more complex. First, you need to know how to generate the right samples so it actually sounds like something. Second, you need to know how to actually send the samples to your sound card.
You can work around the second problem by saving your samples to a sound format such as MP3 or Wave, and letting another program handle the playback for you. The WaveFile Ruby gem makes it easy to create Wave files using Ruby. The example below shows a simple program outline you can use.
require 'wavefile'
include WaveFile

# Should return an array of numbers between -1.0 and 1.0
def generate_sample_data
  amplitude = 0.3

  # Create square wave cycle, which alternates between the same
  # positive and negative amplitude. Since the cycle is 100 samples
  # long, if repeated it will have a frequency of 441Hz when the
  # sample rate is 44,100Hz.
  one_square_wave_cycle = ([amplitude] * 50) + ([-amplitude] * 50)

  # Return square wave cycle repeated 100 times, about a quarter
  # of a second at a 44,100Hz sample rate.
  one_square_wave_cycle * 100
end

samples = generate_sample_data

Writer.new("mysound.wav", Format.new(:mono, :pcm_16, 44100)) do |writer|
  buffer = Buffer.new(samples, Format.new(:mono, :float, 44100))
  writer.write(buffer)
end
To learn more about different ways of creating the samples for different pitches, amplitudes, and wave forms using Ruby, check out this article about NanoSynth.
Compact Discs, Wave Files, and MP3s
Throughout this article we’ve talked about different ways of storing digital audio, such as CDs, Wave files, and MP3s. Let’s look at these a bit more in depth.
First, audio CDs. The surface of a CD contains millions of tiny pits. A CD player uses a laser to read the pattern of pits, and converts it into a series of 1s and 0s. CDs use 16 bits per sample, so every 16 0s or 1s represents one sample. CDs store stereo data, so separate samples are stored for the left and right speakers. Since CDs use a sample rate of 44,100Hz, every second of sound requires 88,200 samples. (44,100 samples for the sound in the left speaker, and 44,100 samples for the right speaker.) When you play a CD, a stream of samples is read from the CD surface and sent to a digital-to-analog converter (DAC) inside the CD player. The resulting analog signal is sent to your speakers or headphones, which convert the signal to a sound wave via movement of the speaker cone.
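The numbers above make it easy to work out how much raw sample data CD audio involves. The variable names below are just for illustration, and the arithmetic only counts the audio samples themselves, not the extra data a real CD stores for things like error correction.

```ruby
sample_rate     = 44_100   # samples per second, per channel
channels        = 2        # stereo
bits_per_sample = 16

samples_per_second = sample_rate * channels                 # 88,200
bits_per_second    = samples_per_second * bits_per_sample   # 1,411,200
bytes_per_second   = bits_per_second / 8                    # 176,400
bytes_per_minute   = bytes_per_second * 60                  # 10,584,000

puts "One minute of CD audio is about #{(bytes_per_minute / 1_000_000.0).round(1)} million bytes"
# prints: One minute of CD audio is about 10.6 million bytes
```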
Wave (.wav) and AIFF (.aiff or .aif) files on your computer are conceptually similar. Like a CD, a Wave/AIFF file mostly consists of a long stream of raw samples. (They can use compression, but I’m not sure how common this is). The beginning includes a small header that indicates the bits per sample, sample rate, etc., but the rest is mostly just samples. When you play a Wave/AIFF file, your computer sends this list of samples to your sound card, which converts it to an analog signal and sends it to your speakers. The documentation for the WaveFile gem has more info on the Wave file format.
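To make the idea of a header more concrete, here is a sketch that builds and then reads back the header of a very simple Wave file using only Ruby's built-in `Array#pack` and `String#unpack`. The method names are made up, and the sketch assumes the simplest common layout: uncompressed PCM with a 44-byte header and no optional chunks. Real files can contain extra chunks, so a proper reader (such as the WaveFile gem) has to do more work than this.

```ruby
# Layout of the 13 header fields: 4-character tags ("a4"), 32-bit
# little-endian integers ("V"), and 16-bit little-endian integers ("v").
WAVE_HEADER_LAYOUT = "a4Va4a4VvvVVvva4V"

# Hypothetical helper: builds a minimal 44-byte PCM Wave header.
def build_wave_header(channels, sample_rate, bits_per_sample, data_size)
  block_align = channels * bits_per_sample / 8   # bytes per sample frame
  byte_rate   = sample_rate * block_align        # bytes per second

  [
    "RIFF", 36 + data_size, "WAVE",
    "fmt ", 16,   # 16 bytes remain in the "fmt " chunk
    1,            # format code 1 means uncompressed PCM
    channels, sample_rate, byte_rate, block_align, bits_per_sample,
    "data", data_size
  ].pack(WAVE_HEADER_LAYOUT)
end

# Hypothetical helper: reads the interesting fields back out.
def parse_wave_header(header)
  fields = header.unpack(WAVE_HEADER_LAYOUT)
  {
    channels:        fields[6],
    sample_rate:     fields[7],
    bits_per_sample: fields[10],
    data_size:       fields[12]
  }
end

# Header for 10,000 mono 16-bit samples (20,000 bytes of sample data).
header = build_wave_header(1, 44_100, 16, 20_000)
puts header.bytesize   # 44
puts parse_wave_header(header).inspect
```

Everything after those 44 bytes would be the samples themselves, one after another.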
In an MP3 file, the list of samples is compacted to take up less space. The compaction process takes into account the peculiarities of the way our brains hear sound. This allows it to throw away some sample data, most of which we wouldn’t actually be able to hear. The end result is a file which is much smaller than a Wave file, but with similar (though not quite as good) sound quality. Handy if you are downloading songs from the Internet.
Well, that should cover the basics. You should now have some basic background information to help you write programs that work with sound, as well as sound smart at parties.
- Sound waves can travel through any physical medium, not just air. For example water, or solid objects like a desk or wall.
- "Sample" is an overloaded term. It can refer to either an instantaneous point in a wave, like we have used it in this article, or in other contexts it can refer to a short snippet of sound, often taken from a pre-existing song.
- In the real world, sampling never results in a digital wave that is a perfect match with the original analog wave, for a variety of reasons. We'll cover a few later in this article.