
A Digital Audio Primer

So it’s 1995, and the Digital Audio Revolution™ is all anyone can talk about. CDs, wave files, MP3s, sound cards, sampling – what the hell does it all mean?! And now it is all back in the spotlight, thanks to the WaveFile Ruby gem. Fortunately, your hip, tech-savvy friend Joel is here to help.

Audio 101: Sounds Are Waves

Let’s start at the beginning. According to the textbooks, sound occurs when something moves. This movement causes the air around the object to compress and decompress in a wave[1]. Your ear can detect this compression wave, and your brain interprets it as sound. Examples of things that can move and cause sound: vocal cords, guitar strings, exploding objects, or the speakers on your Hi-Fi system.

You can plot sound waves on a graph. Here is what the opening shout of Help! by The Beatles looks like:

Help! Wave Form

About the simplest sound is a sine wave:

Sine Wave

A sine wave sounds like a beep. You might recognize it if you ever took a hearing test as a kid, or if you have a filthy mouth.

As you might also remember from your schooling, waves have three particularly important properties – frequency, period, and amplitude. The frequency is how often a wave pattern repeats. The higher the frequency, the higher the pitch. In the two graphs below, the sound wave on the left has a lower frequency, and will have a lower pitch, than the wave on the right.

High vs. Low Frequency

A repeating wave pattern is called an oscillation or a cycle. Frequency is generally expressed in hertz, or cycles per second. If a cycle repeats 50 times in one second, it has a frequency of 50Hz. People can generally hear frequencies between about 20Hz and 20,000Hz. (As you get older, you gradually lose the ability to hear higher frequencies.)

The period of a wave is the amount of time it takes for one cycle to complete. If a wave has a frequency of 50Hz, then it has a period of 1/50th of a second.
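Frequency and period are just two views of the same thing: each is 1 divided by the other. Here is a minimal Ruby sketch of that relationship. The method names period_for and frequency_for are made up for this example.

```ruby
# Period (in seconds) of a wave with the given frequency (in hertz).
def period_for(frequency_hz)
  1.0 / frequency_hz
end

# Frequency (in hertz) of a wave with the given period (in seconds).
def frequency_for(period_seconds)
  1.0 / period_seconds
end

puts period_for(50)       # 0.02, i.e. 1/50th of a second
puts frequency_for(0.02)  # 50.0
```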

The amplitude of a wave is the distance on a graph between its maximum height and 0. Amplitude determines how loud a sound is perceived to be. The higher the amplitude, the louder the sound. In the two graphs below, the sound wave on the right has a higher amplitude than the wave on the left, and will sound louder.

High vs. Low Amplitude

All of the graphs above are normalized, so that the maximum possible amplitudes of sound from a source (such as a pair of speakers) are labeled 1.0 and -1.0. This convention is used in the audio world to simplify things. Of course, in the real world the exact loudness represented by an amplitude of 1.0 or -1.0 is relative.

A Sample of Sampling

So our humble sine wave is wafting through the air, and we want our computer to capture it and store it for later. To do this, we need to convert the sound wave from an analog form to a digital form. This process is called sampling, and is necessary because computers can only store data digitally.

Analog waves are continuous. Notice how the sine waves in the previous section are smooth. There are no gaps anywhere – if you were to zoom in, and keep zooming in, the line would always be smooth. If you have two points in the wave, there are an infinite number of points between them.

In contrast, a wave in digital form is a finite list of discrete points. Each point is called a sample[2]. It is mathematically possible to perfectly recreate an analog wave from an equivalent digital wave, as long as the wave was sampled often enough (see the Sample Rate section below), so you can convert between the two forms without losing information. (In real life though, this is not quite true for practical reasons, such as quantization. This is explained in the Bits Per Sample section below.)

To sample a wave (i.e., convert it from analog to digital), one collects instantaneous amplitudes (i.e. y-values) over a regular time interval (i.e. x-values), and puts them in a list. The graph below shows our sine wave, but now with circles indicating each sample we will take. Notice how they are all the same distance apart on the x-axis, indicating the same amount of time is elapsing between each sample. The numbers on the x-axis denote some arbitrary unit of time.

Sine Wave With Samples

Putting these samples into a list, we get this:

{0.0, 0.000} {5.0, -0.959} {10.0, -0.544} {15.0, 0.650} {20.0, 0.913}
{1.0, 0.841} {6.0, -0.279} {11.0, -1.000} {16.0, -0.288} {21.0, 0.837}
{2.0, 0.909} {7.0, 0.657} {12.0, -0.537} {17.0, -0.961} {22.0, -0.009}
{3.0, 0.141} {8.0, 0.989} {13.0, 0.420} {18.0, -0.751} {23.0, -0.846}
{4.0, -0.757} {9.0, 0.412} {14.0, 0.991} {19.0, 0.150} {24.0, -0.906}

We can simplify this. Since all samples occur at a regular interval on the x-axis, we can infer a sample’s x-value by its position in the list. For example, the first sample has an x-value of 0.0, the second sample has an x-value of 1.0, the third sample has an x-value of 2.0, etc. Therefore, keeping track of each x-value is redundant and we can remove them. We are then left with just the y-values:

0.000 -0.959 -0.544 0.650 0.913
0.841 -0.279 -1.000 -0.288 0.837
0.909 0.657 -0.537 -0.961 -0.009
0.141 0.989 0.420 -0.751 -0.846
-0.757 0.412 0.991 0.150 -0.906

We have now sampled the wave. This list of samples can be saved into a file format such as Wave or MP3 for later playback (glossing over some technical details though). Note that in real life, you would need many more samples to create a sound that lasts long enough to be heard.
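If you want to try this yourself, here is a rough Ruby sketch of the sampling step. It assumes the wave in the graph is plain sin(x) with x in radians, which is what the listed values suggest. The method name sample_wave is invented for this example.

```ruby
# Sample an analog wave (here, Math.sin) at a regular interval.
# Only the y-values are returned; each x-value is implied by the
# sample's position in the list.
def sample_wave(sample_count, interval = 1.0)
  (0...sample_count).map { |i| Math.sin(i * interval) }
end

samples = sample_wave(25)

# Print each sample rounded to 3 decimal places, in x order (0, 1, 2, ...)
puts samples.map { |sample| format("%.3f", sample) }.join(" ")
```

Passing a smaller interval takes samples closer together, which is the same idea as raising the sample rate.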

When playing back sound on a computer, the digital samples are sent back to your sound card, which converts the samples back to an analog electrical signal that can be sent to your speakers.

Let’s look at an end-to-end example of how this all can work in the real world. Suppose you want to record yourself playing guitar, and have a microphone plugged into your computer for this purpose. As you strum, the microphone will detect the analog sound wave and convert it into an equivalent analog electrical signal (also a wave). This signal will be sent to your computer’s sound card, which will sample it many times a second to convert it to digital form. You can then save this digital audio as a Wave file, MP3, etc. When you use a program to play it back, the samples will be sent back to the sound card. It will convert the digital samples back into an analog electric signal and send it to your speakers, causing the speaker cones to move. The speaker movement will cause a sound wave to move through the air, which your ears will detect. You will frown at realizing you flubbed that chord.

Creating Your Own Digital Audio

You probably want to write programs to create your own sounds. (Why wouldn’t you?) To do so, your program needs to generate a list of samples, and then send it to your sound card for playback. Sounds simple, but the details can be more complex. First, you need to know how to generate the right samples so it actually sounds like something. Second, you need to know how to actually send the samples to your sound card.

For an example of how to generate the right samples, check out this post about NanoSynth, which describes how to create basic waves such as sine waves, square waves, etc. You can work around the second problem by saving your samples to a sound format such as MP3 or Wave, and letting iTunes/Winamp/whatever handle the playback for you. The WaveFile Ruby gem makes it easy to create Wave files using Ruby. The example below shows a simple program outline you can use as of v0.6.0. You would just need to implement the generate_sample_data() function.

require 'wavefile'
include WaveFile

# Should return an array of numbers between -1.0 and 1.0
def generate_sample_data
  # FILL ME IN
end

samples = generate_sample_data

Writer.new("mysound.wav", Format.new(:mono, :pcm_16, 44100)) do |writer|
  buffer = Buffer.new(samples, Format.new(:mono, :float, 44100))
  writer.write(buffer)
end
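As a starting point, here is one possible way to fill in generate_sample_data() for the outline above. It produces one second of a sine wave. The pitch, volume, and duration constants are arbitrary choices for this sketch, and the sample rate is assumed to match the 44100 used in the outline.

```ruby
SAMPLE_RATE = 44_100     # samples per second; assumed to match the Format in the outline
FREQUENCY = 440.0        # hertz; an arbitrary pitch chosen for this sketch
MAX_AMPLITUDE = 0.5      # between 0.0 and 1.0; an arbitrary volume
DURATION_SECONDS = 1     # length of the generated sound

# Returns an array of numbers between -1.0 and 1.0
def generate_sample_data
  total_samples = SAMPLE_RATE * DURATION_SECONDS

  (0...total_samples).map do |i|
    time = i.to_f / SAMPLE_RATE   # time (in seconds) at which this sample occurs
    Math.sin(2 * Math::PI * FREQUENCY * time) * MAX_AMPLITUDE
  end
end
```

Changing FREQUENCY changes the pitch, and changing MAX_AMPLITUDE changes the volume, just like in the graphs earlier.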

Bits Per Sample

In the code example above you might have noticed some terms we haven’t discussed yet. First, bits per sample. So far we have treated samples as real numbers between -1.0 and 1.0, but digital mediums (CDs, Wave files, etc.) usually store samples as integers. For example, audio CDs store samples as integers between -32,768 and 32,767 (equivalent to -1.0 and 1.0). Mapping samples from real numbers to a finite set of integers is called quantization. It causes information to be lost, so quantized data can’t be used to perfectly re-create the original analog signal. Such is life.

Bits per sample indicates the number of bits used for each sample’s binary integer representation. If you use n bits, then each sample can be one of 2^n values. For example, if each sample is 1 bit, there are only two possible sample values (0 or 1). If each sample is 2 bits, then 2^2 (or 4) sample values are possible (00, 01, 10, or 11). At 3 bits, there are 2^3 (or 8) possible values, and so on. Most digital audio uses either 8, 16, 20, 24 or 32 bits per sample. Standard audio CDs use 16 bits, so they allow for 65,536 (2^16) different sample values.
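Here is a small Ruby sketch of both ideas: counting the possible values for a bit depth, and quantizing a sample. The method names are made up for this example. Scaling by the largest positive integer is just one common convention (assumed here), so -1.0 maps to -32,767 rather than -32,768 at 16 bits.

```ruby
# Number of distinct sample values available at the given bit depth.
def possible_values(bits_per_sample)
  2**bits_per_sample
end

# Map a sample between -1.0 and 1.0 to a signed integer of the given bit depth.
# Assumption: scale by the largest positive value, then round to the nearest integer.
def quantize(sample, bits_per_sample)
  max_value = 2**(bits_per_sample - 1) - 1
  (sample.clamp(-1.0, 1.0) * max_value).round
end

puts possible_values(16)   # 65536
puts quantize(1.0, 16)     # 32767
puts quantize(0.841, 16)   # 27557
```

The rounding step is where information gets lost: many slightly different real numbers end up as the same integer.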

Generally, the more bits per sample, the higher the sound quality. The graphs below illustrate the idea. The top-most graph shows part of an analog wave. The four graphs below show a digital representation of the wave at 1, 2, 3, and 4 bits per sample. Notice that as the bits per sample increases, the digital wave begins to resemble the original analog wave more closely.

Bits Per Sample Example

Although digital audio with more bits per sample is more accurate, the trade-off is that more space is required to store it. However, once you reach a certain point, you won’t notice any improvement as you add more bits. For example, 16-bit CD audio sounds just as good as 24-bit or 32-bit audio to most people. (Don’t tell this to an audiophile).

Sample Rate

Another important property of digital audio is the sample rate. This is how many times a wave is sampled per second, and therefore also the number of samples that make up one second of sound. It is expressed in hertz. The standard sample rate for CDs is 44,100Hz, meaning that 44,100 samples are used to create 1 second of sound.

Along with bits per sample, the sample rate helps determine the sound quality of digital audio. Whereas increasing the bits per sample increases resolution on the y-axis, increasing the sample rate increases resolution on the x-axis. The higher the sample rate, the more frequencies a digital wave can accurately represent. At a low sample rate, high frequencies will be distorted.

So how did the CD dudes come up with 44,100Hz? Well, with this sample rate you can accurately recreate frequencies up to 22,050Hz when converting from digital to analog. I mentioned before that humans can hear frequencies up to around 20,000Hz. This means that CDs can accurately produce the frequencies that people can hear, without wasting space on inaudible sound.

You might have noticed that 22,050 is half of 44,100. This isn’t a coincidence. According to the Nyquist–Shannon sampling theorem, if you sample at a rate of x hertz, you can accurately recreate analog frequencies up to (strictly speaking, just under) 0.5x hertz. Or put differently, to reproduce a frequency of x with digital audio, you need to sample at a rate of more than 2x. Since CD audio is sampled at 44,100Hz, CDs can therefore accurately reproduce frequencies up to 22,050Hz.
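To see what "distorted" means in practice, here is a rough Ruby sketch. It samples a 30,000Hz sine wave (above the limit for CD audio) and a 14,100Hz sine wave at the same rate, and checks whether the two lists of samples can be told apart. The method names and the two test frequencies are choices made for this example.

```ruby
# Highest frequency (in hertz) that a given sample rate can represent.
def nyquist_limit(sample_rate)
  sample_rate / 2.0
end

# Sample a sine wave of the given frequency at the given sample rate.
def sample_sine(frequency, sample_rate, sample_count)
  (0...sample_count).map do |i|
    Math.sin(2 * Math::PI * frequency * i / sample_rate)
  end
end

puts nyquist_limit(44_100)   # 22050.0

too_high = sample_sine(30_000, 44_100, 10)
lower    = sample_sine(14_100, 44_100, 10)

# Each 30,000Hz sample is the negative of the matching 14,100Hz sample
# (allowing for tiny floating point error), so the two lists describe
# the same wave flipped upside down.
puts too_high.zip(lower).all? { |a, b| (a + b).abs < 1e-9 }   # true
```

In other words, once sampled, the 30,000Hz wave is indistinguishable from a 14,100Hz wave. The high frequency isn’t just lost, it comes back as a different, lower one.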

Number of Channels

When audio consists of a single sound wave, it is said to have 1 channel, or to be monophonic. Audio that has 2 channels is called stereo, and consists of two separate sound waves that are played at the same time. One sound wave is sent to a left speaker, and the other to a right speaker. This allows an immersive effect, most noticeable when listening with headphones. For example, one instrument can be played in your left ear and another in your right ear. Or, a drum roll can sweep from left to right and back again. Audio CDs are stereo.
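In code, one simple way to represent stereo audio is a list of [left, right] pairs, one pair per moment in time. The Ruby sketch below assumes that layout (check your library’s docs for the layout it actually expects). The method names and the straight-line pan are inventions for this example, not a standard technique you must use.

```ruby
# Combine separate left and right sample lists into [left, right] pairs.
def to_stereo(left_samples, right_samples)
  left_samples.zip(right_samples)
end

# Gradually move a mono sound from the left speaker to the right speaker,
# using a simple straight-line fade between the two.
def sweep_left_to_right(mono_samples)
  last_index = [mono_samples.length - 1, 1].max

  mono_samples.each_with_index.map do |sample, i|
    position = i.to_f / last_index   # 0.0 = fully left, 1.0 = fully right
    [sample * (1.0 - position), sample * position]
  end
end

p to_stereo([0.1, 0.2], [0.3, 0.4])        # [[0.1, 0.3], [0.2, 0.4]]
p sweep_left_to_right([1.0, 1.0, 1.0])     # [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
```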

Audio with 3 or more channels is less common, but allows for an even more immersive experience. Surround sound, which generally uses 6 channels, allows for sound to come from in front of and behind you, in addition to just left and right. Many modern movie theaters are capable of playing surround sound audio.

Compact Discs, Wave Files, and MP3s

Throughout this article we’ve talked about different ways of storing digital audio, such as CDs, Wave files, and MP3s. Let’s look at these a bit more in depth.

First, audio CDs. The surface of a CD contains millions of tiny pits. A CD player uses a laser to read the pattern of pits, and converts it into a series of 1s and 0s. CDs use 16 bits per sample, so every 16 0s or 1s represents one sample. CDs store stereo data, so separate samples are stored for the left and right speakers. Since CDs use a sample rate of 44,100Hz, every second of sound requires 88,200 samples. (44,100 samples for the sound in the left speaker, and 44,100 samples for the right speaker.) When you play a CD, a stream of samples is read from the CD surface and sent to a digital-to-analog converter (DAC) inside the CD player. The resulting analog signal is sent to your speakers or headphones, which convert the signal to a sound wave via movement of the speaker cone.
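The arithmetic above is easy to check in Ruby. This sketch only counts raw sample data, and ignores anything else a real disc stores alongside it. The method names are made up for this example.

```ruby
# Total samples needed for one second of sound, across all channels.
def samples_per_second(sample_rate, channels)
  sample_rate * channels
end

# Bytes of raw sample data needed for one second of sound.
def bytes_per_second(sample_rate, channels, bits_per_sample)
  samples_per_second(sample_rate, channels) * bits_per_sample / 8
end

puts samples_per_second(44_100, 2)        # 88200
puts bytes_per_second(44_100, 2, 16)      # 176400
puts bytes_per_second(44_100, 2, 16) * 60 # 10584000, or about 10MB per minute
```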

Wave files (*.wav) on your computer are conceptually similar. Like a CD, a Wave file mostly consists of a long stream of raw samples. (They can use compression, but I’m not sure how common this is.) The beginning includes a small header that indicates the bits per sample, sample rate, etc., but the rest is mostly just samples. When you play a Wave file, your computer sends this list of samples to your sound card, which converts it to an analog signal and sends it to your speakers.
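To make the "small header, then samples" idea concrete, here is a rough Ruby sketch that builds the simplest common kind of Wave data in memory (a 44-byte header followed by 16-bit integer samples) and reads the header back. It uses only Ruby’s built-in pack and unpack. The method names are made up for this example. Real Wave files can contain extra sections and other sample formats, so treat this as an illustration rather than a general-purpose reader or writer.

```ruby
# Field layout of the simple 44-byte header assumed here (little-endian):
# "RIFF", file size - 8, "WAVE", "fmt ", 16, format code, channels,
# sample rate, byte rate, block align, bits per sample, "data", data size
HEADER_LAYOUT = "a4Va4a4VvvVVvva4V"

# Build Wave data from a list of 16-bit integer samples.
def build_wave(samples, sample_rate, channels = 1)
  bits_per_sample = 16
  block_align = channels * bits_per_sample / 8   # bytes per sample, all channels
  data = samples.pack("s<*")                     # 16-bit signed, little-endian

  header = [
    "RIFF", 36 + data.bytesize, "WAVE",
    "fmt ", 16, 1, channels, sample_rate,        # format code 1 = uncompressed
    sample_rate * block_align, block_align, bits_per_sample,
    "data", data.bytesize
  ].pack(HEADER_LAYOUT)

  header + data
end

# Read the interesting header fields back out of Wave data.
def read_header(bytes)
  fields = bytes.unpack(HEADER_LAYOUT)

  { channels: fields[6], sample_rate: fields[7],
    bits_per_sample: fields[10], data_bytes: fields[12] }
end

wave = build_wave([0, 1000, -1000], 44_100)
puts wave.bytesize   # 50 (44 bytes of header + 3 samples of 2 bytes each)
p read_header(wave)
```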

In an MP3 file, the list of samples is compacted to take up less space. The compaction process takes into account the peculiarities of the way our brains hear sound. This allows it to throw away some sample data, most of which we wouldn’t actually be able to hear. The end result is a file which is much smaller than a Wave file, but with similar (though not quite as good) sound quality. Handy if you are downloading songs from the Internet.

Conclusion

Well, that should cover the basics. You should now have some basic background information to help you write programs that work with sound, as well as sound smart at parties.

Footnotes

  1. Actually, sound waves can travel through any physical medium, not just air. For example, they can travel through water, or through solid objects like a desk or wall.
  2. You might already associate the terms "sample" and "sampling" with something like P. Diddy re-appropriating Police songs. "Sample" is an overloaded term. It can refer to either an instantaneous point in a wave, like we have used it in this article, or it can refer to a short snippet of sound, often taken from a pre-existing song.

Graphs created with Flot

Copyright © 2005–2017 Joel Strait.