
A Digital Audio Primer

So it’s 1995, and the Digital Audio Revolution™ is all anyone can talk about. CDs, wave files, MP3s, sound cards, sampling – what the hell does it all mean?! And now it is all back in the spotlight, thanks to the WaveFile Ruby gem. Fortunately, your hip, tech-savvy friend Joel is here to help.

Audio 101: Sounds Are Waves

Let’s start at the beginning. Sound occurs when something moves. This movement causes the air around the object to compress and decompress in a wave[1]. Your ears can detect these compression waves, and your brain interprets them as sound. Examples of things that can move and cause sound: vocal cords, guitar strings, exploding objects, or the speakers on your Hi-Fi system.

You can plot sound waves on a graph. Here is what the opening shout of Help! by The Beatles looks like:

[Graph: waveform of the opening shout of “Help!”, amplitude from -1.0 to 1.0 over time]

The most basic sound is a sine wave:

[Graph: a sine wave, amplitude from -1.0 to 1.0 over time]

A sine wave sounds like a beep. You might recognize it if you ever took a hearing test as a kid, or if you have a filthy mouth.

Waves have several important properties. The frequency is how often a wave pattern repeats. The higher the frequency, the higher the pitch.

A repeating wave pattern is called an oscillation or a cycle. Frequency is expressed in hertz, or cycles per second. If a cycle repeats 50 times in one second, it has a frequency of 50Hz. People can generally hear frequencies between about 20Hz and 20,000Hz. (As you get older, you gradually lose the ability to hear higher frequencies.)

The period of a wave is the amount of time it takes for one cycle to complete. If a wave has a frequency of 50Hz, then it has a period of 1/50th of a second.
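To put that another way in Ruby (the language we’ll use later in this article), frequency and period are simple reciprocals:

frequency = 50.0          # cycles per second (Hz)
period = 1.0 / frequency  # => 0.02, i.e. 1/50th of a second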

Try moving the slider to change the frequency of a sine wave. Notice how as the frequency goes up, each cycle happens more quickly, and the period goes down.

[Interactive graph: sine wave with a frequency slider, amplitude from -1.0 to 1.0 over time]

The amplitude of a wave is the distance on a graph between its maximum height and 0. Amplitude determines how loud a sound is perceived. The higher the amplitude, the louder the sound.

Try moving the slider to change the amplitude of a sine wave. Notice how as the amplitude goes up, the sound becomes louder, and the height of the wave increases.

[Interactive graph: sine wave with an amplitude slider, amplitude from -1.0 to 1.0 over time]

All of the graphs above are normalized, so that the maximum possible amplitudes of sound from a source (such as a pair of speakers) are labeled 1.0 and -1.0. This convention is used in the audio world to simplify things. Of course, in the real world the exact loudness represented by an amplitude of 1.0 or -1.0 is relative.

Analog Waves vs. Digital Waves

So our humble sine wave is wafting through the air, and we want our computer to capture it and store it for later. To do this, we need to convert the sound wave from an analog form to a digital form. This process is called sampling, and is necessary because computers can only store data digitally.

Analog waves are continuous. Notice how the sine waves in the previous section are smooth. There are no gaps anywhere – if you were to zoom in, and keep zooming in forever, the line would always be smooth. If you pick any two points in the wave, there are an infinite number of points between them.

In contrast, a wave in digital form consists of a series of instantaneous “snapshots” of the amplitude of the wave over time. Each snapshot is called a sample[2]. What’s cool is that a digital wave is just a list of numbers, so it can be easily stored and used on a computer, unlike an analog wave. What’s even cooler is that if you take the samples fast enough, the digital collection of samples is mathematically equivalent to the analog wave, and you can convert back and forth between the two[3]. This means you can store your samples digitally on a computer for later use, and then convert them into an analog form when you want to play them on your speakers.

A Sample of Sampling

To sample a wave (i.e., convert it from analog to digital), one collects instantaneous amplitudes (i.e. y-values) at regular time intervals (i.e. x-values), and puts them in a list.

This animation shows an example of sampling a sine wave. The amplitude of the wave is measured at a regular interval (determined by the sample rate), and placed in the list below.

[Animation: a sine wave being sampled at regular intervals, amplitude from -1.0 to 1.0 over time]

The samples collected: 0.000, -0.959, -0.544, 0.650, 0.913, 0.841, -0.279, -1.000, -0.288, 0.837, 0.909, 0.657, -0.537, -0.961, -0.009, 0.141, 0.989, 0.420, -0.751, -0.846, -0.757, 0.412, 0.991, 0.150, -0.906

Notice how the samples are all the same distance apart on the x-axis, indicating the same amount of time is elapsing between each sample. The x-axis denotes some arbitrary amount of time.

We have now sampled the wave. This list of samples can be saved into a file format such as *.wav or *.aiff (glossing over some technical details). When you play it back, your computer or stereo can convert the list of digital samples back into an analog wave so that it can be played on your speakers. Note that in real life, you would need many more samples to create a sound that lasts long enough to be heard.
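Here’s a sketch of this process in Ruby. It simulates measuring an analog signal by evaluating the formula for a sine wave, amplitude × sin(2π × frequency × time), at regular intervals. (The specific numbers are arbitrary, chosen to keep the output small.)

SAMPLE_RATE = 25     # samples per second; real audio uses thousands
FREQUENCY = 3.0      # cycles per second (Hz)
AMPLITUDE = 1.0      # peak height of the wave

# "Measure" the wave's amplitude at each sample time
samples = (0...SAMPLE_RATE).map do |i|
  time = i / SAMPLE_RATE.to_f
  AMPLITUDE * Math.sin(2 * Math::PI * FREQUENCY * time)
end

puts samples.map {|sample| format("%.3f", sample) }.join(" ")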

Let’s look at an end-to-end example of how this all can work in the real world. Suppose you want to record yourself playing guitar, and have a microphone plugged into your computer for this purpose. As you strum, the microphone will detect the analog sound wave and convert it into an equivalent analog electrical signal (also a wave). This signal will be sent to your computer’s sound card, which will sample it many times a second to convert it to digital form. You can then save this digital audio as a Wave file, MP3, etc. When you use a program to play it back, the samples will be sent back to the sound card. It will convert the digital samples back into an analog electric signal and send it to your speakers, causing the speaker cones to move. The speaker movement will cause a sound wave to move through the air, which your ears will detect. You will frown as you realize you flubbed that chord.

Sample Rate

An important property of digital audio is the sample rate. This represents how many times a wave is sampled per second, and therefore also the number of samples that make up one second of sound. It is expressed in hertz. The standard sample rate for CDs is 44,100Hz, meaning that 44,100 samples are used to create 1 second of sound.

The sample rate determines how accurately the frequency of the underlying signal being sampled is captured. The higher the sample rate, the more frequencies can be accurately represented. According to the Nyquist–Shannon sampling theorem, if you sample at a rate of x hertz, you can accurately recreate analog frequencies up to ½x hertz. Or put differently, to accurately sample a frequency of x, you need to sample at a rate of 2x or higher. So for example, if you want to sample some sound where the highest frequency is 5,000Hz, sampling at a rate of 10,000Hz will allow you to capture it accurately.
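In Ruby, that rule of thumb is just a division:

sample_rate = 10_000
highest_capturable_frequency = sample_rate / 2   # => 5000 (Hz)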

Since CD audio is sampled at 44,100Hz, CDs can accurately reproduce frequencies up to 22,050Hz. Humans can hear frequencies up to around 20,000Hz, so this means CDs can essentially record the entire range of human hearing. Historically, telephone signals used an 8,000Hz sample rate, which captured most but not all of the frequencies of human speech, which made it sound, well, telephone-y.

Bits Per Sample / Sample Depth

While the sample rate determines how accurately we can represent analog frequencies with digital audio, the sample depth determines how accurately we can capture amplitudes. Or put differently, where the sample rate determines resolution on the x-axis, the sample depth determines resolution on the y-axis. The sample depth determines the difference between the quietest sound we can capture, and the loudest sound (called the dynamic range).

Digital audio commonly represents each sample as an 8, 16, or 24-bit integer. Each sample taken has to be converted to one of these numbers. 8-bit numbers can encode 256 different amplitudes, 16-bit numbers can encode 65,536, and 24-bit numbers can encode 16,777,216. Since an analog signal is continuous and has an infinite number of possible amplitudes, while 8/16/24-bit integers only allow a limited number of possible values, real-world digital audio can’t capture the amplitude of a signal with 100% accuracy. The process of converting a sampled analog amplitude to one of the possible integer values is called quantization. The number of bits used to encode each sample is called the bits per sample.
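Here’s a simplified sketch of quantization in Ruby. (Real analog-to-digital converters are more sophisticated about rounding, clipping, and noise, but the basic idea is the same.)

# Convert a float sample (-1.0 to 1.0) into a signed integer
# with the given number of bits
def quantize(sample, bits_per_sample)
  max_value = (2 ** (bits_per_sample - 1)) - 1   # e.g. 32767 for 16 bits
  (sample * max_value).round
end

quantize(0.5, 16)   # => 16384
quantize(0.5, 8)    # => 64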

Try choosing different values for the bits per sample. Listen to what these different sample depths sound like, and notice how closely (or not closely) the digital samples match the original analog wave.

[Interactive graph: a fading sine wave quantized at the selected number of bits per sample, amplitude from -1.0 to 1.0 over time]

Notice how the original analog signal changes in amplitude over time. The higher the bits per sample, the more accurately the fade in/out is captured in the digital signal. At 16 bits per sample, you can hear the fade in/out clearly. At 8 bits per sample, if you listen closely with headphones the quieter part of the signal is a little distorted. At 4 bits per sample the sound is very distorted, due to there not being enough “buckets” available for the amplitude to be mapped to. At 1 bit per sample, the fade in/out disappears completely!

Also, note how the tone still sounds like the same pitch at each sample depth. This is because the sample depth doesn’t affect how accurately frequencies are captured (only amplitudes). If you record a piano playing a middle A note at 440Hz with a high sample rate but low bits per sample, it might sound noisy and distorted, but it will still sound like an A note.

Although digital audio commonly uses integer samples, it’s also possible to use floating point numbers (normally in the range of -1.0 to 1.0). Floating point numbers can provide even more amplitude values than 24-bit integers.

Number of Channels

When audio consists of a single sound wave, it is said to have 1 channel, or to be monophonic. Audio that has 2 channels is called stereo, and consists of two separate sound waves that are played at the same time. One sound wave is sent to a left speaker, and the other to a right speaker. This allows for an immersive effect, most noticeable when listening with headphones. For example, one instrument can be played in your left ear and another in your right ear. Or, a drum roll can sweep from left to right and back again. Audio CDs are stereo.

Audio with 3 or more channels is less common, but allows for an even more immersive experience. Surround sound, which generally uses 6 channels, allows for sound to come from in front of and behind you, in addition to just left and right.
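In code, stereo sample data is often represented as a list of frames, with one value per channel. Here’s a sketch (plain Ruby, with arbitrary values) that puts a 440Hz tone in the left channel and silence in the right:

SAMPLE_RATE = 44_100
FREQUENCY = 440.0

# Each frame is a [left, right] pair – one second of a tone
# in the left ear only
stereo_frames = (0...SAMPLE_RATE).map do |i|
  left = Math.sin(2 * Math::PI * FREQUENCY * i / SAMPLE_RATE)
  right = 0.0
  [left, right]
end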

Creating Your Own Digital Audio

You probably want to write programs to create your own sounds. (Why wouldn’t you?) To do so, your program needs to generate a list of samples, and then send it to your sound card for playback. Sounds simple, but the details can be more complex. First, you need to know how to generate the right samples so it actually sounds like something. Second, you need to know how to actually send the samples to your sound card.

For an example of how to generate the right samples, check out this post about NanoSynth, which describes how to create basic waves such as sine waves, square waves, etc. using Ruby. You can work around the second problem by saving your samples to a sound format such as MP3 or Wave, and letting iTunes/Winamp/whatever handle the playback for you. The WaveFile Ruby gem makes it easy to create Wave files using Ruby. The example below shows a simple program outline you can use. You would just need to implement the generate_sample_data() function.

require 'wavefile'
include WaveFile

# Should return an array of numbers between -1.0 and 1.0
def generate_sample_data
  # FILL ME IN
end

samples = generate_sample_data

# Write the samples to a 16-bit, 44,100Hz mono Wave file. The samples
# are given to the Buffer as floats, and converted to 16-bit integers
# as they are written.
Writer.new("mysound.wav", Format.new(:mono, :pcm_16, 44100)) do |writer|
  buffer = Buffer.new(samples, Format.new(:mono, :float, 44100))
  writer.write(buffer)
end
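For example, a minimal generate_sample_data that produces one second of a 440Hz sine wave might look like this (the frequency and duration are arbitrary choices):

def generate_sample_data
  sample_rate = 44100
  frequency = 440.0

  # One second of a sine wave, as floats between -1.0 and 1.0
  (0...sample_rate).map do |i|
    Math.sin(2 * Math::PI * frequency * i / sample_rate)
  end
end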

Compact Discs, Wave Files, and MP3s

Throughout this article we’ve talked about different ways of storing digital audio, such as CDs, Wave files, and MP3s. Let’s look at these a bit more in depth.

First, audio CDs. The surface of a CD contains millions of tiny pits. A CD player uses a laser to read the pattern of pits, and converts it into a series of 1s and 0s. CDs use 16 bits per sample, so every group of 16 bits represents one sample. CDs store stereo data, so separate samples are stored for the left and right speakers. Since CDs use a sample rate of 44,100Hz, every second of sound requires 88,200 samples. (44,100 samples for the sound in the left speaker, and 44,100 samples for the right speaker). When you play a CD, a stream of samples is read from the CD surface and sent to a digital-to-analog converter (DAC) inside the CD player. The resulting analog signal is sent to your speakers or headphones, which convert the signal into a sound wave via movement of the speaker cone.
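Some quick arithmetic shows the data rate this adds up to:

# Uncompressed CD audio data rate
bits_per_second = 44_100 * 16 * 2   # sample rate * bits per sample * channels
                                    # => 1,411,200 bits (176,400 bytes) per second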

Wave (.wav) and AIFF (.aiff or .aif) files on your computer are conceptually similar. Like a CD, a Wave/AIFF file mostly consists of a long stream of raw samples. (They can use compression, but I’m not sure how common this is). The beginning includes a small header that indicates the bits per sample, sample rate, etc., but the rest is mostly just samples. When you play a Wave/AIFF file, your computer sends this list of samples to your sound card, which converts it to an analog signal and sends it to your speakers. The documentation for the WaveFile gem has more info on the Wave file format.

In an MP3 file, the list of samples is compacted to take up less space. The compaction process takes into account the peculiarities of the way our brains hear sound. This allows it to throw away some sample data, most of which we wouldn’t actually be able to hear. The end result is a file which is much smaller than a Wave file, but with similar (though not quite as good) sound quality. Handy if you are downloading songs from the Internet.

Conclusion

Well, that should cover the basics. You should now have enough background information to help you write programs that work with sound, as well as to sound smart at parties.

Footnotes

  1. Sound waves can travel through any physical medium, not just air. For example, water, or solid objects like a desk or wall.
  2. "Sample" is an overloaded term. It can refer to either an instantaneous point in a wave, like we have used it in this article, or in other contexts it can refer to a short snippet of sound, often taken from a pre-existing song.
  3. In the real world, sampling never results in a digital wave that is a perfect match with the original analog wave, for a variety of reasons. We'll cover a few later in this article.