The Audio Basics
Physically speaking, sound is air pressure changing continuously over time. Electronically speaking, analog audio signals are voltages changing continuously over time, as a representation of the changing air pressure. The key word here is “continuously.”
Digital information, on the other hand, is not continuous. It is discrete. It consists of specific, unambiguous, individual pieces of information that can be reduced to simple binary expressions. It requires that we convert the ongoing continuum that is analog audio into discrete chunks of data. To do this, we superimpose the analog signal on a 2-dimensional grid that represents time and voltage and assign a single voltage value to each time increment in the grid, thereby going from a continuous signal to a sequence of discrete ones. The result is that what was a continuously varying voltage becomes a stream of numbers. That number stream is expressed in binary (1s and 0s) and is recorded and stored as a magnetic pulse code, usually integrated into a digital language of some sort.
To reproduce the sound stored this way, first we replay the magnetic pulse code to recreate the digital data stream of numbers representing the sound. We then reconstruct the time/voltage grid and the map of values on it that is recovered from the digital data stream, and then we smooth off the rough corners of the grid by analog lowpass filtering to recover our analog audio, which then gets converted back into sound via loudspeakers like any other audio signal.
No big deal!
Here’s a pictorial version of the above process:
Above we have a representation of continuous voltage change over time. Zero volts is in the middle and voltage excursions are both positive and negative. Halfway through is a rapidly changing section of signal, preceded and followed by more slowly and smoothly changing voltages. This center portion has more high frequency content in it.
Now we have laid this signal trace onto a grid which divides both voltage and time into a discrete set of spaces. We will ultimately have one chunk of data for each chunk of time.
The bold line above represents the process of sampling the waveform, which is to say that at each sampled point in time we note the actual analog voltage. Note that the high-frequency swing in the middle is lost, because it cannot be detected at the given sampling rate.
The drawing above represents the all-important process of quantization. Here, the sampled analog voltage values are rounded to the nearest discrete grid value, and each discrete value is then expressed as a 4-bit binary number. If I’d drawn this a little more carefully, we could have fit the signal into 3-bit binary numbers. In any case, we now have a data stream of binary numbers that represent the audio signal.
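For readers who would rather see the sampling and quantization steps as code than as a drawing, here is a minimal Python sketch. The function names, the 1 Hz test tone, the 8-samples-per-second rate, and the plus-or-minus 1 volt full scale are all assumptions made for illustration; they are not taken from the drawings.

```python
import math

def sample(signal, sample_rate, duration):
    """Read a continuous-time signal at evenly spaced instants."""
    count = int(round(sample_rate * duration))
    return [signal(n / sample_rate) for n in range(count)]

def quantize(voltage, bits=4, full_scale=1.0):
    """Round a voltage to the nearest of 2**bits evenly spaced grid levels.

    Returns the level index: 0 is -full_scale, 2**bits - 1 is +full_scale.
    """
    levels = 2 ** bits
    step = 2 * full_scale / (levels - 1)
    # Anything outside the grid is clipped to the nearest edge.
    clipped = max(-full_scale, min(full_scale, voltage))
    return int(round((clipped + full_scale) / step))

def to_binary(code, bits=4):
    """Express a level index as a fixed-width binary word."""
    return format(code, "0{}b".format(bits))

def one_hz_sine(t):
    """Hypothetical test signal: a 1 Hz sine swinging between -1 and +1 volt."""
    return math.sin(2 * math.pi * t)

samples = sample(one_hz_sine, sample_rate=8, duration=1.0)
codes = [quantize(v) for v in samples]
words = [to_binary(c) for c in codes]
print(words)
```

Each entry in words is one "chunk of data" for one chunk of time. Changing bits from 4 to 3 shows how a coarser grid forces larger rounding errors.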
The above picture is an example of the sort of pulse code that is actually recorded. Ones are represented by positive voltages and zeroes by negative voltages. So the actual recording is still of a changing waveform, but that waveform represents numbers, not actual analog voltage values. The frequency of the recording is much higher than analog, but we have virtually no concern for distortion or dynamic range when making this recording.
This is the reconstructed digital data derived from the pulse code data stream. Note that it is a perfect replication of the quantized data that was used to create the data stream in the first place.
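The reason the replication can be perfect is that the decoder only has to decide whether each pulse is positive or negative. The following is a rough Python sketch of that idea, not a description of any real recording format; the function names, the example words, and the amount of signal degradation are assumptions chosen for illustration.

```python
def to_pulses(words, high=1.0, low=-1.0):
    """Turn binary words into a pulse train: ones positive, zeroes negative."""
    pulses = []
    for word in words:
        for bit in word:
            pulses.append(high if bit == "1" else low)
    return pulses

def from_pulses(pulses, bits=4):
    """Recover binary words by looking only at the sign of each pulse."""
    bit_chars = ["1" if v > 0 else "0" for v in pulses]
    return ["".join(bit_chars[i:i + bits]) for i in range(0, len(bit_chars), bits)]

original = ["0101", "1111", "0000"]
recorded = to_pulses(original)

# Simulate a sloppy recording: every pulse is attenuated and offset.
degraded = [0.6 * v + 0.1 for v in recorded]

recovered = from_pulses(degraded)
print(recovered == original)
```

As long as the degradation never flips the sign of a pulse, the recovered words match the originals exactly, which is why distortion and dynamic range in the recording itself matter so little.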
Here we have the reconstructed analog waveform (in bold) after filtering. To draw this I used the smoothing function in my graphics program (MacDraft). Note that the smoothing has left too much high-frequency material, which is the same as an audio filter with the cutoff frequency set too high. Also, I have included the original waveform to show how things have changed. To begin with, our sampling rate is way too slow to accurately capture the high-frequency elements of this signal. Also, our 4-bit data does not provide adequate amplitude resolution, so there are really significant amplitude errors throughout. The dynamic range of this signal is only 24 dB.
The fundamentals to keep in mind are:
For digital audio to be sonically satisfactory, we have to make the digital data chunks small enough so that they appear to us to be indistinguishable from the analog continuum. The total number of data bits required is the product of the number of time samples multiplied by the number of bits used for each sample. Generally accepted values for these are 48,000 time samples per second, and sixteen bits of resolution per sample. Such values permit, in theory, an audio bandwidth up to 24 kHz with a dynamic range of 96 dB from noise floor to distortion.
Dynamic range and signal-to-noise/headroom ratios are determined by the number of bits, which in turn determines the number of possible different voltages that can be selected for any given time increment. Each bit doubles the number of possible voltage increments, which is another way of saying that it increases the dynamic range by about 6 dB. A one-bit piece of data has a dynamic range of 6 dB, a four-bit piece of data has a range of 24 dB, and a sixteen-bit piece of data has a range of 96 dB. You can also think of these in terms of powers of two: one bit (2^1) has only two possible values, two bits (2^2) have four possible values, four bits (2^4) have sixteen possible values, and sixteen bits (2^16) have 65,536 possible values.
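The 6 dB per bit figure comes from the standard decibel formula for a voltage ratio, 20 times the base-10 logarithm of the ratio, applied to the number of available levels. Here is a short Python sketch of that arithmetic; the function names are my own, and the result is the theoretical figure only, ignoring any real converter's imperfections.

```python
import math

def level_count(bits):
    """Number of distinct values an n-bit word can hold."""
    return 2 ** bits

def dynamic_range_db(bits):
    """Theoretical dynamic range: 20 * log10(number of levels)."""
    return 20 * math.log10(level_count(bits))

for bits in (1, 4, 16):
    print(bits, level_count(bits), round(dynamic_range_db(bits), 2))
```

The exact values come out slightly above the round numbers (about 6.02 dB per bit), which is why 96 dB for sixteen bits is a convenient approximation rather than a precise figure.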
48,000 samples times 16 bits is 768,000 bits of data per second of audio, per channel. A three-minute song, in stereo, requires 276,480,000 bits of data! That’s right, a three-minute song takes slightly more than a quarter of a billion bits of data. This is, as they say in Craters of the Moon National Monument, Idaho, “a heap o’ data!” This data bulk has presented some of the biggest engineering challenges in developing and implementing digital audio. At the same time, it has made available bandwidths and dynamic ranges beyond the physical constraints that analog tape recorders have generally placed on us.
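The data-bulk arithmetic is easy to check for yourself. This small Python sketch reproduces the figures above; the function name is hypothetical, and it counts raw audio bits only, with no allowance for error correction or formatting overhead.

```python
def audio_bits(sample_rate, bits_per_sample, channels, seconds):
    """Raw audio data size in bits: samples x bits x channels x time."""
    return sample_rate * bits_per_sample * channels * seconds

mono_second = audio_bits(48000, 16, channels=1, seconds=1)
stereo_song = audio_bits(48000, 16, channels=2, seconds=3 * 60)

print(mono_second)       # bits for one second of one channel
print(stereo_song)       # bits for a three-minute stereo song
print(stereo_song // 8)  # the same song in bytes
```

Plugging in other sample rates or word lengths shows how quickly the total grows as either number goes up.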
Frequency response and bandwidth are determined by the rate at which we sample time increments. The rule is that we must have a minimum of two time samples per cycle to represent any frequency, so the highest frequency we can capture is half the sampling rate. This is the so-called Nyquist Theorem, and it is, if we stop and think about it, intuitively apparent: we need at least one positive and one negative value to represent the existence of an oscillating wave over time.
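To put numbers on the two-samples rule, here is a tiny Python sketch. The function names are my own, and the frequencies are just examples.

```python
def nyquist_frequency(sample_rate):
    """Highest frequency that still gets two samples per cycle."""
    return sample_rate / 2

def samples_per_cycle(frequency, sample_rate):
    """How many samples land on one cycle of a given frequency."""
    return sample_rate / frequency

print(nyquist_frequency(48000))         # upper limit at 48,000 samples per second
print(samples_per_cycle(1000, 48000))   # a 1 kHz tone is sampled generously
print(samples_per_cycle(24000, 48000))  # a 24 kHz tone sits right at the limit
print(samples_per_cycle(30000, 48000))  # a 30 kHz tone gets fewer than two
```

Any frequency that comes out below two samples per cycle cannot be represented at that sampling rate.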
Most errors are introduced into the system during the conversion process (i.e. converting from analog to digital or from digital to analog), while comparatively few are introduced during recording/playback. One of the big attractions of digital storage is its unambiguous nature. A one is always a one, regardless of how it is scrawled or what font is used. So, in digital, if you store a 1 and get back a 1, you have perfect reproduction. It is not logically possible to get back a .99 or a 1.03 as a distortion of 1 in digital data, as it is in analog. At the same time, inaccurate analog-to-digital converters (ADC) and digital-to-analog converters (DAC) can result in audibly degraded sound, even if they have the requisite bandwidth and dynamic range.
Analog audio frequencies that are above the bandwidth (which is to say there are fewer than two samples per cycle to represent them) are a major problem for digital systems, because they will be frequency-shifted down into the audio bandwidth as spurious artifacts (I am not going to explain this here; just trust me). This is why we need low-pass anti-aliasing filters prior to analog-to-digital conversion and low-pass anti-imaging filters after reconstruction back into analog.
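For anyone not inclined to take the frequency-shifting on trust, a short Python sketch can at least demonstrate it. The folding rule used here is the standard one for sampled signals; the function name, the 30 kHz tone, and the 48,000 samples per second rate are assumptions chosen for illustration.

```python
import math

def alias_frequency(frequency, sample_rate):
    """Frequency that a sampled tone appears at, folded into 0..sample_rate/2."""
    remainder = frequency % sample_rate
    return min(remainder, sample_rate - remainder)

sample_rate = 48000
too_high = 30000  # above the 24 kHz limit for this sample rate
shifted = alias_frequency(too_high, sample_rate)
print(shifted)

# Sample both tones at the same instants and compare the values.
high_samples = [math.cos(2 * math.pi * too_high * n / sample_rate) for n in range(16)]
low_samples = [math.cos(2 * math.pi * shifted * n / sample_rate) for n in range(16)]
largest_gap = max(abs(a - b) for a, b in zip(high_samples, low_samples))
print(largest_gap)
```

The two lists of samples agree to within floating-point rounding, so once the signal is sampled there is no way to tell the out-of-band tone from a genuine in-band one. That is why the filtering has to happen before conversion.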