Phase Vocoder


Brief History and Overview

The phase vocoder is a tool that today finds its place among computer music applications, but it was initially created to provide telephone companies with an economical transmission system [1].  A lower bit rate allows more data to fit within the same bandwidth.  Telephone companies wanted to fit as many voices as possible through a single telephone line in order to maximize profits, so they began to research technology that would lower the bandwidth needed for a single voice.  Analysis-synthesis schemes were created to encode the voice more efficiently.  In such a scheme, the analysis stage calculates a number of parameters from the input signal according to a predetermined model, and the synthesis stage recreates the signal from those parameters.  Dudley created a channel vocoder at Bell Labs in 1939, and nearly thirty years later, in 1966, Flanagan and Golden developed the first phase vocoder, also at Bell Labs [2].

The word "vocoder" is a contraction of "voice" and "coder"; the name "phase vocoder" is meant to differentiate it from the channel vocoder [3].  The channel vocoder needed pitch tracking and voiced-unvoiced switching.  The phase vocoder instead uses phase-derivative signals to transmit excitation information and doesn’t rely on decision-making algorithms [2].

Speech is a very band-limited process, and exploiting this characteristic allows efficient data compression.  Speech is produced by air moving through the nose and mouth while being shaped by the lips, tongue, and jaw.  Broadly speaking, the mouth determines the overall frequency content of the speech [1].  The shape of the mouth can’t change appreciably in less than approximately 1/50th of a second, so according to Nyquist it can be accurately sampled at 100 samples/second.  This is much lower than the sampling frequency needed to capture the actual speech signal.  The overall frequency shaping imposed by the vocal tract is called the spectral envelope [1].  The spectral envelope changes relatively slowly, and so does the excitation, the signal produced by the larynx.  The excitation is broadband noise for consonants and a more pitched signal for vowels.  Speech can then be represented as a slowly changing spectral envelope filtering a slowly changing excitation signal [1].

An early device that separated and transmitted the spectral envelope and excitation data was the voice-excited vocoder (VEV).  It relied on the transmission of an unprocessed subband of the original speech to carry the excitation information, while the spectral envelope information was transmitted by a number of slowly varying signals, similar to the channel vocoder.

How It All Works

The phase vocoder is an analysis-synthesis scheme; an input signal is broken down according to frequency.  The frequency data is then used to manipulate a series of oscillators whose output, when summed together, recreates the original input signal.

The phase vocoder can be modeled in two different ways, as described by Dolson in "The Phase Vocoder: A Tutorial" [3]: the filterbank interpretation and the Fourier-transform interpretation.  Both interpretations accomplish the same goal; they are merely different implementations of the bank of filters.  The filterbank interpretation is based on the assumption that a signal passed through a parallel bank of contiguous band-pass filters can be recombined into a signal extremely close to the original [2].

In the filterbank interpretation, the individual band-pass filters need to have identical frequency responses with different center frequencies, the center frequencies need to be equally spaced over the spectrum, and the sum of the frequency responses of the filters needs to be flat across the spectrum.  These constraints leave the number of filters and their frequency response as the only variables to design.  For harmonic signals, there should be at least one filter for each harmonic in the frequency range from 0 Hz to the Nyquist frequency.  Since inharmonic signals don’t have regularly spaced partials, the phase vocoder can produce undesired results with them.  When there are multiple partials within one band-pass filter, they can constructively and destructively interfere with each other [3].

The individual filters are limited by the tradeoff between frequency and time response: the sharper the frequency cutoff, the poorer the time response.  The right balance between time and frequency response depends on the input signal; a quickly changing signal benefits from greater temporal resolution, while a slowly changing signal is filtered better with higher frequency resolution.  It is also important that the filters have linear phase; without it, "the synthesized speaker sounds drunk" [4].  This might occur because the faster changing components of the signal arrive at different times than the slower components.

The input to each filter in the bank is processed along two parallel paths.  The signal is split; one copy is multiplied by a sine wave and the other by a cosine wave, both at the filter’s center frequency.  This method is called heterodyning.  It relies on the property that when a signal is multiplied by a phasor of frequency θ, its spectrum is shifted by θ [1].  The practice is common in radio and television, but in the phase vocoder it is used to shift the input signal down in frequency.  When the signal is multiplied by a sine wave, each of its components is split into two components, at the sum and at the difference of the component frequency and the sine-wave frequency.  The result is then low-pass filtered so that only the lower (difference) frequency continues on in the vocoder [3].  The output of the low-pass filters is therefore two low-frequency signals that are identical except for their phase; one is 90 degrees ahead of the other.
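As a rough illustration of heterodyning, the sketch below multiplies a test tone by a cosine and a sine at a channel's center frequency and averages the products, with the average standing in for the low-pass filter.  The function name `heterodyne`, the block-average filter, and the test values are assumptions made for this example and are not taken from the cited sources.

```python
import math

def heterodyne(signal, center_freq, sample_rate):
    """Shift the band around center_freq down to 0 Hz and low-pass by averaging."""
    n = len(signal)
    in_phase = 0.0
    quadrature = 0.0
    for i, x in enumerate(signal):
        angle = 2.0 * math.pi * center_freq * i / sample_rate
        in_phase += x * math.cos(angle)      # cosine path
        quadrature -= x * math.sin(angle)    # sine path (sign chosen so the shift is downward)
    # The factor 2 restores the amplitude halved by the product-to-sum identity.
    return 2.0 * in_phase / n, 2.0 * quadrature / n

# Assumed test tone: 100 Hz, amplitude 0.8, starting phase 0.5 rad, 1 s at 1 kHz.
sample_rate = 1000
tone = [0.8 * math.cos(2.0 * math.pi * 100 * i / sample_rate + 0.5)
        for i in range(sample_rate)]
i_part, q_part = heterodyne(tone, 100, sample_rate)
print(math.hypot(i_part, q_part), math.atan2(q_part, i_part))
```

With a tone at exactly the center frequency, the two outputs are the rectangular coordinates of the tone's amplitude and starting phase, so the printed magnitude and angle should be close to 0.8 and 0.5.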

The resulting pair of signals is converted from rectangular coordinates into polar form.  Polar form expresses a phasor as a magnitude and a phase, but the phase is limited to values between 0 and 2π radians (0 and 360 degrees).  A process called phase unwrapping then converts the phase into values that continue past 360 degrees [3].  Successive unwrapped values are subtracted from one another and the difference is divided by the time interval, giving a frequency signal.  This signal is the deviation from the filter’s center frequency, so the center frequency is then added back in to obtain the frequency signal for the channel [3].
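The unwrapping and differencing steps can be sketched in a few lines.  This is a minimal illustration rather than any particular published implementation; the helper names `unwrap` and `frequency_from_phase`, the 440 Hz center frequency, and the 3 Hz deviation are all assumed for the example.

```python
import math

TWO_PI = 2.0 * math.pi

def unwrap(phases):
    """Remove 2*pi jumps so the phase can keep growing past one full turn."""
    out = [phases[0]]
    for p in phases[1:]:
        delta = p - out[-1]
        delta -= TWO_PI * round(delta / TWO_PI)   # bring the step into [-pi, pi]
        out.append(out[-1] + delta)
    return out

def frequency_from_phase(phases, dt, center_freq):
    """Difference the unwrapped phase over time, then add the center frequency (Hz)."""
    unwrapped = unwrap(phases)
    return [center_freq + (b - a) / (TWO_PI * dt)
            for a, b in zip(unwrapped, unwrapped[1:])]

# Assumed channel output: phase advancing at 3 Hz, sampled every 10 ms, wrapped to [0, 2*pi).
dt = 0.01
wrapped = [(TWO_PI * 3.0 * i * dt) % TWO_PI for i in range(100)]
freqs = frequency_from_phase(wrapped, dt, center_freq=440.0)
print(min(freqs), max(freqs))
```

Every estimate should sit close to 443 Hz, the assumed center frequency plus the 3 Hz deviation carried by the phase.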

Another way to view the phase vocoder is through the Fourier transform.  Instead of dividing the spectrum of the input signal with band-pass filters, the signal is passed through a series of overlapping Fourier transforms.  When x(n) represents the samples of a waveform, the discrete short-time Fourier transform is defined by:

$$X_k(n) = \sum_{r=-\infty}^{\infty} x(r)\, h(n-r)\, W_N^{-rk} \qquad (1)$$

where $k = 0, 1, 2, \ldots, N-1$ and $W_N = e^{j 2\pi / N}$ [5].  $h(n)$ is an appropriately chosen window; the input signal is "viewed" through this window in sliding chunks.  The window needs to be smooth and of finite duration, like the Hamming, Kaiser, or Dolph-Chebyshev windows [5].

The Fourier transform analyzes the frequency content of a signal over a windowed amount of time by producing a series of coefficients that correspond to a series of frequency bins.  The discrete Fourier transform (DFT) applies the Fourier transform to a finite block of samples, which in practice lets it analyze signals that are not perfectly periodic [1].  The tradeoff between temporal and frequency resolution also exists in the Fourier-transform interpretation.  The greater the window size, the more content is analyzed and the greater the frequency resolution.  However, the frequencies in a window tend to smear over the time span of the window, resulting in poor time resolution.  If the window size is decreased, the time resolution improves, but fewer frequencies can be analyzed and the frequency resolution decreases.
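The tradeoff can be put in numbers.  The short sketch below assumes a 44.1 kHz sample rate, which the article does not specify, and a hypothetical helper `resolution` that reports the spacing between frequency bins and the time span covered by one window.

```python
def resolution(window_size, sample_rate):
    """Return (bin spacing in Hz, window duration in seconds)."""
    bin_spacing_hz = sample_rate / window_size
    window_duration_s = window_size / sample_rate
    return bin_spacing_hz, window_duration_s

for size in (256, 2048):
    spacing, duration = resolution(size, 44100)   # 44.1 kHz is an assumed rate
    print(size, round(spacing, 2), "Hz per bin,", round(duration * 1000, 2), "ms per window")
```

The two quantities are reciprocals of each other, so any gain in one is paid for in the other regardless of the sample rate chosen.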

Each bin in the Fourier transform has a cosine and a sine component (in-phase and quadrature).  When these components are represented in polar form, the time-varying frequency within a bin can be calculated by comparing the phase of the bin in successive Fourier transforms [3].  These phase values need to be unwrapped like the ones in the filterbank interpretation.
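The frame-to-frame comparison can be sketched as follows.  This is an illustrative example under stated assumptions: a Hann window, a direct single-bin DFT in place of an FFT, and made-up values for the tone (1010 Hz), sample rate (8 kHz), frame size (256), and hop (64).  The names `dft_bin` and `bin_frequency` are hypothetical.

```python
import cmath
import math

def dft_bin(frame, k):
    """Single DFT coefficient (bin k) of one frame."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(frame))

def bin_frequency(signal, k, n, hop, sample_rate):
    """Estimate the frequency in bin k from the phase change between two frames."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]  # Hann
    first = dft_bin([w * x for w, x in zip(window, signal[:n])], k)
    second = dft_bin([w * x for w, x in zip(window, signal[hop:hop + n])], k)
    expected = 2 * math.pi * k * hop / n   # phase advance of the bin's center frequency
    deviation = cmath.phase(second) - cmath.phase(first) - expected
    deviation -= 2 * math.pi * round(deviation / (2 * math.pi))  # wrap to [-pi, pi]
    return (expected + deviation) * sample_rate / (2 * math.pi * hop)

sample_rate = 8000
tone = [math.sin(2 * math.pi * 1010 * i / sample_rate) for i in range(512)]
estimate = bin_frequency(tone, k=32, n=256, hop=64, sample_rate=sample_rate)
print(estimate)
```

Bin 32 is centered on 1000 Hz here, yet the estimate should land close to the tone's 1010 Hz, which is the extra information the phase provides beyond the bin index.  The hop must be short enough that the deviation stays within half a turn between frames.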

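The synthesis portion of the phase vocoder takes the inverse FFT of the data and sums the resulting components over time to recreate the signal.  Equation (1) can be used to derive the synthesis as:

$$x(n) = \frac{1}{N\,h(0)} \sum_{k=0}^{N-1} X_k(n)\, W_N^{nk} \quad \text{for all } n \qquad (2)$$

where $W_N = e^{j 2\pi / N}$ and the window satisfies $h(mN) = 0$ for every nonzero integer $m$.  This can be interpreted as modulating the N signals $X_k(n)$ to the center frequencies $2\pi k / N$ and adding the signals together.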

Applications

The phase vocoder attempts to separate the time and frequency attributes of a signal.  The better the separation, the more faithfully the output recreates the original input.  This separation then allows one attribute to be manipulated without affecting the other.  Without any separation, shortening a sound causes its pitch to go up, hence our beloved Alvin and the Chipmunks.  For example, playing a sound file at twice its original speed raises the pitch by an octave.  By the same token, the only way to change the pitch of a sound was to speed up or slow down its playback.  The phase vocoder, however, can lengthen the playback of a file without affecting the pitch, and can also raise or lower the pitch without changing the length of the file.

Time scaling allows for the manipulation of the time domain without influencing the frequency content.  Expanding a sound file is easily done with the phase vocoder by interpolation.  When the time domain information that controls the output oscillators is lengthened through interpolation, the frequency content remains unaffected [3].  In terms of the FFT, time expansion means spacing the inverse FFTs farther apart, which relays the same spectral information over a longer period of time.  However, this causes a problem when the phase isn’t considered.  Without any scaling of the phase signal, the phase of each FFT bin won’t line up from one frame to the next after a time expansion.  Spreading out the frames without rescaling causes the same phase increase to occur over a longer period of time, which lowers the frequency.  This is solved by scaling the phase signal by the same factor as the time domain [3].
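The need for phase scaling can be checked with a single bin.  The sketch below is a simplified illustration, not a full time stretcher: it assumes the analysis phases are already unwrapped, and the function name `synthesis_phases`, the hop sizes, and the frequency of 0.3 radians per sample are made up for the example.

```python
def synthesis_phases(analysis_phases, scale):
    """Accumulate the frame-to-frame phase steps, each multiplied by `scale`."""
    out = [analysis_phases[0]]
    for prev, cur in zip(analysis_phases, analysis_phases[1:]):
        out.append(out[-1] + scale * (cur - prev))
    return out

omega = 0.3            # assumed true frequency, radians per sample
analysis_hop = 64      # samples between analysis frames (assumed)
synthesis_hop = 128    # frames are spaced twice as far apart on output
scale = synthesis_hop / analysis_hop

analysis = [omega * analysis_hop * m for m in range(5)]   # unwrapped phases
scaled = synthesis_phases(analysis, scale)
unscaled = synthesis_phases(analysis, 1.0)

# Frequency implied on output = phase step divided by the synthesis hop.
freq_scaled = (scaled[1] - scaled[0]) / synthesis_hop
freq_unscaled = (unscaled[1] - unscaled[0]) / synthesis_hop
print(freq_scaled, freq_unscaled)
```

With the phase step scaled by the expansion factor the implied frequency stays at 0.3 radians per sample; left unscaled, it drops to half that value.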

Transposing the pitch without affecting the time domain is also fairly simple and uses the same means as time scaling.  The signal is time scaled and then played back at a different sampling rate, producing the desired pitch transposition.  The most efficient way to do this is to perform sample rate conversion on the time-scaled file instead of changing the clock rate on the digital-to-analog converter [3].
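The second half of that procedure, the sample rate conversion, can be sketched with a simple linear-interpolation resampler.  This is an assumption-laden toy: a practical converter would use a proper anti-aliasing filter, and the function name `resample` and the test tone are invented for the example.  The input stands in for a file that has already been time-stretched to twice its length.

```python
import math

def resample(signal, ratio):
    """Read through the signal `ratio` times faster using linear interpolation."""
    out = []
    position = 0.0
    while position < len(signal) - 1:
        i = int(position)
        frac = position - i
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
        position += ratio
    return out

# Assumed input: a 5 Hz tone at 1 kHz that was time-stretched from 1000 to 2000 samples.
stretched = [math.sin(2.0 * math.pi * 5 * i / 1000) for i in range(2000)]
shifted = resample(stretched, 2.0)
print(len(stretched), len(shifted))
```

Reading the stretched file twice as fast returns it to its original length of 1000 samples while doubling the tone's frequency, which is an upward transposition of one octave.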

These applications have a limitation: the time-scale factor must be a ratio of integers.  This follows from the definition of the expansion factor, which relates two integers: the number of samples between successive analysis FFTs and the number of samples between successive synthesis FFTs [3].  This often isn’t a problem with time scaling, but it can be bothersome in pitch transposition.  For time scaling, the closest feasible integer ratio may not be exactly the desired number, but the error is usually negligible.  The ear is much more sensitive to pitch, however, and a discrepancy may be obvious and possibly unacceptable.  Much more attention must be paid to selecting appropriate scaling factors when time scaling is used for pitch transposition.
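To get a feel for how large the pitch error can be, the sketch below searches for the closest integer ratio to an equal-tempered semitone and reports the error in cents.  The helper `closest_ratio` and the denominator limits are assumptions for illustration; it relies on `Fraction.limit_denominator` from the Python standard library, which bounds only the denominator.

```python
from fractions import Fraction
import math

def closest_ratio(target, max_denominator):
    """Closest ratio of integers to `target` whose denominator is at most max_denominator."""
    ratio = Fraction(target).limit_denominator(max_denominator)
    cents_error = 1200.0 * math.log2(float(ratio) / target)
    return ratio, cents_error

semitone = 2.0 ** (1.0 / 12.0)   # desired pitch ratio for one semitone up
for limit in (10, 100, 1000):
    ratio, cents = closest_ratio(semitone, limit)
    print(limit, ratio, round(cents, 3), "cents")
```

Small hop sizes allow only coarse ratios, so the error shrinks as larger integers are permitted.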

Flanagan and Golden suggest an implementation of the phase vocoder to assist people with hearing problems in the high frequencies by using frequency division to bring down upper frequencies that might not be heard.  They also suggest using time compression for an "auditory ‘speed-reading’" for the blind [2].

Implementation in MATLAB

A phase vocoder was implemented in MATLAB using an FFT with a Hanning window.  The window overlap, number of FFT points, and scale factor were all input parameters.  An audio file of a solo tenor saxophone was the test file.  Fig. 2 shows the spectrogram of the audio without any processing done.

Fig. 2.  Original saxophone audio file

When the file was passed through the vocoder with a scale factor of 1, the output differed little from the original signal, as desired, although some minor artifacts were introduced.  Fig. 3 shows the file through the vocoder with a scale factor of 1, a window overlap of 25%, and 2048 FFT points.

The file was then time-scaled to twice its original length while keeping the pitch intact.  The result is Fig. 4.

Fig. 3.  Saxophone audio with scale factor of 1

When the file was pitch-shifted up, relatively few artifacts were added, but when the file was pitch-shifted down two octaves, it ceased to sound like a tenor saxophone and sounded more like a string bass.  This was due to missing overtones and harmonics that were thrown out in the rescaling.  Reflections of the frequencies also began to appear, as seen in Fig. 6.

Fig. 4.  Saxophone file twice as long

Fig. 5.  Saxophone file pitch-shifted up one octave

Fig. 6.  Saxophone file pitch-shifted down two octaves

Past Suggestions for Improvement

The most common complaint about the phase vocoder is "phasiness".  Phasiness describes a number of artifacts introduced into the signal by the vocoder, all of which stem from phase errors.  It has been described as a "loss of presence" or an addition of reverberation [6].  There are two types of phase coherence: horizontal phase coherence from frame to frame and vertical phase coherence between neighboring channels.  Phase coherence is essential for a quality synthesis because the Fourier-transform windows overlap, so the phase needs to remain consistent.  A constant sinusoid should have high vertical phase coherence, and a sinusoid with slowly changing frequency should have similar phase in the channels surrounding the frequency [6].

It might be assumed that manipulating the signal will cause further errors in the phase, but in fact it doesn’t as long as the scale factor is constant.  Phase unwrapping can introduce errors in the form of multiples of 2απ added to the synthesis phase, where α is the scale factor, but if α is an integer the error is a whole number of turns and isn’t noticed [6].

Past strategies for improving the quality of the vocoder and reducing phasiness include a method called magnitude-only reconstruction.  In this method, the phase is completely discarded and then resynthesized from the magnitude using iterative techniques [6].  This technique isn’t completely successful and requires a lot of computation without really improving the sound.

Puckette proposed using a phase-locked vocoder.  He recognized that for a constant-frequency, constant-amplitude sinusoid, the synthesis phases of the channels around the maximum of the Fourier transform should alternate by ±π from one channel to the next [6].  The success of the method was found to depend on the input signal [6, 7].

Puckette’s phase-locked vocoder inspired Laroche and Dolson’s peak phase-locked algorithm.  The algorithm detects each channel whose amplitude is larger than that of its four nearest neighbors, and the phase is then synthesized for that peak.  The collection of peaks in the frequency domain divides the spectrum into "regions of influence" [6].  The peak in each region locks the phase for the rest of the region.  A second step in the proposal is to perform peak detection and shift the regions around each peak to a new location according to previous data [6, 8].
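A rough sketch of the first step might look like the following.  It makes several assumptions that are not spelled out in the text above: region boundaries are placed midway between neighboring peaks, each channel keeps the phase offset it had from its peak in the analysis frame, and the function names and example spectrum are invented for illustration.

```python
def find_peaks(magnitudes):
    """Channels whose magnitude exceeds that of their four nearest neighbors."""
    peaks = []
    for k in range(2, len(magnitudes) - 2):
        neighbors = magnitudes[k - 2:k] + magnitudes[k + 1:k + 3]
        if all(magnitudes[k] > m for m in neighbors):
            peaks.append(k)
    return peaks

def regions_of_influence(peaks, num_channels):
    """Assign every channel to its nearest peak (boundary midway between peaks)."""
    return [min(peaks, key=lambda p: abs(p - k)) for k in range(num_channels)]

def lock_phases(analysis_phases, peak_synthesis_phase, owner):
    """Give each channel its peak's synthesis phase plus its original offset from the peak."""
    return [peak_synthesis_phase[p] + analysis_phases[k] - analysis_phases[p]
            for k, p in enumerate(owner)]

# Made-up magnitude spectrum (a plain list) with two clear peaks.
magnitudes = [0, 1, 3, 9, 3, 1, 0, 2, 6, 2, 0, 0]
peaks = find_peaks(magnitudes)
owner = regions_of_influence(peaks, len(magnitudes))
analysis_phases = [0.1 * k for k in range(len(magnitudes))]
locked = lock_phases(analysis_phases, {3: 1.0, 8: 2.0}, owner)
print(peaks, owner)
```

Only the peak channels need their synthesis phase computed in the usual way; every other channel follows the peak that owns its region.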

Laroche and Dolson later refined this idea by detecting peaks in the STFT and then translating them to new arbitrary frequencies.  The underlying theory is that if the relative magnitudes and phases around a peak are preserved, then the time-domain signal corresponding to the shifted peak is a sinusoid modulated by the analysis window.  The peak phases need to be consistent from frame to frame.  The frequency is no longer ω but ω+Δω, so the peak phase can be rotated by ΔωR, where R is the phase vocoder hop size.  Since the calculation doesn’t require knowledge of the exact value of ω, but only the shift Δω, no phase unwrapping is needed [8].  With this technique of shifting frequencies, more applications can be drawn from the vocoder, including a chorusing effect, harmonizing, and creating an inharmonic sound from a harmonic one to produce a bell effect.  All of these can be done in real time since there is no preliminary analysis stage [8].
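The phase rotation itself is a single complex multiplication per region.  The sketch below assumes the rotation accumulates from frame to frame, so that frame number u is rotated by u times ΔωR; the function name `rotate_region` and the numeric values are hypothetical.

```python
import cmath

def rotate_region(bins, delta_omega, hop, frame_index):
    """Rotate every bin around a shifted peak by the accumulated angle delta_omega * hop * frame_index."""
    rotation = cmath.exp(1j * delta_omega * hop * frame_index)
    return [b * rotation for b in bins]

# Made-up region of two bins, shifted by 0.01 rad/sample with a hop of 64 samples.
region = [1.0 + 0.0j, 0.5j]
frame_one = rotate_region(region, delta_omega=0.01, hop=64, frame_index=1)
print(cmath.phase(frame_one[0]), abs(frame_one[1]))
```

Because every bin in the region is multiplied by the same unit-magnitude factor, the relative magnitudes and phases around the peak are left untouched, which is the condition the technique relies on.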

Our Suggestions for Improvement

The nature of the FFT always sacrifices either frequency or time resolution.  Faster changing sounds, such as transients, are analyzed better through a smaller window, while slower changing sounds benefit more from a larger window size.  A proposal to optimize these traits is to use an adaptive windowing scheme.  The input signal could be analyzed using a pitch detection algorithm that then determines the size of the window in the FFT.  An algorithm would also be needed to detect transients, which can trigger a reduced window size.  If the window of the FFT is equal to the period of the fundamental frequency within that window, then phase errors are virtually eliminated.  The phase "lines up" across multiple frames and results in a better synthesis with fewer errors due to out-of-phase signals.

Another way to address the compromise between frequency and time resolution is to introduce wavelets into the system.  The input spectrum can be subdivided by bandpass filters.  The output of each filter then has a unique window size associated with it that fits the frequency content; higher frequencies have smaller windows than lower frequencies.

The algorithm implemented is more successful at expanding audio files than at speeding them up.  This is because points are dropped and discarded outright in order to increase the sample rate, which leads to discontinuities in the phase between the synthesized samples.  Dropping twice as many points and then interpolating between the remaining ones would ease the issues with phase coherence.

Conclusion

The phase vocoder is a useful tool that was first implemented to reduce bandwidth and increase efficiency in phone lines, but is now used as a musical tool for synthesized sound.  Its ability to separate time and frequency allows the length of an audio file to be changed without affecting its pitch, and the overall pitch of a file to be shifted without changing its duration.

The vocoder introduces some artifacts and issues into the audio, but a number of solutions have been proposed.  These solutions do not completely rid the vocoder of errors, but they do greatly improve its performance.

References

  1. K. Steiglitz, A Digital Signal Processing Primer, Menlo Park, CA: Addison-Wesley, 1996.
  2. J. L. Flanagan and R. M. Golden, "Phase Vocoder," Bell System Technical Journal, 45, pp. 1493-1509, 1966.
  3. M. Dolson, "The Phase Vocoder: A Tutorial," Computer Music Journal, 10:4, pp. 14-27, 1986.
  4. J. A. Moorer, "The Use of the Phase Vocoder in Computer Music Applications," Journal of the Audio Engineering Society, 26:1/2, pp. 42-45, 1978.
  5. M. R. Portnoff, "Implementation of the Digital Phase Vocoder Using the Fast Fourier Transform," IEEE Trans. Acous., Speech, and Signal Proc., 24, pp. 243-248, 1976.
  6. J. Laroche and M. Dolson, "Phase-vocoder: About this phasiness business," in Proc. IEEE ASSP Workshop on App. of Sig. Proc. to Audio and Acous., New Paltz, NY, 1997.
  7. M. Puckette, "Phase-locked vocoder," in Proc. IEEE ASSP Workshop on App. of Sig. Proc. to Audio and Acous., New Paltz, NY, 1995.
  8. J. Laroche and M. Dolson, "New Phase-Vocoder Techniques for Pitch-Shifting, Harmonizing and Other Exotic Effects," in Proc. IEEE Workshop on App. of Sig. Proc. to Audio and Acous., New Paltz, NY, 1999.

LSB Audio, LLC © 2008-2010