Beginning HTML5 Media. Make the most of the new video and audio standards for the Web (2015)

CHAPTER 6


Manipulating Audio Through the Web Audio API

When it comes to the Web, audio is...well...it is just there. This is not to denigrate audio but, in many respects, audio is treated as either an afterthought or an annoyance. Yet its importance can’t be overstated. From simple effects such as click sounds that act as user feedback to voice-over narrations describing products or events, audio is a major communication medium which, as one of the authors likes to say, “seals the deal.”

The key aspect of audio is that when it is digitized it can be manipulated. To do this one needs to stop regarding audio as sound and see it for what it really is: data that can be manipulated. And that brings us to the subject of this chapter: how to manipulate sound data in a web browser.

The Web Audio API (Application Programming Interface) complements the features we just learned about for manipulating video data. This enables the development of sophisticated web-based games or audio production applications where the audio can be dynamically created and modified in JavaScript. It also enables the visualization of audio data and the analysis of the data, for example, to determine a beat or identify which instruments are playing or whether a voice you are hearing is female or male.

The Web Audio API (http://webaudio.github.io/web-audio-api/) is a specification being developed by the W3C Audio Working Group. This specification has been implemented in all major desktop browsers except for Internet Explorer (IE). Microsoft has added it to its development roadmap, so we can expect eventual universal support. Safari is using a webkit prefix in its implementation at this point in time. Mozilla used to implement a simpler audio processing specification called the “Audio Data API” but has since replaced it with the Web Audio API specification.

Note  The W3C Audio Working Group has also developed a Web MIDI API specification (www.w3.org/TR/webmidi/), but it is currently only available as a trial implementation in Google Chrome, so we won’t explain this API at this time.

Before we start it would be useful to review the basics of digital audio.

Bitdepth and Samplerates

We traditionally visualize sound as a sine wave—the closer together the waves, the higher the frequency and therefore the pitch of the sound. The height of the waves is called the amplitude of the signal, and the higher the wave, the louder the sound. These waves, an example of which is shown in Figure 6-1, are called the waveform. The horizontal line is time, and if the signal doesn’t leave the horizontal line, that’s silence.


Figure 6-1. A typical waveform from Adobe Audition CC 2014

For any sound to be digitized, like a color image in Fireworks or Photoshop, the wave needs to be sampled. A sample is nothing more than a snapshot of a waveform taken at fixed intervals. An audio CD, for example, is sampled 44,100 times per second, which is traditionally identified as 44.1kHz. The value sampled at the snapshot time is a digital number representing the volume at that time. How often the waveform is sampled each second is called the samplerate. The higher the samplerate, the more accurately the original analog sound is represented digitally. The downside, of course, is that the higher the samplerate, the larger the file size.

Bitdepth is the resolution of the sample value. A bitdepth of 8 bits means that the snapshot is represented as a number ranging from –128 to 127 (i.e., the value fits in 8 bits). A bitdepth of 16 bits means that the number is between –32,768 to 32,767. If you do the math, you see that an 8-bit snapshot has 256 potential values per sample, whereas its 16-bit counterpart has just over 65,000 potential values per sample. The greater the number of potential values of a sample, the greater the dynamic range that the digital file can represent.
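The arithmetic behind these numbers can be checked with a small calculation. The following sketch (plain JavaScript; the helper names are ours, not part of any API) computes the number of levels a given bitdepth provides and the resulting dynamic range in decibels:

```javascript
// Number of distinct values a sample can take at a given bitdepth.
function sampleLevels(bits) {
  return Math.pow(2, bits); // e.g. 256 levels for 8 bits
}

// Approximate dynamic range in decibels: 20 * log10(levels),
// which works out to roughly 6 dB per bit.
function dynamicRangeDb(bits) {
  return 20 * Math.log10(sampleLevels(bits));
}

console.log(sampleLevels(8));               // 256
console.log(sampleLevels(16));              // 65536
console.log(dynamicRangeDb(8).toFixed(1));  // 48.2
console.log(dynamicRangeDb(16).toFixed(1)); // 96.3
```

This is why 16-bit audio is described as having roughly 96 dB of dynamic range while 8-bit audio has only about 48 dB.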

Stereo signals have a waveform for each ear. Each of these waveforms gets digitized, individually, into a sequence of samples. They are typically stored as a sequence of pairs, which get separated for playback into their individual channels.

When the numbers are played back in the order in which they were sampled and at the frequency they were sampled, they recreate the sound’s waveform. Obviously, a larger bitdepth and higher samplerate mean that the waveform is played back with greater accuracy: more snapshots of the waveform, each stored with more precision, result in a more faithful representation of the waveform. This explains why the songs from an album have such massive file sizes. They are sampled at the highest possible bitdepth and samplerate.

The three most common samplerates used are 11.025kHz, 22.05kHz, and 44.1kHz. If you reduce the samplerate from 44.1kHz to 22.05kHz, you achieve a reduction of 50% in file size. You obtain an even more significant reduction, another 50%, if the rate is reduced to 11.025kHz. The problem is that reducing the samplerate reduces audio quality. Listening to Beethoven’s Ninth Symphony at 11.025kHz results in the music sounding as if it were playing from the inside of a tin can.
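Assuming uncompressed PCM audio, the file-size effect of samplerate, bitdepth, and channel count can be sketched with a small helper (the function name is ours):

```javascript
// Uncompressed (PCM) audio size in bytes:
// samplerate * (bitdepth / 8) * channels * seconds.
function pcmBytes(sampleRate, bitDepth, channels, seconds) {
  return sampleRate * (bitDepth / 8) * channels * seconds;
}

var cdMinute   = pcmBytes(44100, 16, 2, 60); // one minute, CD quality
var halfMinute = pcmBytes(22050, 16, 2, 60); // same minute at 22.05kHz

console.log(cdMinute);              // 10584000 (~10 MB)
console.log(halfMinute / cdMinute); // 0.5 - halving the samplerate halves the size
```

Running the same calculation with 11.025kHz shows the second 50% reduction mentioned above.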

As a web designer or developer, your prime objective is to obtain the best quality sound at the smallest file size. Though many developers will tell you that 16-bit, 44.1kHz stereo is the way to go, you’ll quickly realize this is not necessarily true. For example, a 16-bit, 44.1kHz stereo sound of a mouse click or a sound lasting less than a couple of seconds—such as a whoosh as an object zips across the screen—is a waste of bandwidth. The duration is so short and the frequencies represented in the sound so limited that average users won’t realize it if you’ve made your click an 8-bit, 22.05kHz mono sound. They hear the click and move on. The same holds true for music files. The average user is most likely listening through the cheap speakers that were tossed in when he bought his PC. In this case, a 16-bit, 22.05kHz soundtrack will sound as good as its CD-quality rich cousin.

The HTML5 Audio Formats

In Chapter 1, we already discussed the three audio formats used for HTML5: MP3, WAV, and Ogg Vorbis. These are all encoded audio formats (i.e., the raw samples of an audio waveform are compressed so they take up less space and can be transmitted faster over the Internet). All of these formats use perceptual encoding, which means they throw away from the audio stream all information that is not typically perceived by humans. When information gets tossed in this way, there is a corresponding decrease in file size. The information tossed during encoding includes sound frequencies your dog may be able to hear but you can’t. In short, you hear only the sound a human can perceive (and this sort of explains why animals aren’t huge fans of iPods).

All perceptual encoders allow you to choose how much of the audio information is treated as unimportant. Most encoders produce excellent-quality voice recordings using no more than 16 Kbps. When you create, for example, an MP3, you need to pay attention to the bitrate. The format is fine, but if the bitrate is not optimized for its intended use, your results will be unacceptable, which is why applications that create MP3 files ask you to set the bitrate along with the samplerate.

In this chapter we deal with the raw audio samples and manipulate them to achieve professional audio effects. The browser takes care of decoding the compressed audio files for us.

So much for generalities; let’s get practical and manipulate the ones and zeros that are at the heart of audio data by using the Web Audio API. We start with filter graphs and the AudioContext.

Filter Graphs and the AudioContext

The Web Audio API specification is based on the idea of building a graph of connected AudioNode objects to define the overall audio rendering. This is very similar to the filter graph idea that is the basis of many media frameworks, including DirectShow, GStreamer, and also JACK, the Audio Connection Kit.

The idea behind a filter graph is that one or more input signals (in our case: the sound signals) are connected to a destination renderer (the sound output) by sending the input signals through a sequence of filters (sound modifiers) that modify the input data in a specific way.

The term audio filter can mean anything that changes the timbre, harmonic content, pitch, or waveform of an audio signal. The specification includes filters for various audio uses including the following:

·     Spatialized audio to move sounds in the 3D space.

·     A convolution engine to simulate acoustic spaces.

·     Real-time frequency analysis to determine the composition of a sound.

·     Frequency filters to extract certain frequency regions.

·     Sample-accurate scheduled sound playback.

A filter graph in the Web Audio API is contained within an AudioContext and consists of connected AudioNode objects as demonstrated in Figure 6-2.


Figure 6-2. Concept of a filter graph in the Web Audio API

As you can see, there are AudioNode objects without incoming connections—these are called source nodes. They have no inputs and a single output. Examples are microphone input, media elements, remote microphone inputs (when connected via WebRTC), raw audio samples stored in a memory buffer, or artificial sound sources like oscillators.

Note  WebRTC (Web Real-Time Communication) is an API definition drafted by the World Wide Web Consortium (W3C) that supports browser-to-browser applications for voice calling, video chat, and P2P file sharing without the need for internal or external plug-ins. It is a huge topic and is beyond the scope of this book.

AudioNode objects without outgoing connections are called destination nodes, and they only have one incoming connection. Examples are audio output devices (speakers) and remote output devices (when connected via WebRTC).

Other AudioNode objects in the middle may have multiple input connections and/or multiple output connections and are intermediate processing nodes.

The developer doesn’t have to worry about low-level stream format details when two AudioNode objects are connected; the right thing just happens. For example, if a mono audio filter has a stereo input connected, it will just receive a mix of the left and right channels.

To get you started, let’s create what could be considered the “Hello World” of web audio applications. The examples in this chapter are provided in full at http://html5videoguide.net/ so you can follow along. Listing 6-1 shows a simple example, where an oscillator source is connected to the default speaker destination node. You’ll hear a sound wave at 1kHz frequency. A word of warning: we will be dealing with audio samples and files, so you might want to make sure the volume of your computer is lowered.

Listing 6-1. A Simple Filter Graph of an Oscillator Source Node and a Sound Output Destination Node

// create web audio api context
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();

// create Oscillator node
var oscillator = audioCtx.createOscillator();
oscillator.connect(audioCtx.destination);
oscillator.type = 'square';
oscillator.frequency.value = 1000; // value in hertz

oscillator.start(0);

Note  If you want to hear different tones, simply change the number in oscillator.frequency.value. If you don’t set a frequency value, the default is 440Hz which, for the musically inclined, is the A above middle C. Why that value? If you have ever been to a symphony and heard the members of the orchestra tune their instruments, that tone has become the standard for concert pitch.

The oscillator.start() function has an optional argument to describe at what time in seconds the sound should start playing. Unfortunately, in Safari it is not optional. So make sure you add the 0.
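As a sketch of this scheduling, the following hypothetical helper plays a short beep by scheduling both start() and stop() against the context’s clock (the function name and parameters are our invention, not part of the API):

```javascript
// Play a beep of the given frequency and duration by scheduling
// start() and stop() against the AudioContext's currentTime clock.
function beep(audioCtx, frequency, durationSeconds) {
  var osc = audioCtx.createOscillator();
  osc.type = 'square';
  osc.frequency.value = frequency;
  osc.connect(audioCtx.destination);
  osc.start(audioCtx.currentTime);                   // start now
  osc.stop(audioCtx.currentTime + durationSeconds);  // stop a bit later
  return osc;
}

// Usage (in a browser):
// var ctx = new (window.AudioContext || window.webkitAudioContext)();
// beep(ctx, 1000, 0.25); // a quarter-second 1kHz beep
```

Passing an explicit time to start() also sidesteps the Safari quirk mentioned above.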

Figure 6-3 shows the filter graph that we have created.


Figure 6-3. Simple filtergraph examples in the Web Audio API

Let’s dig into the different constructs that the example has just introduced.

The AudioContext Interface

The AudioContext provides the environment in which AudioNode objects are created and connected to each other. All types of AudioNode objects are defined within the AudioContext shown in the following code. There are a lot of AudioNode objects and we’ll approach this object in steps and just explain bits of what we need in every step of this chapter.

[Constructor]
interface AudioContext : EventTarget {
     readonly    attribute AudioDestinationNode destination;
     readonly    attribute float                sampleRate;
     readonly    attribute double               currentTime;
     Promise                         suspend ();
     Promise                         resume ();
     Promise                         close ();
     readonly    attribute AudioContextState    state;
                 attribute EventHandler         onstatechange;
     OscillatorNode                  createOscillator ();
      ...
};
interface AudioDestinationNode : AudioNode {
    readonly    attribute unsigned long maxChannelCount;
};
enum AudioContextState {
    "suspended",
    "running",
    "closed"
};

Every AudioContext contains a single read-only AudioDestinationNode. The destination is typically the computer-connected audio output device such as speakers or headphones. Scripts will connect all audio that is meant to be heard by the user to this node as the “terminal” node in the AudioContext's filter graph. You can see that we’ve created the oscillator by using the AudioContext object and calling createOscillator() on it and set some of the parameters of the oscillator. Then we’ve connected it to the destination for speaker/headphone output by calling the connect() function on the oscillator object.

The sampleRate of the AudioContext is fixed for the lifetime of the AudioContext and sets the sampling rate for all AudioNodes in the context. Thus, no samplerate conversion is possible within an AudioContext. By default, the sampleRate is 44,100Hz.

The currentTime of the AudioContext is a time, in seconds, that represents the age of the AudioContext (i.e., it starts at zero when the context is created and increases in real time). All scheduled times are relative to it. It is important to keep in mind that all events in the AudioContext run against this clock and it progresses in fractions of seconds.

The suspend(), resume(), and close() calls will influence the currentTime and suspend, resume, or stop its increase. They also influence whether the AudioContext has control over the audio hardware. After a call to close(), the AudioContext becomes unusable for the creation of new nodes. The AudioContextState represents the state that the AudioContext is in: suspended, running, or closed.
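As a sketch of how these calls might be combined once browsers support them, the following hypothetical helper toggles an AudioContext between running and suspended, feature-testing first since support is still limited:

```javascript
// Toggle an AudioContext between 'running' and 'suspended'.
// suspend() and resume() return Promises per the specification;
// the helper name and the feature tests are ours.
function togglePlayback(audioCtx) {
  if (audioCtx.state === 'running' && audioCtx.suspend) {
    return audioCtx.suspend();  // clock stops; audio hardware is released
  } else if (audioCtx.state === 'suspended' && audioCtx.resume) {
    return audioCtx.resume();   // clock continues from where it stopped
  }
  return Promise.resolve();     // 'closed' (or unsupported): do nothing
}
```

Remember that after close() the context is unusable, so a toggle like this only makes sense between the suspended and running states.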

Note  Browsers do not currently support the suspend(), resume(), and close() functions or the state attribute.

Listing 6-2 shows a couple of the parameters of an AudioContext with the result displayed in Figure 6-4.

Listing 6-2. The Parameters of the AudioContext

<div id="display"></div>
    <script type="text/javascript">
      var display = document.getElementById("display");
      var context = new (window.AudioContext || window.webkitAudioContext)();
      display.innerHTML  = context.sampleRate + " sampling rate<br/>";
      display.innerHTML += context.destination.numberOfChannels
                                               + " output channels<br/>";
      display.innerHTML += context.currentTime + " currentTime<br/>";
    </script>


Figure 6-4. The parameters of the AudioContext by default in Chrome

The number of output channels is unknown until a sound is played through the AudioDestinationNode.

Let’s take another look at that oscillator. It’s of the type OscillatorNode which contains a few attributes that we can manipulate.

interface OscillatorNode : AudioNode {
                attribute OscillatorType type;
    readonly    attribute AudioParam     frequency;
    readonly    attribute AudioParam     detune;
    void start (optional double when = 0);
    void stop (optional double when = 0);
    void setPeriodicWave (PeriodicWave periodicWave);
                attribute EventHandler   onended;
};
enum OscillatorType {
    "sine",
    "square",
    "sawtooth",
    "triangle",
    "custom"
};

The first attribute is the OscillatorType, which we set to “square” in our example. You can change it in the example and you will notice how the timbre of the tone changes while its frequency stays the same.

The frequency is an AudioParam object which we’ll look at in just a minute. It has a value that can be set—and we set it to 1,000Hz in our example.

The OscillatorNode further has a detune attribute, which offsets the frequency by the given number of cents (hundredths of a semitone). Its default is 0. Detuning can help make a note sound more natural.

The start() and stop() methods on the OscillatorNode determine when an oscillator starts and stops in reference to the currentTime of the AudioContext. Note that you can call start() and stop() only once because they define the extent of the sound’s existence. You can, however, connect and disconnect the oscillator from the AudioDestinationNode (or whichever is the next AudioNode in the filter graph) to pause/unpause the sound.
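The pause/unpause technique just described can be sketched with a small wrapper (the helper names are ours, not part of the API):

```javascript
// "Pause" an already-started oscillator by disconnecting it from the
// graph, since start() and stop() may each be called only once.
function makePausable(osc, destination) {
  var connected = false;
  return {
    play: function () {
      if (!connected) { osc.connect(destination); connected = true; }
    },
    pause: function () {
      if (connected) { osc.disconnect(); connected = false; }
    }
  };
}

// Usage (in a browser):
// var ctx = new (window.AudioContext || window.webkitAudioContext)();
// var osc = ctx.createOscillator();
// osc.start(0);                    // the tone "exists" from now on
// var control = makePausable(osc, ctx.destination);
// control.play();                  // audible
// control.pause();                 // silent, but the oscillator keeps running
```

Note that the oscillator keeps running silently while disconnected; its clock is not paused.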

The setPeriodicWave() function allows setting a custom oscillator waveform. Use the createPeriodicWave() function of the AudioContext to create a custom waveform using arrays of Fourier coefficients, being the partials of the periodic waveform. Unless you’re writing a synthesizer, you probably don’t have to understand this.
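As a sketch of a custom waveform, the following sets up arrays of Fourier coefficients and (in a browser) hands them to createPeriodicWave(); the coefficient values are purely illustrative:

```javascript
// Fourier coefficients for a custom waveform. Element 0 is the DC
// offset (ignored); element n is the amplitude of the n-th harmonic.
// These particular values are just an illustration.
var real = new Float32Array([0, 1, 0.5, 0.25]); // cosine terms
var imag = new Float32Array([0, 0, 0, 0]);      // sine terms

// In a browser:
// var ctx = new (window.AudioContext || window.webkitAudioContext)();
// var wave = ctx.createPeriodicWave(real, imag);
// var osc = ctx.createOscillator();
// osc.setPeriodicWave(wave);        // type becomes "custom"
// osc.connect(ctx.destination);
// osc.start(0);
```

Each harmonic you add above the fundamental changes the timbre, which is exactly what a synthesizer designer would exploit.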

The AudioParam Interface

The AudioParam object type that we just used for the frequency and detune attributes of the OscillatorNode is actually quite important, so let’s try to understand it better. It is core to any audio processing that an AudioNode undertakes in a filter graph, since it holds the parameters that control key aspects of AudioNodes. In our example, it’s the frequency at which the oscillator runs. We can change that frequency at any time through the value parameter. That means that an event is scheduled to change the oscillator’s frequency at the next possible instant.

Since browsers can be quite busy, it’s not foreseeable when this event will happen. That’s probably okay in our example, but if you are a musician, you will want to be very accurate with your timing. Therefore, every AudioParam maintains a list of time-ordered change events. The times at which the changes are scheduled are in the time coordinate system of the AudioContext’s currentTime attribute. The events either initiate changes immediately or start/end them.

Following are the components of the AudioParam interface:

interface AudioParam {
                attribute float value;
    readonly    attribute float defaultValue;
    void setValueAtTime (float value, double startTime);
    void linearRampToValueAtTime (float value, double endTime);
    void exponentialRampToValueAtTime (float value, double endTime);
    void setTargetAtTime (float target, double startTime, float timeConstant);
    void setValueCurveAtTime (Float32Array values, double startTime,
                              double duration);
    void cancelScheduledValues (double startTime);
};

The event list is maintained internally by the AudioContext. The following methods can change the event list by adding a new event into the list. The type of event is defined by the method. These methods are called automation methods.

·     setValueAtTime() tells the AudioNode to change its AudioParam to value at the given startTime.

·     linearRampToValueAtTime() tells the AudioNode to ramp its AudioParam linearly to value by the given endTime. The ramp starts either from “right now” or from the previous event in the event list of the AudioNode.

·     exponentialRampToValueAtTime() tells the AudioNode to ramp up its AudioParam value using an exponential continuous change from the previous scheduled parameter value to the given value by a given endTime. Parameters representing filter frequencies and playback rate are best changed exponentially because of the way humans perceive sound.

·     setTargetAtTime() tells the AudioNode to start exponentially approaching the target value at the given startTime with a rate having the given timeConstant. Among other uses, this is useful for implementing the “decay” and “release” portions of an ADSR (Attack-Decay-Sustain-Release) envelope. The parameter value does not immediately change to the target value at the given time, but instead gradually changes to the target value. The larger the timeConstant, the slower the transition.

·     setValueCurveAtTime() tells the AudioNode to adapt its value following an array of arbitrary parameter values starting at the given startTime for the given duration. The number of values will be scaled to fit into the desired duration.

·     cancelScheduledValues() tells the AudioNode to cancel all scheduled parameter changes with times greater than or equal to startTime.
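The automation methods above can be combined into, for instance, a simple attack/decay/release envelope on a gain parameter, as mentioned under setTargetAtTime(). The following is a sketch with illustrative timings, not a recipe from the specification:

```javascript
// Schedule an attack/decay/release envelope on a gain AudioParam.
// The timings and levels here are illustrative only.
function applyEnvelope(gainParam, now) {
  gainParam.cancelScheduledValues(now);
  gainParam.setValueAtTime(0, now);                // start silent
  gainParam.linearRampToValueAtTime(1, now + 0.1); // attack: 100 ms to full
  gainParam.setTargetAtTime(0.6, now + 0.1, 0.2);  // decay toward sustain level
  gainParam.setTargetAtTime(0, now + 1.0, 0.3);    // release after 1 s
}

// Usage (in a browser):
// var ctx = new (window.AudioContext || window.webkitAudioContext)();
// var gain = ctx.createGain();
// applyEnvelope(gain.gain, ctx.currentTime);
```

The two setTargetAtTime() calls never quite reach their targets; they approach them exponentially, which is what gives decays and releases their natural sound.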

Scheduled events are useful for such tasks as envelopes, volume fades, LFOs (low-frequency oscillations), filter sweeps, or grain windows. We’re not going to explain these, but professional musicians will know what they are. It is only important you understand that the automation methods provide a mechanism to change a parameter value from one value to another at given time instances and that there are different curves the change can follow. In this way, arbitrary timeline-based automation curves can be set on any AudioParam. We’ll see what that means in an example.

Using a timeline, Figure 6-5 shows an automation plan for changing the frequency of an oscillator employing all the preceding introduced methods. In black are the setValueAtTime calls, each with their value as a black cross and their startTime on the currentTime timeline. The exponentialRampToValueAtTime and linearRampToValueAtTime calls have a target value (gray cross) at an endTime (gray line on the timeline). The setTargetAtTime call has a startTime (gray line on the timeline) and a target value (gray cross). The setValueCurveAtTime call has a startTime, a duration, and a number of values it goes through during that time. All of these combine to create the beeping tone and the changes you will hear when you test the code.


Figure 6-5. AudioParam automation for oscillator frequency

Listing 6-3 shows how we adapted Listing 6-1 with this automation.

Listing 6-3. A Frequency Automation for an Oscillator

var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var oscillator = audioCtx.createOscillator();
var freqArray = new Float32Array(5);
freqArray[0] = 4000;
freqArray[1] = 3000;
freqArray[2] = 1500;
freqArray[3] = 3000;
freqArray[4] = 1500;

oscillator.type = 'square';
oscillator.frequency.value = 100; // value in hertz

oscillator.connect(audioCtx.destination);
oscillator.start(0);
oscillator.frequency.cancelScheduledValues(audioCtx.currentTime);
oscillator.frequency.setValueAtTime(500, audioCtx.currentTime + 1);
oscillator.frequency.exponentialRampToValueAtTime(4000,
                                                  audioCtx.currentTime + 4);
oscillator.frequency.setValueAtTime(3000, audioCtx.currentTime + 5);
oscillator.frequency.linearRampToValueAtTime(1000,
                                             audioCtx.currentTime + 8);
oscillator.frequency.setTargetAtTime(4000, audioCtx.currentTime + 10, 1);
oscillator.frequency.setValueAtTime(1000, audioCtx.currentTime + 12);
oscillator.frequency.setValueCurveAtTime(freqArray,
                                         audioCtx.currentTime + 14, 4);

AudioNodes can actually process value calculations either on each individual audio sample or on blocks of 128 samples. An AudioParam will specify that it is an a-rate parameter, when it needs the value calculation to be done individually for each sample of the block. It will specify that it’s a k-rate parameter, when only the first sample of the block is calculated and the resulting value is used for the entire block.

The frequency and detune parameters of the oscillator are both a-rate parameters, since each individual audio sample potentially needs adjustment of frequency and detune.

An a-rate AudioParam takes the current audio parameter value for each sampleframe of the audio signal.

A k-rate AudioParam uses the same initial audio parameter value for the whole block processed (i.e., 128 sampleframes).

The AudioNode Interface

Earlier we became acquainted with the OscillatorNode, which is a type of AudioNode—the building blocks of the filter graph. An OscillatorNode is a source node. We’ve also become acquainted with one type of destination node in the filter graph: the AudioDestinationNode. It’s time we take a deeper look into the AudioNode interface itself, since this is where the connect() function of the OscillatorNode originates from.

interface AudioNode : EventTarget {
    void connect (AudioNode destination, optional unsigned long output = 0,
                  optional unsigned long input = 0);
    void connect (AudioParam destination, optional unsigned long output = 0 );
    void disconnect (optional unsigned long output = 0);
    readonly    attribute AudioContext          context;
    readonly    attribute unsigned long         numberOfInputs;
    readonly    attribute unsigned long         numberOfOutputs;
                attribute unsigned long         channelCount;
                attribute ChannelCountMode      channelCountMode;
                attribute ChannelInterpretation channelInterpretation;
};

An AudioNode can only belong to a single AudioContext, stored in the context attribute.

The first connect() method in AudioNode connects it to another AudioNode. There can only be one connection between a given output of one specific node and a given input of another node. The output parameter specifies the output index from which to connect and similarly the input parameter specifies which input index of the destination AudioNode to connect to.

The numberOfInputs attribute of AudioNode provides the number of inputs feeding into the AudioNode and the numberOfOutputs provides the number coming out of the AudioNode. Source nodes have 0 inputs and destination nodes 0 outputs.

An AudioNode may have more outputs than inputs; thus fan-out is supported. It may also have more inputs than outputs, which supports fan-in. It is possible to connect an AudioNode to another AudioNode and back, creating a cycle. This is allowed only if there is at least one DelayNode in the cycle or you’ll get a NotSupportedError exception.
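As a sketch of a legal cycle, the following builds a simple feedback echo: the source feeds a DelayNode, whose output is attenuated by a GainNode and fed back into the delay. Because the loop contains a DelayNode, no NotSupportedError is raised. The function name and parameter values are illustrative:

```javascript
// Build a feedback echo around an existing source node.
// The cycle (delay -> gain -> delay) is legal because it contains
// a DelayNode.
function buildEcho(audioCtx, source) {
  var delay = audioCtx.createDelay();
  delay.delayTime.value = 0.3;      // 300 ms between echoes
  var feedback = audioCtx.createGain();
  feedback.gain.value = 0.4;        // each echo at 40% of the previous one

  source.connect(audioCtx.destination); // dry signal
  source.connect(delay);
  delay.connect(feedback);
  feedback.connect(delay);              // the cycle
  delay.connect(audioCtx.destination);  // wet (echoed) signal
}
```

A feedback gain below 1 makes the echoes die out; at 1 or above the signal would build up indefinitely.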

Each input and output of a nonsource and nondestination AudioNode has one or more channels. The exact number of inputs, outputs, and their channels depends on the type of AudioNode.

The channelCount attribute contains the number of channels that the AudioNode inherently deals with. By default, it’s 2, but that may be overwritten by an explicit new value for this attribute, or through the channelCountMode attribute.

enum ChannelCountMode {
    "max",
    "clamped-max",
    "explicit"
};

The values have the following meaning:

·     When the channelCountMode is “max,” the number of channels that the AudioNode deals with is the maximum number of channels of all the input and output connections and the channelCount is ignored.

·     When the channelCountMode is “clamped-max,” the number of channels that the AudioNode deals with is the maximum number of channels of all the input and output connections, but a maximum of channelCount.

·     When the channelCountMode is “explicit,” the number of channels that the AudioNode deals with is determined by channelCount.

For each input, an AudioNode does mixing (usually an upmixing) of all connections to that node. When the channels of the inputs need to be down- or upmixed, the channelInterpretation attribute determines how this down- or upmixing should be treated.

enum ChannelInterpretation { "speakers", "discrete" };

When channelInterpretation is “discrete,” upmixing is done by filling channels in order until the input channels run out and then zeroing out the remaining channels; downmixing is done by filling as many output channels as possible and dropping the rest.

If channelInterpretation is set to “speakers,” then the upmixing and downmixing are defined for specific channel layouts:

·     1 channel: mono (channel 0)

·     2 channels: left (channel 0), right (channel 1)

·     4 channels: left (ch 0), right (ch 1), surround left (ch 2), surround right (ch 3)

·     5.1 channels: left (ch 0), right (ch 1), center (ch 2), subwoofer (ch 3), surround left (ch 4), surround right (ch 5)

Upmixing works as follows:

·     mono: copy to left & right channels (for 2 & 4), copy to center (for 5.1)

·     stereo: copy to left and right channels (for 4 and 5.1)

·     4 channels: copy to left and right and surround left and surround right (for 5.1)

Every other channel stays at 0.

Downmixing works as follows:

·     Mono downmix:

·        2 -> 1: output = 0.5 * (left + right)

·        4 -> 1: output = 0.25 * (left + right + surround left + surround right)

·        5.1 -> 1: output = 0.7071 * (left + right) + center

                 + 0.5 * (surround left + surround right)

·     Stereo downmix:

·        4 -> 2: left = 0.5 * (left + surround left)

             right = 0.5 * (right + surround right)

·        5.1 -> 2: left = left + 0.7071 * (center + surround left)

                 right = right + 0.7071 * (center + surround right)

·     Quad downmix:

·        5.1 -> 4: left = left + 0.7071 * center

                 right = right + 0.7071 * center

                surround left = surround left

                surround right = surround right

Figure 6-6 describes a hypothetical input and output scenario for an AudioNode with diverse sets of channels on each. If the number of channels is not 1, 2, 4, or 6, then the “discrete” interpretation is used.


Figure 6-6. Channels and inputs/outputs of an AudioNode

Finally, the second connect() method on the AudioNode object connects an AudioParam to an AudioNode. This means that the parameter value is controlled with an audio signal.

It is possible to connect an AudioNode output to more than one AudioParam with multiple calls to connect(), thus supporting fan-out and controlling multiple AudioParam settings with a single audio signal. It is also possible to connect more than one AudioNode output to a single AudioParam with multiple calls to connect(); thus fan-in is supported and controls a single AudioParam with multiple audio inputs.

There can only be one connection between a given output of one specific node and a specific AudioParam. Multiple connections between the same AudioNode and the same AudioParam are ignored.

An AudioParam will take the rendered audio data from any AudioNode output connected to it and convert it to mono by downmixing (if it is not already mono). Next, it will mix this together with any other such outputs and with the intrinsic parameter value (the value the AudioParam would normally have without any audio connections), including any timeline changes scheduled for the parameter.

We’ll demonstrate this functionality with an example of an oscillator manipulating a GainNode’s gain setting—a so-called LFO. Gain merely means increasing the power of the signal, which results in an increase in its volume. The GainNode is placed between a frequency-fixed oscillator and the destination node, and its gain setting is driven by the LFO, causing the fixed tone to be rendered at an oscillating gain (see Listing 6-4).

Listing 6-4. An Oscillator’s Gain Is Manipulated by Another Oscillator

var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var oscillator = audioCtx.createOscillator();

// second oscillator that will be used as an LFO
var lfo = audioCtx.createOscillator();
lfo.type = 'sine';
lfo.frequency.value = 2.0; // 2Hz: low-frequency oscillation

// create a gain whose gain AudioParam will be controlled by the LFO
var gain = audioCtx.createGain();
lfo.connect(gain.gain);

// set up the filter graph and start the nodes
oscillator.connect(gain);
gain.connect(audioCtx.destination);
oscillator.start(0);
lfo.start(0);

When running Listing 6-4, you will hear a tone of frequency 440Hz (the default frequency of the oscillator) whose gain pulsates twice per second (the LFO’s output of -1 to 1 is added to the gain parameter’s intrinsic value of 1, so the gain oscillates between 0 and 2). Figure 6-7 shows the setup of the filter graph. Pay particular attention to the fact that the lfo OscillatorNode is connected to the gain parameter of the GainNode and not to the gain node itself.

9781484204610_Fig06-07

Figure 6-7. Filter graph of an LFO manipulating the gain parameter of a GainNode

We just used another function of the AudioContext to create a GainNode.

[Constructor] interface AudioContext : EventTarget {
               ...
               GainNode                        createGain ();
               ...
}

For completeness, following is the definition of a GainNode:

interface GainNode : AudioNode {
    readonly    attribute AudioParam gain;
};

The gain parameter represents the amount of gain to apply. Its default value is 1 (no gain change). The nominal minValue is 0, but it may be negative for phase inversion. Phase inversion, in simple terms, is “flipping the signal”—think of the sine waves that we used to explain audio signals at the beginning of the book; to invert their phase means to mirror their values on the time axis. The nominal maxValue is 1, but higher values are allowed. This parameter is a-rate.
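As a sketch of what phase inversion means in practice (our own example, not from the book’s listings), inversion is simply negation of each sample value, and in a filter graph the same effect comes from a negative gain value:

```javascript
// Phase inversion expressed on raw samples: inverting is just negating
// each sample, which mirrors the waveform on the time axis.
function invertSamples(samples) {
  return samples.map(function(s) { return -s; });
}

// In a filter graph, a negative gain achieves the same effect
// (browser-only; guarded so the sketch is inert elsewhere).
if (typeof window !== 'undefined' &&
    (window.AudioContext || window.webkitAudioContext)) {
  var ctx = new (window.AudioContext || window.webkitAudioContext)();
  var osc = ctx.createOscillator();
  var inverter = ctx.createGain();
  inverter.gain.value = -1; // flips the signal's phase
  osc.connect(inverter);
  inverter.connect(ctx.destination);
  osc.start(0);
}
```

Mixing an inverted signal with its original cancels to silence, which is the basis of noise cancellation.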

A GainNode takes one input and creates one output. Its ChannelCountMode is “max” (i.e., it deals with as many channels as it is given) and its ChannelInterpretation is “speakers” (i.e., up- or downmixing is performed for the output).

Reading and Generating Audio Data

Thus far we have created audio data via an oscillator. In general, you will, however, want to read an audio file, then take the audio data and manipulate it.

The AudioContext provides functionality to do this.

[Constructor] interface AudioContext : EventTarget {
               ...
Promise<AudioBuffer>  decodeAudioData (ArrayBuffer audioData,
                      optional DecodeSuccessCallback successCallback,
                      optional DecodeErrorCallback errorCallback);
AudioBufferSourceNode createBufferSource();
               ...
}
callback DecodeErrorCallback   = void (DOMException error);
callback DecodeSuccessCallback = void (AudioBuffer decodedData);

The decodeAudioData() function asynchronously decodes the audio file data contained in the ArrayBuffer. To use it, we first have to fetch the audio file into an ArrayBuffer. Then we can decode it into an AudioBuffer and hand that AudioBuffer to an AudioBufferSourceNode. Now it is in an AudioNode and can be connected through the filter graph (e.g., played back via a destination node).

The XHR (XMLHttpRequest) interface is made for fetching data from a server. We will use it to get the file data into an ArrayBuffer. We’ll assume you’re familiar with XHR, since it’s not a media-specific interface.

In Listing 6-5, we retrieve the file “transition.wav” using XHR, then decode the received data with the AudioContext’s decodeAudioData() function and hand the resulting AudioBuffer to an AudioBufferSourceNode.

Image Note  Thanks to CadereSounds for making the “transition.wav” sample available under a Creative Commons license on freesound (see www.freesound.org/people/CadereSounds/sounds/267125/).

Listing 6-5. Fetching a Media Resource Using XHR

var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var source = audioCtx.createBufferSource();

var request = new XMLHttpRequest();
var url = 'audio/transition.wav';

function requestData(url) {
  request.open('GET', url, true);
  request.responseType = 'arraybuffer';
  request.send();
}

function receivedData() {
  if ((request.status === 200 || request.status === 206)
      && request.readyState === 4) {
    var audioData = request.response;
    audioCtx.decodeAudioData(audioData,
      function(buffer) {
        source.buffer = buffer;
        source.connect(audioCtx.destination);
        source.loop = true;
        source.start(0);
      },
      function(error) {
        console.log('Error with decoding audio data: ' + error.message);
      }
    );
  }
}

request.addEventListener('load', receivedData, false);
requestData(url);

First we define a function that does the XHR request for the file, then the receivedData() function is called after network retrieval. If the retrieval was successful, we hand the resulting ArrayBuffer to decodeAudioData().

Image Note  You have to upload this to a web server, because XHR regards file: URLs as untrustworthy. You could also use the FileReader.readAsArrayBuffer(File) API instead of XHR, which doesn’t have this restriction.
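Since decodeAudioData() also returns a Promise, the XHR code in Listing 6-5 can be written more compactly with the newer fetch() API. The following is a sketch of that approach (our own, not one of the book’s listings), assuming the same “transition.wav” file:

```javascript
// Sketch: fetch() the file, decode it with the promise-returning form of
// decodeAudioData(), and play it back in a loop.
function loadAndPlay(audioCtx, url) {
  return fetch(url)
    .then(function(response) { return response.arrayBuffer(); })
    .then(function(data) { return audioCtx.decodeAudioData(data); })
    .then(function(buffer) {
      var source = audioCtx.createBufferSource();
      source.buffer = buffer;
      source.connect(audioCtx.destination);
      source.loop = true;
      source.start(0);
      return source;
    });
}

// Usage (browser, served from a web server):
// var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
// loadAndPlay(audioCtx, 'audio/transition.wav');
```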

Let’s look at the involved objects in order.

First, XHR gets the bytes of the audio file from the server and puts them into an ArrayBuffer. The browser can decode data in any format that an <audio> element can decode, too. The decodeAudioData() function will decode the audio data into linear PCM. If that’s successful, the data is resampled to the sample rate of the AudioContext and stored in an AudioBuffer object.

The AudioBuffer Interface

interface AudioBuffer {
    readonly    attribute float  sampleRate;
    readonly    attribute long   length;
    readonly    attribute double duration;
    readonly    attribute long   numberOfChannels;
    Float32Array getChannelData (unsigned long channel);
    void         copyFromChannel (Float32Array destination, long channelNumber,
                                  optional unsigned long startInChannel = 0);
    void         copyToChannel (Float32Array source, long channelNumber,
                                optional unsigned long startInChannel = 0);
};

This interface represents a memory-resident audio asset (e.g., for one-shot sounds and other short audio clips). Its format is non-interleaved IEEE 32-bit linear PCM with a nominal range of -1 to +1. It can contain one or more channels and may be used by one or more AudioContext objects.

You typically use an AudioBuffer for short sounds—for longer sounds, such as a music soundtrack, you should use streaming with the audio element and MediaElementAudioSourceNode.

The sampleRate attribute contains the samplerate of the audio asset.

The length attribute contains the length of the audio asset in sampleframes.

The duration attribute contains the duration of the audio asset in seconds.

The numberOfChannels attribute contains the number of discrete channels of the audio asset.

The getChannelData() method returns a Float32Array of PCM audio data for the specific channel.

The copyFromChannel() method copies the samples from the specified channel of the AudioBuffer to the destination Float32Array. An optional offset to copy the data from the channel can be provided in the startInChannel parameter.

The copyToChannel() method copies the samples from the source Float32Array to the specified channel of the AudioBuffer. An optional offset at which to start writing into the channel can be provided in the startInChannel parameter.
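As a sketch of how the two copy methods combine (our own example, not from the book), the following halves the volume of a buffer’s first channel by copying its samples out, scaling them, and writing them back:

```javascript
// Scale an array of samples in place by a factor.
function scaleSamples(samples, factor) {
  for (var i = 0; i < samples.length; i++) {
    samples[i] *= factor;
  }
  return samples;
}

// Halve the volume of channel 0 of an AudioBuffer (browser-only).
function halveChannel(buffer) {
  var samples = new Float32Array(buffer.length);
  buffer.copyFromChannel(samples, 0);  // read channel 0
  scaleSamples(samples, 0.5);
  buffer.copyToChannel(samples, 0);    // write the scaled samples back
}
```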

An AudioBuffer can be added to an AudioBufferSourceNode for the audio asset to enter into a filter network.

You can create an AudioBuffer directly using the AudioContext’s createBuffer() method.

[Constructor]
interface AudioContext : EventTarget {
    ...
AudioBuffer  createBuffer (unsigned long numberOfChannels,
                           unsigned long length,
                           float sampleRate);
    ...
};

The new buffer will have the given length (in sample-frames), sampling rate, and number of channels, and will initially contain only silence. Most commonly, however, an AudioBuffer is used for storage of decoded samples.
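A buffer created this way can also be filled programmatically. The following sketch (our own example) generates one second of white noise into a mono buffer; the browser-only wiring is shown as comments:

```javascript
// Fill an array of samples with uniform white noise in [-1; 1].
function fillWithNoise(samples) {
  for (var i = 0; i < samples.length; i++) {
    samples[i] = Math.random() * 2 - 1;
  }
  return samples;
}

// Browser-only wiring: one second of mono noise at the context's rate.
// var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
// var buffer = audioCtx.createBuffer(1, audioCtx.sampleRate,
//                                    audioCtx.sampleRate);
// fillWithNoise(buffer.getChannelData(0));
// var source = audioCtx.createBufferSource();
// source.buffer = buffer;
// source.connect(audioCtx.destination);
// source.start(0);
```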

The AudioBufferSourceNode Interface

interface AudioBufferSourceNode : AudioNode {
                attribute AudioBuffer? buffer;
    readonly    attribute AudioParam   playbackRate;
    readonly    attribute AudioParam   detune;
                attribute boolean      loop;
                attribute double       loopStart;
                attribute double       loopEnd;
    void start (optional  double when = 0, optional double offset = 0,
                optional  double duration);
    void stop (optional   double when = 0);
                attribute EventHandler onended;
};

An AudioBufferSourceNode represents an audio source node with an in-memory audio asset in an AudioBuffer. As such, it has 0 inputs and 1 output. It is useful for playing short audio assets. The number of channels of the output always equals the number of channels of the AudioBuffer assigned to the buffer attribute, or is one channel of silence if buffer is NULL.

The buffer attribute contains the audio asset.

The playbackRate attribute contains the speed at which to render the audio asset. Its default value is 1. This parameter is k-rate.

The detune attribute modulates the speed at which the audio asset is rendered. Its default value is 0. Its nominal range is [-1,200; 1,200]. This parameter is k-rate.

Both playbackRate and detune are used together to determine a computedPlaybackRate value over time t:

computedPlaybackRate(t) = playbackRate(t) * pow(2, detune(t) / 1200)

The computedPlaybackRate is the effective speed at which the AudioBuffer of this AudioBufferSourceNode must be played. By default it’s 1.
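The formula translates directly into code. Since detune is measured in cents, 1,200 cents correspond to one octave:

```javascript
// computedPlaybackRate as defined above: detune is in cents,
// and 1,200 cents equal one octave.
function computedPlaybackRate(playbackRate, detune) {
  return playbackRate * Math.pow(2, detune / 1200);
}

// computedPlaybackRate(1, 1200)  → 2   (double speed, one octave up)
// computedPlaybackRate(1, -1200) → 0.5 (half speed, one octave down)
```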

The loop attribute indicates if the audio data should play in a loop. The default value is false.

The loopStart and loopEnd attributes provide an interval in seconds over which the loop should be run. By default, they go from 0 to the duration of the buffer.

The start() method is used to schedule when sound playback will happen. The playback will stop automatically when the buffer’s audio data has been completely played (if the loop attribute is false), or when the stop() method has been called and the specified time has been reached. start() and stop() may not be issued multiple times for a given AudioBufferSourceNode.

Since the AudioBufferSourceNode is an AudioNode, it has a connect() method to participate in the filter network (e.g., to connect to the audio destination for playback).

The MediaElementAudioSourceNode Interface

Another type of source node that can be used to get audio data into the filter graph is the MediaElementAudioSourceNode.

interface MediaElementAudioSourceNode : AudioNode {
};

The AudioContext provides functionality to create such a node.

[Constructor] interface AudioContext : EventTarget {
               ...
MediaElementAudioSourceNode     createMediaElementSource(HTMLMediaElement
                                                         mediaElement);
               ...
}

Together, they allow introducing the audio from an <audio> or a <video> element as a source node. As such, the MediaElementAudioSourceNode has 0 inputs and 1 output. The number of channels of the output corresponds to the number of channels of the media referenced by the HTMLMediaElement, passed in as the argument to createMediaElementSource(), or is one silent channel if the HTMLMediaElement has no audio. Once connected, the HTMLMediaElement’s audio no longer plays directly but only through the filter graph.

The MediaElementAudioSourceNode should be used over an AudioBufferSourceNode for longer media files because the MediaElementAudioSourceNode streams the resource. Listing 6-6 shows an example.

Listing 6-6. Streaming an Audio Element into an AudioContext

var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
mediaElement.addEventListener('play', function() {
  var source = audioCtx.createMediaElementSource(mediaElement);
  source.connect(audioCtx.destination);
});

We have to wait for the play event to fire to be sure that the audio has loaded and has been decoded so the AudioContext can get the data. The audio of the audio element in Listing 6-6 is played back exactly once.

The MediaStreamAudioSourceNode Interface

A final type of source node that can be used to get audio data into the filter graph is the MediaStreamAudioSourceNode.

interface MediaStreamAudioSourceNode : AudioNode {
};

This interface represents an audio source from a MediaStream, which is basically a live audio input source such as a microphone. We will not describe the MediaStream API here, since it is outside the scope of this book. However, once you have such a MediaStream object, the AudioContext provides functionality to turn the first audio MediaStreamTrack of the MediaStream into an audio source node in a filter graph.

[Constructor] interface AudioContext : EventTarget {
               ...
MediaStreamAudioSourceNode      createMediaStreamSource(MediaStream
                                                        mediaStream);
               ...
}

As such, the MediaStreamAudioSourceNode has 0 inputs and 1 output. The number of channels of the output corresponds to the number of channels of the audio MediaStreamTrack, passed in via the argument to createMediaStreamSource(), or is one silent channel if the MediaStream has no audio.

Listing 6-7 shows an example.

Listing 6-7. Fetching an Audio Stream’s Audio into an AudioContext

navigator.getUserMedia = (navigator.getUserMedia ||
                          navigator.webkitGetUserMedia ||
                          navigator.mozGetUserMedia ||
                          navigator.msGetUserMedia);
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
var source;

onSuccess = function(stream) {
   mediaElement.src = window.URL.createObjectURL(stream);
   mediaElement.onloadedmetadata = function(e) {
      mediaElement.play();
      mediaElement.muted = true;
   };

   source = audioCtx.createMediaStreamSource(stream);
   source.connect(audioCtx.destination);
};

onError = function(err) {
   console.log('The following getUserMedia error occurred: ' + err);
};

navigator.getUserMedia ({ audio: true }, onSuccess, onError);

The audio of the audio element in Listing 6-7 is played back through the filter network, despite being muted on the audio element.

Image Note  There is also an analogous MediaStreamAudioDestinationNode for rendering the output of a filter graph to a MediaStream object in preparation for streaming audio via a peer connection to another browser. The AudioContext’s createMediaStreamDestination() function creates such a destination node. This is, however, currently only implemented in Firefox.

Manipulating Audio Data

By now we have learned how to create audio data for our audio filter graph via four different mechanisms: an oscillator, an audio buffer, an audio file, and a microphone source. Next let’s look at the set of audio manipulation functions AudioContext provides to the web developer. These are standard audio manipulation functions that will be well understood by an audio professional.

Each one of these manipulation functions is represented in the filter graph via a processing node and is created via a create-function in the AudioContext:

[Constructor] interface AudioContext : EventTarget {
...
    GainNode                  createGain ();
    DelayNode                 createDelay(optional double maxDelayTime = 1.0);
    BiquadFilterNode          createBiquadFilter ();
    WaveShaperNode            createWaveShaper ();
    StereoPannerNode          createStereoPanner ();
    ConvolverNode             createConvolver ();
    ChannelSplitterNode       createChannelSplitter(optional unsigned long
                                                    numberOfOutputs = 6 );
    ChannelMergerNode         createChannelMerger(optional unsigned long
                                                  numberOfInputs = 6 );
    DynamicsCompressorNode    createDynamicsCompressor ();
               ...
}

The GainNode Interface

The GainNode represents a change in volume and is created with the createGain() method of the AudioContext.

interface GainNode : AudioNode {
    readonly    attribute AudioParam gain;
};

It causes a given gain to be applied to the input data before its propagation to the output. A GainNode always has exactly one input and one output, both with the same number of channels:

Number of inputs: 1

Number of outputs: 1

Channel count mode: “max”

Channel count: 2

Channel interpretation: “speakers”

The gain parameter is a unitless value nominally between 0 and 1, where 1 implies no gain change. The parameter is a-rate, so the gain is applied to each sample-frame and multiplied with each corresponding sample of all input channels.

Gain can be changed over time and the new gain is applied using a de-zippering algorithm in order to prevent unaesthetic “clicks” from appearing in the resulting audio.
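Gain changes can also be scheduled explicitly through the AudioParam automation methods, which likewise avoid clicks. The following sketch (our own, assuming a gainNode and audioCtx set up as elsewhere in this chapter) fades a signal out over a given number of seconds:

```javascript
// Sketch: fade a GainNode out using the AudioParam scheduling methods
// instead of assigning gain.value directly.
function fadeOut(gainNode, audioCtx, seconds) {
  var now = audioCtx.currentTime;
  // anchor the automation at the current value...
  gainNode.gain.setValueAtTime(gainNode.gain.value, now);
  // ...then ramp linearly down to silence.
  gainNode.gain.linearRampToValueAtTime(0, now + seconds);
}
```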

Listing 6-8 shows an example of manipulating the gain of an audio signal via a slider. Make sure to release the slider so its value actually changes when you try this for yourself. The filter graph consists of a MediaElementAudioSourceNode, a GainNode, and an AudioDestinationNode.

Listing 6-8. Manipulating the Gain of an Audio Signal

<audio autoplay controls src="audio/Shivervein_Razorpalm.wav"></audio>
<input type="range"  min="0" max="1" step="0.05" value="1"/>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];

var source = audioCtx.createMediaElementSource(mediaElement);
var gainNode = audioCtx.createGain();
source.connect(gainNode);
gainNode.connect(audioCtx.destination);

var slider = document.getElementsByTagName('input')[0];
slider.addEventListener('change', function() {
  gainNode.gain.value = slider.value;
});
</script>

Image Note  Remember to upload this example to a web server, because XHR regards file: URLs as untrustworthy.

The DelayNode Interface

The DelayNode delays the incoming audio signal by a certain number of seconds and is created with the createDelay() method of the AudioContext.

interface DelayNode : AudioNode {
    readonly    attribute AudioParam delayTime;
};

The default delayTime is 0 seconds (no delay). When the delay time is changed, the transition is smooth without noticeable clicks or glitches.

Number of inputs: 1

Number of outputs: 1

Channel count mode: “max”

Channel count: 2

Channel interpretation: “speakers”

The minimum value is 0 and the maximum value is determined by the maxDelayTime argument to the AudioContext method createDelay.

The delayTime parameter is a-rate, so the delay is applied per sample-frame across all input channels.

A DelayNode is often used to create a cycle of filter nodes (e.g., in conjunction with a GainNode to create a repeating, decaying echo). When a DelayNode is used in a cycle, the value of the delayTime attribute is clamped to a minimum of 128 frames (one block).

Listing 6-9 shows an example of a decaying echo.

Listing 6-9. Decaying Echo via Gain and Delay Filters

<audio autoplay controls src="audio/Big%20Hit%201.wav"></audio>
<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];

mediaElement.addEventListener('play', function() {
  var source = audioCtx.createMediaElementSource(mediaElement);
  var delay = audioCtx.createDelay();
  delay.delayTime.value = 0.5;

  var gain = audioCtx.createGain();
  gain.gain.value = 0.8;

  // play once
  source.connect(audioCtx.destination);

  // create decaying echo filter graph
  source.connect(delay);
  delay.connect(gain);
  gain.connect(delay);
  delay.connect(audioCtx.destination);
});
</script>

The audio is redirected into the filter graph using createMediaElementSource(). The source sound is directly connected to the destination for normal playback and then also fed into a delay and gain filter cycle with the decaying echo being also connected to the destination. Figure 6-8 shows the created filter graph.

9781484204610_Fig06-08

Figure 6-8. Filter graph of a decaying echo

Image Note  Thanks to robertmcdonald for making the “Big Hit 1.wav” sample available under a Creative Commons license on freesound (see www.freesound.org/people/robertmcdonald/sounds/139501/).

The BiquadFilterNode Interface

The BiquadFilterNode represents a low-order filter (see more at http://en.wikipedia.org/wiki/Digital_biquad_filter) and is created with the createBiquadFilter() method of the AudioContext. Low-order filters are the building blocks of basic tone controls (bass, mid, treble), graphic equalizers, and more advanced filters.

interface BiquadFilterNode : AudioNode {
                attribute BiquadFilterType type;
    readonly    attribute AudioParam       frequency;
    readonly    attribute AudioParam       detune;
    readonly    attribute AudioParam       Q;
    readonly    attribute AudioParam       gain;
    void getFrequencyResponse (Float32Array frequencyHz,
                               Float32Array magResponse,
                               Float32Array phaseResponse);
};

The filter parameters can be changed over time (e.g., a frequency change creates a filter sweep).

Number of inputs: 1

Number of outputs: 1

Channel count mode: “max”

Channel count: as many in the output as are in the input

Channel interpretation: “speakers”

Each BiquadFilterNode can be configured as one of a number of common filter types.

enum BiquadFilterType {
    "lowpass",
    "highpass",
    "bandpass",
    "lowshelf",
    "highshelf",
    "peaking",
    "notch",
    "allpass"
};

The default filter type is lowpass (http://webaudio.github.io/web-audio-api/#the-biquadfilternode-interface).

The frequency parameter’s default value is 350Hz. Its value starts at 10Hz and goes up to the Nyquist frequency, which is half the sampling rate (22,050Hz for the default 44.1kHz sampling rate of the AudioContext). It provides frequency characteristics depending on the filter type—for example, the cut-off frequency of the low-pass and high-pass filters, or the center frequency of the band of the bandpass filter.

The detune parameter provides a value in cents by which to detune the frequency. It defaults to 0.

Frequency and detune are a-rate parameters and together determine the computed frequency of the filter:

computedFrequency(t) = frequency(t) * pow(2, detune(t) / 1200)

The Q parameter is the quality factor of the biquad filter with a default value of 1 and a nominal range of 0.0001 to 1,000 (though 1 to 100 is most common).

The gain parameter provides the boost (in dB) to be applied to the biquad filter and has a default value of 0, with a nominal range of -40 to 40 (though 0 to 10 is most common).

The getFrequencyResponse() method calculates the frequency response for the frequencies specified in the frequencyHz frequency array and returns the linear magnitude response values in the magResponse output array and the phase response values in radians in thephaseResponse output array. This is particularly useful to visualize the filter shape.

Listing 6-10 shows an example of the different filter types applied to an audio source.

Listing 6-10. Different Biquad Filter Types Applied to an Audio Source

<audio autoplay controls src="audio/Shivervein_Razorpalm.wav"></audio>
<select class="type">
  <option>lowpass</option>
  <option>highpass</option>
  <option>bandpass</option>
  <option>lowshelf</option>
  <option>highshelf</option>
  <option>peaking</option>
  <option>notch</option>
  <option>allpass</option>
</select>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
var source = audioCtx.createMediaElementSource(mediaElement);
var bass = audioCtx.createBiquadFilter();

// Set up the biquad filter node with a low-pass filter type
bass.type = "lowpass";
bass.frequency.value = 6000;
bass.Q.value = 1;
bass.gain.value = 10;

mediaElement.addEventListener('play', function() {
  // create filter graph
  source.connect(bass);
  bass.connect(audioCtx.destination);
});

// Update the biquad filter type
var type = document.getElementsByClassName('type')[0];
type.addEventListener('change', function() {
  bass.type = type.value;
});
</script>

The input audio file is connected to a biquad filter with a frequency value of 6,000Hz, a quality factor of 1, and a 10 dB boost. The type of filter can be changed with a drop-down between all eight different filters. This way you get a good idea of the effect of these filters on an audio signal.

Using the getFrequencyResponse() method on this example, we can visualize the filter (see also http://webaudio-io2012.appspot.com/#34). Listing 6-11 shows how to draw a frequency–gain graph.

Listing 6-11. Drawing a Frequency-Gain Graph

<canvas width="600" height="200"></canvas>
<canvas width="600" height="200" style="display: none;"></canvas>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var canvas = document.getElementsByTagName('canvas')[0];
var ctxt = canvas.getContext('2d');
var scratch = document.getElementsByTagName('canvas')[1];
var sctxt = scratch.getContext('2d');

var dbScale = 60;
var width = 512;
var height = 200;
var pixelsPerDb = (0.5 * height) / dbScale;
var nrOctaves = 10;
var nyquist = 0.5 * audioCtx.sampleRate;

function dbToY(db) {
  var y = (0.5 * height) - pixelsPerDb * db;
  return y;
}

function drawAxes() {
  ctxt.textAlign = "center";
  // Draw frequency scale (x-axis).
  for (var octave = 0; octave <= nrOctaves; octave++) {
    var x = octave * width / nrOctaves;
    var f = nyquist * Math.pow(2.0, octave - nrOctaves);
    var value = f.toFixed(0);
    var unit = 'Hz';
    if (f > 1000) {
      unit = 'kHz';
      value = (f/1000).toFixed(1);
    }
    ctxt.strokeStyle = "black";
    ctxt.strokeText(value + unit, x, 20);

    ctxt.beginPath();
    ctxt.strokeStyle = "gray";
    ctxt.lineWidth = 1;
    ctxt.moveTo(x, 30);
    ctxt.lineTo(x, height);
    ctxt.stroke();
  }
  // Draw decibel scale (y-axis).
  for (var db = -dbScale; db < dbScale - 10; db += 10) {
      var y = dbToY(db);
      ctxt.strokeStyle = "black";
      ctxt.strokeText(db.toFixed(0) + "dB", width + 40, y);

      ctxt.beginPath();
      ctxt.strokeStyle = "gray";
      ctxt.moveTo(0, y);
      ctxt.lineTo(width, y);
      ctxt.stroke();
  }
  // save this drawing to the scratch canvas.
  sctxt.drawImage(canvas, 0, 0);
}
</script>

We use two canvases for this so we have a canvas to store the prepared grid and axes. We draw the frequency axis (x-axis) as 10 octaves down from the Nyquist frequency of the audio context. We draw the gain axis (y-axis) from -60 dB to 40 dB. Figure 6-9 shows the result.

9781484204610_Fig06-09

Figure 6-9. A frequency-gain graph

Now, all we need to do is draw the frequency response of our filters into this graph. Listing 6-12 shows the function to use for that.

Listing 6-12. Drawing the Frequency Response of a Biquad Filter

function drawGraph() {
  // grab the axis and grid from scratch canvas.
  ctxt.clearRect(0, 0, 600, height);
  ctxt.drawImage(scratch, 0, 0);

  // grab the frequency response data.
  var frequencyHz = new Float32Array(width);
  var magResponse = new Float32Array(width);
  var phaseResponse = new Float32Array(width);
  for (var i = 0; i < width; ++i) {
    var f = i / width;
    // Convert to log frequency scale (octaves).
    f = nyquist * Math.pow(2.0, nrOctaves * (f - 1.0));
    frequencyHz[i] = f;
  }
  bass.getFrequencyResponse(frequencyHz, magResponse, phaseResponse);

  // draw the frequency response.
  ctxt.beginPath();
  ctxt.strokeStyle = "red";
  ctxt.lineWidth = 3;
  for (var i = 0; i < width; ++i) {
    var response = magResponse[i];
    var dbResponse = 20.0 * Math.log(response) / Math.LN10;
    var x = i;
    var y = dbToY(dbResponse);
    if ( i == 0 ) {
        ctxt.moveTo(x, y);
    } else {
        ctxt.lineTo(x, y);
    }
  }
  ctxt.stroke();
}

First we grab the earlier graph from the scratch canvas and add it to an emptied canvas. Then we prepare the frequency array for which we want to retrieve the response and call the getFrequencyResponse() method on it. Finally, we draw the curve of the frequency response by drawing lines from value to value. For the full example, combine Listings 6-10, 6-11, and 6-12 and call the drawGraph() function in the play event handler (see http://html5videoguide.net).

Figure 6-10 shows the results for the low-pass filter of Listing 6-10.

9781484204610_Fig06-10

Figure 6-10. Low-pass filter frequency response of Listing 6-10

The WaveShaperNode Interface

The WaveShaperNode represents nonlinear distortion effects and is created with the createWaveShaper() method of the AudioContext. Distortion effects create “warm” and “dirty” sounds by compressing or clipping the peaks of a sound wave, which results in a large number of added overtones. This AudioNode uses a curve to apply a waveshaping distortion to the signal.

interface WaveShaperNode : AudioNode {
                attribute Float32Array?  curve;
                attribute OverSampleType oversample;
};

The curve array contains sampled values of the shaping curve.

The oversample parameter specifies what type of oversampling should be applied to the input signal when applying the shaping curve.

enum OverSampleType {
    "none",
    "2x",
    "4x"
};

The default value is “none,” meaning the curve is applied directly to the input samples. A value of “2x” or “4x” can improve the quality of the processing by avoiding some aliasing, with “4x” yielding the highest quality.

Number of inputs: 1

Number of outputs: 1

Channel count mode: “max”

Channel count: as many in the output as are in the input

Channel interpretation: “speakers”

The shaping curve is the important construct to understand here. It maps input sample values in the x-axis interval of [-1; 1] to output values, also between -1 and 1, with a value of 0 at input 0. By default, the curve attribute is null, which means that the WaveShaperNode will apply no modification to the input sound signal.

Creating a good shaping curve is an art form and requires a good understanding of mathematics. Here is a good explanation of how waveshaping works: http://music.columbia.edu/cmc/musicandcomputers/chapter4/04_06.php.

We’ll use y = 0.5x³ as our waveshaper. Figure 6-11 shows its shape.

9781484204610_Fig06-11

Figure 6-11. Waveshaper example

Listing 6-13 shows how to apply this function to a filter graph.

Listing 6-13. Applying a Waveshaper to an Input Signal

var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
var source = audioCtx.createMediaElementSource(mediaElement);

function makeDistortionCurve() {
  var n_samples = audioCtx.sampleRate;
  var curve = new Float32Array(n_samples);
  var x;
  for (var i=0; i < n_samples; ++i ) {
    x = i * 2 / n_samples - 1;
    curve[i] = 0.5 * Math.pow(x, 3);
  }
  return curve;
};

var distortion = audioCtx.createWaveShaper();
distortion.curve = makeDistortionCurve();
distortion.oversample = '4x';

mediaElement.addEventListener('play', function() {
  // create filter graph
  source.connect(distortion);
  distortion.connect(audioCtx.destination);
});

In the makeDistortionCurve() function we create the waveshaping curve by sampling the 0.5x³ function at the sampleRate of the AudioContext. Then we create the waveshaper with that shaping curve and 4x oversampling and put the filter graph together on the audio input file. When playing it back, you will notice how much quieter the sound is—that’s because this particular waveshaper only produces values between -0.5 and 0.5.

The StereoPannerNode Interface

The StereoPannerNode represents a simple stereo panner node that can be used to pan an audio stream left or right and is created with the createStereoPanner() method of the AudioContext.

interface StereoPannerNode : AudioNode {
    readonly    attribute AudioParam pan;
};

It causes a given pan position to be applied to the input data before its propagation to the output.

Number of inputs         1
Number of outputs        1
Channel count mode       "clamped-max"
Channel count            2
Channel interpretation   "speakers"

This node always deals with two channels and the channelCountMode is always “clamped-max.” Connections from nodes with fewer or more channels will be upmixed or downmixed appropriately.

The pan parameter describes the new position of the input in the output’s stereo image.

·     -1 represents full left

·     +1 represents full right

Its default value is 0, and its nominal range is from -1 to 1. This parameter is a-rate.

Pan can be changed over time and thus create the effect of a moving sound source (e.g., from left to right). This is achieved by modifying the gain of the left and right channels.
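The equal-power mapping from a pan value to the two channel gains can be sketched as follows for a mono input. The helper name is ours, not part of the API; it mirrors the calculation the specification describes:

```javascript
// Sketch: equal-power gains a StereoPannerNode applies to a mono input
// for a given pan value in [-1, 1].
function equalPowerGains(pan) {
  var x = (pan + 1) / 2;            // map [-1, 1] to [0, 1]
  return {
    left:  Math.cos(x * Math.PI / 2),
    right: Math.sin(x * Math.PI / 2)
  };
}

// pan = -1: all left; pan = +1: all right; pan = 0: equal power on both.
```

Note that at pan = 0 each channel gets a gain of about 0.707 rather than 1, so the total acoustic power stays constant as the source moves across the stereo image.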

Listing 6-14 shows an example of manipulating the pan position of an audio signal via a slider. Make sure to release the slider so the slider’s value actually changes when you try this for yourself. The filter graph consists of a MediaElementSourceNode, a StereoPannerNode, and an AudioDestinationNode.

Listing 6-14. Manipulating the Pan Position of an Audio Signal

<audio autoplay controls src="audio/Shivervein_Razorpalm.wav"></audio>
<input type="range"  min="-1" max="1" step="0.05" value="0"/>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];

var source = audioCtx.createMediaElementSource(mediaElement);
var panNode = audioCtx.createStereoPanner();
source.connect(panNode);
panNode.connect(audioCtx.destination);

var slider = document.getElementsByTagName('input')[0];
slider.addEventListener('change', function() {
  panNode.pan.value = parseFloat(slider.value);
});
</script>

As you play this back, you might want to use headphones to get a better feeling for how moving the slider affects the stereo position of the signal.

The ConvolverNode Interface

The ConvolverNode represents a processing node, which applies a linear convolution to an AudioBuffer and is created with the createConvolver() method of the AudioContext.

interface ConvolverNode : AudioNode {
                attribute AudioBuffer? buffer;
                attribute boolean      normalize;
};

We can think of the linear convolution as representing the acoustic properties of a room, with the output of the ConvolverNode representing the reverberation of the input signal in that room. The acoustic properties are stored in something called an impulse response.

The AudioBuffer buffer attribute contains a mono, stereo, or four-channel impulse response used by the ConvolverNode to create the reverb effect. It is provided as an audio file itself.

The normalize attribute decides whether the impulse response from the buffer will be scaled by an equal-power normalization. It’s true by default.

Number of inputs         1
Number of outputs        1
Channel count mode       "clamped-max"
Channel count            2
Channel interpretation   "speakers"

It is possible for a ConvolverNode to take mono audio input and apply a two- or four-channel impulse response to result in a stereo audio output signal. Connections from nodes with fewer or more channels will be upmixed or downmixed appropriately, but a maximum of two channels is allowed.
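Mathematically, the node computes a linear convolution of the input signal with the impulse response. The following sketch shows the operation in naive direct form on plain sample arrays; the helper name is ours, and real implementations use much faster FFT-based block convolution:

```javascript
// Sketch: the linear convolution a ConvolverNode performs, written out
// in naive direct form. Each output sample is the sum of past input
// samples weighted by the impulse response.
function convolve(input, impulse) {
  var out = new Float32Array(input.length + impulse.length - 1);
  for (var n = 0; n < out.length; n++) {
    for (var k = 0; k < impulse.length; k++) {
      if (n - k >= 0 && n - k < input.length) {
        out[n] += impulse[k] * input[n - k];
      }
    }
  }
  return out;
}
```

Convolving with a single-sample impulse response of [1] returns the input unchanged, which is why an impulse response recording captures exactly how a room "smears" a single click over time.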

Listing 6-15 shows an example of three different impulse responses applied to an audio file. The filter graph consists of a MediaElementSourceNode, a ConvolverNode, and an AudioDestinationNode.

Listing 6-15. Applying Three Different Convolutions to an Audio Signal

var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
var source = audioCtx.createMediaElementSource(mediaElement);
var convolver = audioCtx.createConvolver();

// Pre-Load the impulse responses
var impulseFiles = [
  "audio/filter-telephone.wav",
  "audio/kitchen.wav",
  "audio/cardiod-rear-levelled.wav"
];
var impulseResponses = new Array();
var allLoaded = 0;

function loadFile(url, index) {
  var request = new XMLHttpRequest();

  function requestData(url) {
    request.open('GET', url, true);
    request.responseType = 'arraybuffer';
    request.send();
  }

  function receivedData() {
    if ((request.status === 200 || request.status === 206)
        && request.readyState === 4) {
      var audioData = request.response;
      audioCtx.decodeAudioData(audioData,
        function(buffer) {
          impulseResponses[index] = buffer;
          if (++allLoaded == impulseFiles.length) {
              createFilterGraph();
          }
        },
        function(error) {
          console.log('Error with decoding audio data');
        }
      );
    }
  }

  request.addEventListener(’load’, receivedData, false);
  requestData(url);
}
for (var i = 0; i < impulseFiles.length; i++) {
  loadFile(impulseFiles[i], i);
}

// create filter graph
function createFilterGraph() {
  source.connect(convolver);
  convolver.buffer = impulseResponses[0];
  convolver.connect(audioCtx.destination);
}

var radioButtons = document.getElementsByTagName('input');
for (var i = 0; i < radioButtons.length; i++) {
  radioButtons[i].addEventListener('click', function() {
    convolver.buffer = impulseResponses[this.value];
  });
}

You’ll notice that we’re loading the three impulse responses via XMLHttpRequest, as in Listing 6-5. We store them in an array, which allows us to switch between them when the user selects a different radio button. We can only put the filter graph together after all three impulse responses have been loaded (i.e., allLoaded == 3).

The HTML for this example has three input elements as radio buttons to switch between the different impulse responses. As you play with the example, you will notice the difference in reverberation between the “telephone,” the “kitchen,” and the “warehouse” impulse responses.

The ChannelSplitterNode and ChannelMergerNode Interfaces

The ChannelSplitterNode and the ChannelMergerNode represent AudioNodes for splitting apart and merging back together the individual channels of an audio stream in a filter graph.

The ChannelSplitterNode is created with the createChannelSplitter() method of the AudioContext, which takes an optional numberOfOutputs parameter specifying the number of outputs to fan out to. It’s 6 by default. Which of the outputs actually carry audio data depends on the number of channels in the input signal of the ChannelSplitterNode. For example, fanning out a stereo signal to six outputs creates only two outputs with a signal; the rest are silent.

The ChannelMergerNode is the opposite of the ChannelSplitterNode and is created with the createChannelMerger() method of the AudioContext, which takes an optional numberOfInputs parameter specifying the number of inputs to fan in. It’s 6 by default, but not all of them need to be connected and not all of them need to contain an audio signal. For example, fanning in six inputs where only the first two carry a stereo audio signal each creates a six-channel stream with the first and second inputs downmixed to mono and the remaining channels silent.

interface ChannelSplitterNode : AudioNode {};
interface ChannelMergerNode   : AudioNode {};

 

                         ChannelSplitterNode                   ChannelMergerNode
Number of inputs         1                                     N (default: 6)
Number of outputs        N (default: 6)                        1
Channel count mode       "max"                                 "max"
Channel count            Fan-out to a number of mono outputs   Fan-in a number of downmixed mono inputs
Channel interpretation   "speakers"                            "speakers"

For ChannelMergerNode, the channelCount and channelCountMode properties cannot be changed—all inputs are dealt with as mono signals.

One application for ChannelSplitterNode and ChannelMergerNode is for doing “matrix mixing” where each channel’s gain is individually controlled.

Listing 6-16 shows an example of matrix mixing on our example audio file. You may want to use headphones to better hear the separate volume control of the left and right channels.

Listing 6-16. Apply Different Gains to Right and Left Channels of an Audio File

<audio autoplay controls src="audio/Shivervein_Razorpalm.wav"></audio>
<p>Left Channel Gain:
  <input type="range"  min="0" max="1" step="0.1" value="1"/>
</p>
<p>Right Channel Gain:
  <input type="range"  min="0" max="1" step="0.1" value="1"/>
</p>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];

var source = audioCtx.createMediaElementSource(mediaElement);
var splitter = audioCtx.createChannelSplitter(2);
var merger = audioCtx.createChannelMerger(2);
var gainLeft = audioCtx.createGain();
var gainRight = audioCtx.createGain();

// filter graph
source.connect(splitter);
splitter.connect(gainLeft, 0);   // output 0 = left channel
splitter.connect(gainRight, 1);  // output 1 = right channel
gainLeft.connect(merger, 0, 0);
gainRight.connect(merger, 0, 1);
merger.connect(audioCtx.destination);

var sliderLeft = document.getElementsByTagName('input')[0];
sliderLeft.addEventListener('change', function() {
  gainLeft.gain.value = parseFloat(sliderLeft.value);
});
var sliderRight = document.getElementsByTagName('input')[1];
sliderRight.addEventListener('change', function() {
  gainRight.gain.value = parseFloat(sliderRight.value);
});
</script>

The example is straightforward: two input sliders individually control the volume of the two gain nodes, one for each channel. The key thing to understand is the use of the second and third parameters of the AudioNode connect() method, which select which output of the ChannelSplitterNode and which input of the ChannelMergerNode to attach.

Figure 6-12 shows the filter graph of the example.


Figure 6-12. Filter graph of volume control separately for left and right channel

The DynamicsCompressorNode Interface

The DynamicsCompressorNode provides a compression effect. It lowers the volume of the loudest parts of the signal to prevent the clipping and distortion that can occur when multiple sounds are played and mixed together. The overall effect is a louder, richer, and fuller sound. The node is created with the createDynamicsCompressor() method of the AudioContext.

interface DynamicsCompressorNode : AudioNode {
    readonly    attribute AudioParam threshold;
    readonly    attribute AudioParam knee;
    readonly    attribute AudioParam ratio;
    readonly    attribute float      reduction;
    readonly    attribute AudioParam attack;
    readonly    attribute AudioParam release;
};

Number of inputs         1
Number of outputs        1
Channel count mode       "explicit"
Channel count            2
Channel interpretation   "speakers"

The threshold parameter provides the decibel value above which the compression will start taking effect. Its default value is -24, with a nominal range of -100 to 0.

The knee parameter provides a decibel value representing the range above the threshold where the curve smoothly transitions to the compressed portion. Its default value is 30, with a nominal range of 0 to 40.

The ratio parameter represents the amount of change in dB needed in the input for a 1 dB change in the output. Its default value is 12, with a nominal range of 1 to 20.

The reduction parameter represents the amount of gain reduction in dB currently applied by the compressor to the signal. If fed no signal the value will be 0 (no gain reduction).

The attack parameter represents the amount of time (in seconds) to reduce the gain by 10 dB. Its default value is 0.003, with a nominal range of 0 to 1.

The release parameter represents the amount of time (in seconds) to increase the gain by 10 dB. Its default value is 0.250, with a nominal range of 0 to 1.

All parameters are k-rate.
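Ignoring the knee and the attack/release smoothing, threshold and ratio define a simple static input/output curve. The following hypothetical helper (not part of the API) shows that mapping in decibels:

```javascript
// Sketch: the static input/output curve of a compressor, ignoring the
// knee and the attack/release smoothing. Decibels in, decibels out.
function compressDb(inputDb, threshold, ratio) {
  if (inputDb <= threshold) return inputDb;         // below threshold: unchanged
  return threshold + (inputDb - threshold) / ratio; // above: ratio dB in per 1 dB out
}
```

With the threshold of -50 dB and ratio of 12 used in Listing 6-17, an input at -26 dB (24 dB over threshold) comes out at -48 dB, only 2 dB over threshold.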

Listing 6-17 shows an example of a dynamic compression.

Listing 6-17. Dynamic Compression of an Audio Signal

<audio autoplay controls src="audio/Shivervein_Razorpalm.wav"></audio>
<p>Toggle Compression: <button value="0">Off</button></p>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
var source = audioCtx.createMediaElementSource(mediaElement);

// Create a compressor node
var compressor = audioCtx.createDynamicsCompressor();
compressor.threshold.value = -50;
compressor.knee.value = 20;
compressor.ratio.value = 12;
// note: reduction is read-only and merely reports the current gain reduction

mediaElement.addEventListener('play', function() {
  source.connect(audioCtx.destination);
});

var button = document.getElementsByTagName('button')[0];
button.addEventListener('click', function() {
  if (this.value == 1) {
    // compression is on: remove the compressor from the graph
    this.value = 0;
    this.innerHTML = "Off";
    source.disconnect(compressor);
    compressor.disconnect(audioCtx.destination);
    source.connect(audioCtx.destination);
  } else {
    // compression is off: insert the compressor into the graph
    this.value = 1;
    this.innerHTML = "On";
    source.disconnect(audioCtx.destination);
    source.connect(compressor);
    compressor.connect(audioCtx.destination);
  }
});
</script>

In the example, the compressor can be added to and removed from the filter graph at the click of a button.

Figure 6-13 plots the compressor used. You can see that below -50 dB no compression is applied. Within the next 20 dB, a smooth transition is made to the compressed curve. The ratio determines how much compression is applied above the threshold; we have chosen a ratio of 12, meaning a 12 dB change in input results in a 1 dB change in output. A ratio of 1 would result in no compression, and the larger the ratio, the quicker the compression curve flattens out.


Figure 6-13. Plot of the dynamic compression used in the example

The overall effect is that the audio signal is reduced in volume, but only in the previously high volume parts, not in the quieter parts.

This concludes our look at AudioNode interfaces that are standard functions for manipulating diverse aspects of the audio signal including gain, dynamics, delay, waveform, channels, stereo position, and frequency filters.

3D Spatialization and Panning

In this section we will look at the three-dimensional positioning of audio signals, which is of particular use in games when multiple signals need to be mixed together differently depending on the position of the listener. The Web Audio API comes with built-in hardware-accelerated positional audio features.

We deal with two constructs to manipulate 3D audio signals: the position of the listener and the PannerNode, a filter node that manipulates a sound’s position relative to the listener. The listener’s position is described by an AudioListener property of the AudioContext, and a PannerNode is created through a factory method that is also part of the AudioContext.

[Constructor] interface AudioContext : EventTarget {
            ...
    readonly    attribute AudioListener        listener;
    PannerNode            createPanner ();
            ...
}

The AudioListener Interface

This interface represents the position and orientation of the person listening to the audio scene.

interface AudioListener {
    void setPosition    (float x, float y, float z);
    void setOrientation (float xFront, float yFront, float zFront,
                         float xUp, float yUp, float zUp);
    void setVelocity    (float x, float y, float z);
};

The AudioContext assumes a 3D right-handed Cartesian coordinate space in which the listener is positioned (see Figure 6-14). By default, the listener is standing at (0, 0, 0).


Figure 6-14. Right-handed Cartesian coordinate space that the listener is standing in

The setPosition() method allows us to change that position. While the coordinates are unitless, typically you would specify position relative to a particular space’s dimensions and use percentage values to specify the position.

The setOrientation() method allows us to change the direction the listener’s ears are pointing in the 3D Cartesian coordinate space. Both a Front position and an Up position are provided. In simple human terms, the Front position represents which direction the person’s nose is pointing and defaults to (0, 0, -1), indicating that the z-direction is relevant for where the ears are pointing. The Up position represents the direction the top of the person’s head is pointing and defaults to (0, 1, 0), indicating that the y-direction is relevant to the person’s height. Figure 6-14 also shows the Front and Up positions.

The setVelocity() method allows us to change the velocity of the listener, which controls both the direction of travel and the speed in 3D space. This velocity relative to an audio source’s velocity can be used to determine how much Doppler shift (pitch change) to apply. The default value is (0, 0, 0), indicating that the listener is stationary.

The units used for this vector are meters/second and are independent of the units used for position and orientation vectors. For example, a value of (0, 0, 17) indicates that the listener is moving in the direction of the z-axis at a speed of 17 meters/second.

Listing 6-18 shows an example of the effect of changing listener position and orientation. By default, the audio source is also positioned at (0, 0, 0).

Listing 6-18. Changing the Listener’s Position and Orientation

<p>Position:
  <input type="range"  min="-1" max="1" step="0.1" value="0" name="pos0"/>
  <input type="range"  min="-1" max="1" step="0.1" value="0" name="pos1"/>
  <input type="range"  min="-1" max="1" step="0.1" value="0" name="pos2"/>
</p>
<p>Orientation:
  <input type="range"  min="-1" max="1" step="0.1" value="0" name="dir0"/>
  <input type="range"  min="-1" max="1" step="0.1" value="0" name="dir1"/>
  <input type="range"  min="-1" max="1" step="0.1" value="-1" name="dir2"/>
</p>
<p>Elevation:
  <input type="range"  min="-1" max="1" step="0.1" value="0" name="hei0"/>
  <input type="range"  min="-1" max="1" step="0.1" value="1" name="hei1"/>
  <input type="range"  min="-1" max="1" step="0.1" value="0" name="hei2"/>
</p>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var source = audioCtx.createBufferSource();

var request = new XMLHttpRequest();
var url = 'audio/ticking.wav';
// requestData() and receivedData() are defined as in Listing 6-5
request.addEventListener('load', receivedData, false);
requestData(url);

var inputs = document.getElementsByTagName('input');
var pos = [0, 0, 0];   // position
var ori = [0, 0, -1];  // orientation
var ele = [0, 1, 0];   // elevation

for (var i = 0; i < inputs.length; i++) {
  var elem = inputs[i];
  elem.addEventListener('change', function() {
    var type = this.name.substr(0,3);
    var index = this.name.slice(3);
    var value = parseFloat(this.value);

    switch (type) {
      case 'pos':
        pos[index] = value;
        audioCtx.listener.setPosition(pos[0], pos[1], pos[2]);
        break;
      case 'ori':
        ori[index] = value;
        audioCtx.listener.setOrientation(ori[0], ori[1], ori[2],
                                         ele[0], ele[1], ele[2]);
        break;
      case 'ele':
        ele[index] = value;
        audioCtx.listener.setOrientation(ori[0], ori[1], ori[2],
                                         ele[0], ele[1], ele[2]);
        break;
      default:
        console.log('no match');
    }
  });
}
</script>

In the example, we’re loading a looping sound into an AudioBuffer using functions introduced in Listing 6-5 and we manipulate the three dimensions of the three parameters.

Image Note  Thanks to Izkhanilov for making the “Ticking Clock.wav” sample available under a Creative Commons license on freesound (see www.freesound.org/people/Izkhanilov/sounds/54848/).

Interestingly, when you replicate our example, you will notice that the parameter changes make no difference to the sound playback. This is because the location of the sound source is not explicitly specified. We believe the AudioContext therefore assumes that the listener and the sound source are co-located. It requires a PannerNode to specify the location of the sound source.

The PannerNode Interface

This interface represents a processing node which positions/spatializes an incoming audio stream in 3D space relative to the listener. It is created with the createPanner() method of the AudioContext.

interface PannerNode : AudioNode {
    void setPosition    (float x, float y, float z);
    void setOrientation (float x, float y, float z);
                attribute PanningModelType  panningModel;
                attribute DistanceModelType distanceModel;
                attribute float             refDistance;
                attribute float             maxDistance;
                attribute float             rolloffFactor;
                attribute float             coneInnerAngle;
                attribute float             coneOuterAngle;
                attribute float             coneOuterGain;
};

One way to think of the panner and the listener is to consider a game environment where the protagonist is running through a 3D space and sounds are coming from all sorts of sources across the scene. Each one of these sources would have a PannerNode associated with it.

PannerNode objects have an orientation vector representing in which direction the sound is projecting. Additionally, they have a sound cone representing how directional the sound is. For example, the sound could be omnidirectional, in which case it would be heard anywhere regardless of its orientation, or it can be more directional and heard only if it is facing the listener. During rendering, the PannerNode calculates an azimuth (the angle of the sound source relative to the listener) and an elevation (the height above or below the listener). The browser uses these values to render the spatialization effect.

Number of inputs         1
Number of outputs        1 (stereo)
Channel count mode       "clamped-max"
Channel count            2 (fixed)
Channel interpretation   "speakers"

The input of a PannerNode is either mono (one channel) or stereo (two channels). Connections from nodes with fewer or more channels will be upmixed or downmixed appropriately. The output of this node is hard-coded to stereo (two channels) and currently cannot be configured.

The setPosition() method sets the position of the audio source relative to the listener. Default is (0, 0, 0).

The setOrientation() method describes which direction the audio source is pointing in the 3D Cartesian coordinate space. Depending on how directional the sound is (controlled by the cone attributes), a sound pointing away from the listener can be very quiet or completely silent. Default is (1, 0, 0).

The panningModel attribute specifies which panning model this PannerNode uses.

enum PanningModelType {
    "equalpower",
    "HRTF"
};

The panning model describes how the sound spatialization is calculated. The “equalpower” model uses equal-power panning, where the elevation is ignored. The “HRTF” (head-related transfer function) model uses convolution with impulse responses measured from human subjects, thus simulating human spatialization perception. The panningModel defaults to “HRTF.”

The distanceModel attribute determines which algorithm will be used to reduce the volume of an audio source as it moves away from the listener.

enum DistanceModelType {
    "linear",
    "inverse",
    "exponential"
};

The “linear” model assumes linear gain reduction as the sound source moves away from the listener. The “inverse” model assumes increasingly smaller gain reduction. The “exponential” model assumes increasingly larger gain reduction. The distanceModel defaults to “inverse.”

The refDistance attribute contains a reference distance for reducing volume as source moves further from the listener. The default value is 1.

The maxDistance attribute contains the maximum distance between source and listener, after which the volume will not get reduced any further. The default value is 10,000.

The rolloffFactor attribute describes how quickly the volume is reduced as the source moves away from the listener. The default value is 1.
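The three distance models correspond to simple gain formulas in the specification. The following sketch (helper name ours) computes the distance gain, falling back to the default refDistance, maxDistance, and rolloffFactor values described above:

```javascript
// Sketch: the three distance gain models, with the spec's defaults of
// refDistance = 1, maxDistance = 10000, rolloffFactor = 1.
function distanceGain(model, d, ref, max, rolloff) {
  ref = ref || 1; max = max || 10000; rolloff = rolloff || 1;
  d = Math.max(d, ref);                      // no boost closer than refDistance
  switch (model) {
    case 'linear':
      d = Math.min(d, max);                  // no further reduction past maxDistance
      return 1 - rolloff * (d - ref) / (max - ref);
    case 'inverse':
      return ref / (ref + rolloff * (d - ref));
    case 'exponential':
      return Math.pow(d / ref, -rolloff);
  }
}
```

For example, with the defaults the "inverse" model drops to a gain of 1/3 at distance 3, while the "exponential" model halves the gain at distance 2.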

The coneInnerAngle, coneOuterAngle, and coneOuterGain together describe a cone inside of which the volume reduction is much lower than outside. There is an inner and an outer cone describing the sound intensity as a function of the source/listener angle from the source’s orientation vector. Thus, a sound source pointing directly at the listener will be louder than if it is pointed off-axis.

Figure 6-15 describes the sound cone concept visually.


Figure 6-15. Visualization of the source cone of a PannerNode in relation to the listener

The coneInnerAngle provides an angle, in degrees, inside of which there will be no volume reduction. The default value is 360, and the value used is modulo 360.

The coneOuterAngle provides an angle, in degrees, outside of which the volume will be reduced to a constant value of coneOuterGain. The default value is 360 and the value is used modulo 360.

The coneOuterGain provides the amount of volume reduction outside the coneOuterAngle. It is a linear gain value, and its default value is 0, meaning silence outside the outer cone.
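The cone gain can be sketched as follows: full volume inside the inner cone, coneOuterGain outside the outer cone, and a transition in between. The helper and its linear interpolation are our simplification of the spec’s algorithm:

```javascript
// Sketch: cone gain given the angle (in degrees) between the source's
// orientation vector and the direction toward the listener.
function coneGain(angle, innerAngle, outerAngle, outerGain) {
  var absAngle = Math.abs(angle);
  var inner = innerAngle / 2, outer = outerAngle / 2;
  if (absAngle <= inner) return 1;               // inside inner cone: full volume
  if (absAngle >= outer) return outerGain;       // outside outer cone: outerGain
  var x = (absAngle - inner) / (outer - inner);  // transition zone: interpolate
  return 1 + x * (outerGain - 1);
}
```

With the cone from Listing 6-19 (inner angle 90, outer angle 180, outer gain 0.5), a listener 30 degrees off-axis hears full volume, one 120 degrees off-axis hears half the volume, and one at 67.5 degrees hears a gain of 0.75.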

Let’s extend the example of Listing 6-18 and introduce a PannerNode. This has the effect of positioning the sound source at a fixed location distinct from the listener position. We further include a sound cone so we can more easily hear the gain reduction effects. See the changes in Listing 6-19.

Listing 6-19. Introducing a Position and Sound Cone for the Sound Source

var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var source = audioCtx.createBufferSource();
var panner = audioCtx.createPanner();
panner.coneOuterGain = 0.5;
panner.coneOuterAngle = 180;
panner.coneInnerAngle = 90;
source.connect(panner);
panner.connect(audioCtx.destination);

Now, as we change the position, orientation, and elevation, we move the listener relative to the sound source, which remains in the middle of the sound scene. For example, as we move the x-value of the position to the right, the sound moves to our left—we have sidestepped the sound to our right. Once we have the sound to our left, as we move the z-value of the orientation between -0.1 and +0.1, the sound moves between left and right—at 0 we are facing the sound, at +0.1 we have turned our right side to it, at -0.1 we have turned our left side to it.

Notice how we haven’t actually moved the location of the sound source yet, but only the position of the listener. You use the setPosition() and setOrientation() methods of the PannerNode for that, and you can do this with multiple sounds. We’ll leave this as an exercise for the reader.

Image Note  The Web Audio API specification used to provide a setVelocity() method for the PannerNode which would calculate the Doppler shift for moving sound sources and listener. This has been deprecated and will be removed from Chrome after version 45. There is a plan to introduce a new SpatializerNode interface to replace this. For now, you will need to calculate Doppler shifts yourself, possibly using the DelayNode interface or changing the playbackRate.
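Until a replacement lands, a manual Doppler correction can be approximated with the classic formula and applied via an AudioBufferSourceNode’s playbackRate. This is a sketch under the simplifying assumption of movement along the straight line between source and listener; the names are ours:

```javascript
// Sketch: a Doppler factor you could apply to playbackRate yourself.
// Positive velocities mean source and listener are moving toward each other.
var SPEED_OF_SOUND = 343; // meters/second in air

function dopplerFactor(listenerVelocity, sourceVelocity) {
  return (SPEED_OF_SOUND + listenerVelocity) /
         (SPEED_OF_SOUND - sourceVelocity);
}

// e.g. source.playbackRate.value = dopplerFactor(0, velocityTowardListener);
```

A factor above 1 raises the pitch (approaching), a factor below 1 lowers it (receding), and stationary source and listener leave the pitch unchanged.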

JavaScript Manipulation of Audio Data

The current state of implementation of the Web Audio API specification includes an interface called ScriptProcessorNode, which can generate, process, or analyze audio directly in JavaScript. This node type is deprecated, with the aim of replacing it with the AudioWorker interface, but we will still explain it because it is implemented in current browsers while the AudioWorker interface isn’t.

The difference between the ScriptProcessorNode and the AudioWorker interface is that the former runs on the main browser thread and therefore has to share processing time with layout, rendering, and most of the other processing going on in the browser. All other audio nodes of the Web Audio API run on a separate thread, which makes it more likely that the audio plays uninterrupted by other big tasks. This will change with the AudioWorker, which will run the JavaScript audio processing on the audio thread, too. It will be able to run with less latency, because it avoids crossing thread boundaries and sharing resources with the main thread.

This all sounds great, but for now we cannot use the AudioWorker and will therefore look at the ScriptProcessorNode first.

The ScriptProcessorNode Interface

This interface allows writing your own JavaScript code to generate, process, or analyze audio and integrate it into a filter graph.

A ScriptProcessorNode is created through a createScriptProcessor() method on the AudioContext:

[Constructor] interface AudioContext : EventTarget {
...
    ScriptProcessorNode createScriptProcessor(
                 optional unsigned long bufferSize = 0 ,
                 optional unsigned long numberOfInputChannels = 2 ,
                 optional unsigned long numberOfOutputChannels = 2 );
...
}

It takes only optional parameters, and it is recommended to let the browser choose their values. Here is what they mean:

·     the bufferSize, in units of sample frames, must be one of 256, 512, 1024, 2048, 4096, 8192, or 16384. This controls how frequently the audioprocess event is dispatched and how many sample frames are processed in each call.

·     the numberOfInputChannels defaults to 2 but can be up to 32.

·     the numberOfOutputChannels defaults to 2 but can be up to 32.
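The latency consequence of the bufferSize choice is easy to quantify: the audioprocess event fires roughly every bufferSize / sampleRate seconds. A quick sketch (helper name ours):

```javascript
// Sketch: how often onaudioprocess fires for a given bufferSize. Larger
// buffers mean fewer, bigger callbacks: more latency but fewer glitches.
function callbackIntervalMs(bufferSize, sampleRate) {
  return bufferSize / sampleRate * 1000;
}
```

At a 44100 Hz context, a 4096-sample buffer fires roughly every 93 ms, while a 256-sample buffer fires about every 6 ms but leaves the script very little time to finish its work.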

The interface of the ScriptProcessorNode is defined as follows:

interface ScriptProcessorNode : AudioNode {
                attribute EventHandler onaudioprocess;
    readonly    attribute long         bufferSize;
};

The bufferSize attribute reflects the buffer size at which the node was created and the onaudioprocess associates a JavaScript event handler with the node, which is being called when the node is activated. The event that the handler receives is an AudioProcessingEvent.

interface AudioProcessingEvent : Event {
    readonly    attribute double      playbackTime;
    readonly    attribute AudioBuffer inputBuffer;
    readonly    attribute AudioBuffer outputBuffer;
};

It contains the following read-only data:

·     a playbackTime, which is the time when the audio will be played in the same time coordinate system as the AudioContext’s currentTime.

·     an inputBuffer, which contains the input audio data with a number of channels equal to the numberOfInputChannels parameter of the createScriptProcessor() method.

·     an outputBuffer, where the output audio data of the event handler is to be saved. It must have a number of channels equal to the numberOfOutputChannels parameter of the createScriptProcessor() method.

The ScriptProcessorNode does not change its channels or number of inputs.

Number of inputs         1
Number of outputs        1
Channel count mode       "explicit"
Channel count            Number of input channels
Channel interpretation   "speakers"

A simple example use of the ScriptProcessorNode is to add some random noise to the audio samples. Listing 6-20 shows an example of this.

Listing 6-20. Adding Random Noise to an Audio File in a ScriptProcessorNode

<audio autoplay controls src="audio/ticking.wav"></audio>

<script>
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
var source = audioCtx.createMediaElementSource(mediaElement);

var noiser = audioCtx.createScriptProcessor();
source.connect(noiser);
noiser.connect(audioCtx.destination);

noiser.onaudioprocess = function(event) {
  var inputBuffer = event.inputBuffer;
  var outputBuffer = event.outputBuffer;

  for (var channel=0; channel < inputBuffer.numberOfChannels; channel++) {
    var inputData = inputBuffer.getChannelData(channel);
    var outputData = outputBuffer.getChannelData(channel);

    for (var sample = 0; sample < inputBuffer.length; sample++) {
      outputData[sample] = inputData[sample] + (Math.random() * 0.01);
    }
  }
};
</script>

As we copy the input data to the output data array, we add 1% of a random number between 0 and 1 to each input sample, thus creating an output signal overlaid with some white noise.
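The per-sample work done in the onaudioprocess handler is ordinary array math and can be factored into a pure function for testing outside the filter graph. This is a sketch; addNoise is our own helper name, not part of the Web Audio API.

```javascript
// Copy input samples to output, overlaying white noise of up to `amount`
// per sample (0.01 gives the 1% noise level used in Listing 6-20).
function addNoise(inputData, outputData, amount) {
  for (var sample = 0; sample < inputData.length; sample++) {
    outputData[sample] = inputData[sample] + Math.random() * amount;
  }
}
```

Inside the handler this would be called once per channel, with inputBuffer.getChannelData(channel) and outputBuffer.getChannelData(channel) as the two arrays.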

Image Note  You might want to check out the new AudioWorker interface in the specification and how it replaces the ScriptProcessorNode. We can’t describe it here, because at the time of this writing, it was still changing on a daily basis. The principle will be to create a JavaScript file containing the script that an AudioWorkerNode is supposed to execute, then call a createAudioWorker() function on the existing AudioContext to hand this script to a Worker, which executes it in a separate thread. There will be events raised between the AudioWorkerNode and the AudioContext to deal with state changes in each thread and an ability to provide AudioParams to the AudioWorkerNode.

Offline Audio Processing

The OfflineAudioContext interface is a type of AudioContext that doesn’t render the audio output of a filter graph to the device hardware but to an AudioBuffer. This allows processing of audio data potentially faster than real time and is really useful if all you’re trying to do is analyze the content of your audio stream (e.g. when detecting beats).

[Constructor(unsigned long numberOfChannels,
             unsigned long length,
             float sampleRate)]
interface OfflineAudioContext : AudioContext {
               attribute EventHandler oncomplete;
    Promise<AudioBuffer> startRendering ();
};

The construction of an OfflineAudioContext works similarly to when you create a new AudioBuffer with the AudioContext’s createBuffer() method and takes the same three parameters.

·     The numberOfChannels attribute contains the number of discrete channels that the AudioBuffer should have.

·     The length attribute contains the length of the audio asset in sample-frames.

·     The sampleRate attribute contains the sample rate of the audio asset.
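The length argument is therefore just the buffer duration in seconds multiplied by the sample rate. As a sketch (offlineLength is a hypothetical helper of our own, not part of the API):

```javascript
// The `length` constructor argument is the buffer duration in
// sample-frames, i.e. seconds times sample rate.
function offlineLength(seconds, sampleRate) {
  return Math.ceil(seconds * sampleRate);
}

// A 20-second stereo context at 44,100 Hz, as used in Listing 6-21:
// new OfflineAudioContext(2, offlineLength(20, 44100), 44100);
```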

The OfflineAudioContext provides an oncomplete event handler, which is called when processing has finished.

It also provides a startRendering() method. When an OfflineAudioContext is created, it is in a “suspended” state. A call to this function kicks off the processing of the filter graph.

A simple example use of the OfflineAudioContext is to grab audio data from an audio file into an OfflineAudioContext without disturbing the general AudioContext, which may be doing some other work at the time. Listing 6-21 shows how this can be done by adjusting Listing 6-5.

Listing 6-21. Decoding an Audio File in an OfflineAudioContext for Later Playback

// AudioContext that decodes data
var offline = new (window.OfflineAudioContext ||
                   window.webkitOfflineAudioContext)(2, 44100*20, 44100);
var source = offline.createBufferSource();
var offlineReady = false;

// AudioContext that renders data
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var sound;

var audioBuffer;

var request = new XMLHttpRequest();
var url = 'audio/transition.wav';

function receivedData() {
  if ((request.status === 200 || request.status === 206)
      && request.readyState === 4) {
    var audioData = request.response;
    offline.decodeAudioData(audioData,
      function(buffer) {
        source.buffer = buffer;
        source.connect(offline.destination);
        source.start(0);
        offlineReady = true;
      },
      function(error) {
        console.log("Error with decoding audio data: " + error);
      }
    );
  }
}

request.addEventListener('load', receivedData, false);
// requestData() issues the XHR for the audio file; it is defined in Listing 6-5
requestData(url);

function startPlayback() {
  sound = audioCtx.createBufferSource();
  sound.buffer = audioBuffer;
  sound.connect(audioCtx.destination);
  sound.start(0);
}

var stop = document.getElementsByTagName('button')[0];
stop.addEventListener('click', function() {
    sound.stop();
});

var start = document.getElementsByTagName('button')[1];
start.addEventListener('click', function() {
  if (!offlineReady) return;
  offline.startRendering().then(function(renderedBuffer) {
    audioBuffer = renderedBuffer;
    startPlayback();
  }).catch(function(err) {
    // audioData has already been rendered
    startPlayback();
  });
});

We’ve added a second button to the example of Listing 6-5 and are now starting the audio file manually. After the audio file has been downloaded, the offline context decodes it and starts rendering when we click the start button. In the rendering routine, we save the decoded AudioBuffer data so we can reload it at a later stage. It’s this AudioBuffer data that we hand over to the live AudioContext for playback.

Audio Data Visualization

The final interface that we need to understand is the AnalyserNode interface. This interface represents a node that is able to provide real-time frequency and time-domain sample information. These nodes make no changes to the audio stream, which is passed straight through. They can therefore be placed anywhere in the filter graph. A major use of this interface is for visualizing the audio data.

An AnalyserNode is created through a createAnalyser() method on the AudioContext:

[Constructor] interface AudioContext : EventTarget {
...
    AnalyserNode    createAnalyser ();
...
}

The interface of the AnalyserNode is defined as follows:

interface AnalyserNode : AudioNode {
                attribute unsigned long fftSize;
    readonly    attribute unsigned long frequencyBinCount;
                attribute float         minDecibels;
                attribute float         maxDecibels;
                attribute float         smoothingTimeConstant;
    void getFloatFrequencyData  (Float32Array array);
    void getByteFrequencyData   (Uint8Array array);
    void getFloatTimeDomainData (Float32Array array);
    void getByteTimeDomainData  (Uint8Array array);
};

The attributes contain the following information:

·     fftSize: the buffer size used for the analysis. It must be a power of 2 in the range 32 to 32,768 and defaults to 2,048.

·     frequencyBinCount: a fixed value at half the FFT (Fast Fourier Transform) size.

·     minDecibels, maxDecibels: the power value range for scaling the FFT analysis data for conversion to unsigned byte values. The default range is from minDecibels = -100 to maxDecibels = -30.

·     smoothingTimeConstant: a value between 0 and 1 that represents the size of a sliding window that smooths results. A 0 represents no time averaging and therefore strongly fluctuating results. The default value is 0.8.
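The byte conversion controlled by minDecibels and maxDecibels can be sketched as a pure function. dbToByte is our own name; the scaling follows the specification’s description of getByteFrequencyData().

```javascript
// Map an FFT power value in dB to an unsigned byte: values at or below
// minDecibels become 0, values at or above maxDecibels become 255.
function dbToByte(db, minDecibels, maxDecibels) {
  var scaled = 255 / (maxDecibels - minDecibels) * (db - minDecibels);
  return Math.max(0, Math.min(255, Math.floor(scaled)));
}
```

With the defaults of -100 and -30, a -100 dB bin maps to 0 and a -30 dB bin maps to 255; anything outside that range is clamped.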

The methods copy the following data into the provided array:

·     getFloatFrequencyData, getByteFrequencyData: the current frequency data in different data types. If the array has fewer elements than the frequencyBinCount, the excess elements will be dropped. If the array has more elements than the frequencyBinCount, the excess elements will be ignored.

·     getFloatTimeDomainData, getByteTimeDomainData: the current time-domain (waveform) data. If the array has fewer elements than the value of fftSize, the excess elements will be dropped. If the array has more elements than fftSize, the excess elements will be ignored.

The AnalyserNode does not change its channels or number of inputs and the output may be left unconnected:

·     Number of inputs: 1

·     Number of outputs: 1

·     Channel count mode: “max”

·     Channel count: 1

·     Channel interpretation: “speakers”

Listing 6-22 shows a simple example of rendering the waveform into a canvas.

Listing 6-22. Rendering Waveform Data of an AudioContext

<audio autoplay controls src="audio/ticking.wav"></audio>
<canvas width="512" height="200"></canvas>

<script>
// prepare canvas for rendering
var canvas = document.getElementsByTagName("canvas")[0];
var sctxt = canvas.getContext("2d");
sctxt.fillRect(0, 0, 512, 200);
sctxt.strokeStyle = "#FFFFFF";
sctxt.lineWidth = 2;

// prepare audio data
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
var mediaElement = document.getElementsByTagName('audio')[0];
var source = audioCtx.createMediaElementSource(mediaElement);

// prepare filter graph
var analyser = audioCtx.createAnalyser();
analyser.fftSize = 2048;
analyser.smoothingTimeConstant = 0.1;
source.connect(analyser);
analyser.connect(audioCtx.destination);

// data from the analyser node
var buffer = new Uint8Array(analyser.frequencyBinCount);

function draw() {
  analyser.getByteTimeDomainData(buffer);

  // do the canvas painting
  var width = canvas.width;
  var height = canvas.height;
  var step = Math.floor(buffer.length / width);
  sctxt.fillRect(0, 0, width, height);
  sctxt.drawImage(canvas, 0, 0, width, height);
  sctxt.beginPath();
  sctxt.moveTo(0, buffer[0] * height / 256);
  for(var i=1; i< width; i++) {
    sctxt.lineTo(i, buffer[i*step] * height / 256);
  }
  sctxt.stroke();
  window.requestAnimationFrame(draw);
}
mediaElement.addEventListener('play', draw, false);
</script>

We make use of a canvas into which the wave will be rendered and prepare it with a black background and white drawing color. We instantiate the AudioContext and the audio element for sample input, prepare the analyser and hook it all up to a filter graph.

Once the audio element starts playback, we start the drawing, which grabs the waveform bytes from the analyser. These are exposed through the getByteTimeDomainData() method, which fills a provided Uint8Array. We take this array, clear the canvas of the previous drawing, and draw the new array into the canvas as a line connecting all the values. Then we schedule draw() again via requestAnimationFrame() to grab the next unsigned 8-bit byte array for display. This successively paints the waveform into the canvas.
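The geometry of that drawing loop can be factored out into a pure helper: for each canvas column it samples the buffer at a fixed stride and scales the 8-bit value (0–255) to the canvas height. This is a sketch; waveformPoints is a hypothetical name of ours, not part of Listing 6-22.

```javascript
// Compute the (x, y) points the draw() loop paints, one per canvas column.
// A byte value of 128 (silence) lands on the vertical center of the canvas.
function waveformPoints(buffer, width, height) {
  var step = Math.floor(buffer.length / width);
  var points = [];
  for (var i = 0; i < width; i++) {
    points.push({ x: i, y: buffer[i * step] * height / 256 });
  }
  return points;
}
```

Separating the arithmetic from the canvas calls like this makes the scaling easy to verify without a browser.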

An alternative, more traditional approach to requestAnimationFrame() would have been to use the setTimeout() function with a 0 timeout. We recommend requestAnimationFrame() for all drawing purposes going forward, because it is built for rendering and properly schedules the drawing at the next possible screen repaint opportunity.

Figure 6-16 shows the result of running Listing 6-22.


Figure 6-16. Rendering the audio waveform in the Web Audio API

This concludes our exploration of the Web Audio API.