Beginning HTML5 Media. Make the most of the new video and audio standards for the Web (2015)



Accessibility, Internationalization, and Navigation

Accessibility and internationalization are two aspects of usability. The first, accessibility, serves those who have some form of sensory or physical impairment such as blindness. The second, internationalization, serves those who don't speak the language used by the audio or the video file.

Since the mid-1990s, the Web has developed a vast set of functionalities to cope with the extra requirements of these users. Web sites present themselves in multiple languages, and screen readers or Braille devices give vision-impaired users access to web page content. Captioning of video, especially foreign-language video, has become virtually ubiquitous, and the use of the @alt attribute for images has long been a best practice.

The introduction of audio and video into the HTML5 specification poses new accessibility challenges and needs to extend this best practice. For the first time, we are publishing audio content that needs to be made accessible to hearing-impaired users and/or users who do not speak the language used in the audio data. We are also publishing, for the first time, HTML imaging content that changes over time which needs to be made accessible to vision-impaired users.

Note  That last sentence may seem a bit out of place, but video really is nothing more than a series of still images (keyframes separated by delta frames) that change over time.

What we must never forget is the word “World” in the term “World Wide Web.” Unlike media broadcasters who can pick and choose their audiences, our audience is composed of a polyglot of able and disabled people as well as varying cultures and languages. Everyone who accesses your video or audio content has just as much a right to have access to it as anyone else and you don’t get to pick and choose who views your content.

The primary means of addressing such needs has been the development of so-called alternative content technologies—or alt content—in which users are offered content that gives an alternative representation of the original content in a format they are able to consume. The practice of providing alt content was formalized in 1995 with the introduction of the @alt attribute in HTML 2 and has been a fixture of the specification since then.

When the W3C decided to introduce the <audio> and <video> tags into HTML5, the question of alt content became a major concern. For example, there are a number of choices for a simple video with a voiceover. They include the following:

·     Captions, which are alt content for the audio track for hearing-impaired users.

·     Subtitles, which are alt content for the audio track for foreign language users.

·     Video descriptions, which are alt content of the video track for vision-impaired users.

When publishing media content, all these alt choices should be published, too, so you don’t leave any audience members behind. Don’t regard it as a chore: alt content is generally useful as additional content, for example, in the case of subtitles or chapter markers for video, which help any user keep track of what is being spoken and navigate to useful locations within the video file.

In this chapter, we discuss the features offered by HTML5 to satisfy the accessibility and internationalization needs of media users. We start with a requirements analysis, providing an overview of alternative content technologies for media content, and then introduce the features that HTML5 offers to satisfy those requirements.

Note  The creation of alt content for videos has important implications for all users on the Web, not just those with special needs or non-native speakers. The biggest advantage is that text representing exactly what is happening in the video is made available, and this text is the best means for searches to take place. Keep in mind that search technology is very advanced when it comes to text but still quite restricted when it comes to the content of audio or video. For this reason, alt content provides the only reliable means of indexing audiovisual content for high-quality search.

Alternative Content Technologies

Throughout this book we have presented the various techniques that can be used to add audio and video assets to HTML5 documents in the form of "Feature-Example-Demonstration." Before we follow that pattern here, we will look at the many issues that web designers and developers confront in their efforts to cope with the growing demands of accessibility and internationalization.

The first issue to be confronted is legislative in nature.

More and more countries are passing accessibility laws regarding the Web. For example, in the United States, Section 504 of the 1973 Rehabilitation Act was the first civil rights legislation designed to protect the disabled from discrimination based on their disability status. Though the Internet, let alone personal computing, didn't exist at the time, the law applied to any employer or organization that received federal funds, which included government agencies, educational institutions from K-12 to post-secondary, and any other federally funded project. In 1998, when the Internet boom was under way, Section 508 of the Reauthorized Rehabilitation Act created binding and enforceable standards clearly outlining and specifying what is meant by "accessible" electronic and information technology products. The upshot is that any web project developed for a company or organization receiving federal funds in the United States has to comply with Section 508. If you are unfamiliar with Section 508, a good overview is available online.

Though accessibility policies vary from country to country, most countries, including those in the European Union, have adopted standards based on the W3C's Web Content Accessibility Guidelines (WCAG), a set of standardized rules developed by the W3C to explain how to make web content accessible. WCAG was developed because, increasingly, it wasn't only governments that were wrestling with the issue of accessibility but all major web content publishing sites. If you are unfamiliar with the W3C's Web Accessibility Initiative (WAI), more information is available on the W3C web site. What you do need to know is that many legislative measures are based on this group's work.

The next issue that challenges web developers is the diversity of user requirements for alt content around audio and video, which is quite complex. If you want to learn more about media accessibility requirements, there is a W3C document on the subject, published by WAI and coauthored by one of the authors of this book.

Vision-Impaired Users

For users with poor or no vision, there are two major challenges: how to perceive the visual content of the video, and how to interact with and control media elements.

Perceiving Video Content

The method developed to aid vision-impaired users in consuming the imagery content of video is Described Video. In this approach, a description of what is happening in the video is made available as the video plays and the audio continues. The following approaches are possible:

·     Audio descriptions: a speaker explains what is visible in the video as the video progresses.

·     Text descriptions: time-synchronized blocks of text are provided in time with what is happening on screen and a screen reader synthesizes this to speech for the vision-impaired user.

Note  It may be necessary to introduce pauses into the video to allow the insertion of extra explanations for which there is no time within the main audio track. This extends the time it takes to consume the video; such descriptions are therefore called extended descriptions.

Audio descriptions can either be created as a separate audio recording that is added to the main audio track, or be mixed into the main audio recording, in which case they cannot be extracted again.

Such mixed-in audio description tracks can be provided as part of a multitrack video resource as long as they don’t extend the video’s timeline. One would create one audio track without the audio descriptions and a separate mixed-in track with the audio descriptions, so they can be activated as alternatives to each other.

When extended descriptions are necessary, mixed-in audio descriptions require a completely separate video resource, because they work on a different timeline. The production effort involved in creating such a new video file with mixed-in audio descriptions is, however, enormous. Therefore, mixing in should only be used when there is no alternative means of providing described video.

Text descriptions are always provided as additional content to be activated on demand.

From a technical viewpoint, there are two ways of publishing described video.

·     In-band: Audio or text descriptions are provided as a separate track in the media resource, to be activated either as an addition to the main audio track or as an alternative to it. Text descriptions are always activated in addition to the main audio track. Audio descriptions that are a separate recording are also activated additionally, while mixed-in descriptions are an alternative to the main audio track. This is quite similar to how descriptive audio is traditionally provided through secondary audio programming (SAP).

·     External: Audio or text descriptions are provided as a separate resource. When the timelines match, HTML5 provides markup mechanisms to link the media resources together. The tracks in the external resource are then handled as “out-of-band” versions of the “in-band” tracks and HTML5 provides the same activation mechanisms as for in-band tracks. Browsers handle the download, interpretation, and synchronization of the separate resources during playback.

The multitrack media API (application programming interface) of HTML5 deals with both in-band and out-of-band audio and text description tracks. In-band captions are supported by Apple in Safari and QuickTime.
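As a sketch of how a description track might be selected programmatically, the following helper picks a track of a given kind (and optionally language) from an array-like track list such as the ones the HTML5 audioTracks and textTracks interfaces expose. The findTrack name and the sample track data are our own invention for illustration.

```javascript
// a minimal sketch: find a track by kind and (optionally) language in an
// array-like track list with kind/language/label properties
function findTrack(trackList, kind, language) {
  for (var i = 0; i < trackList.length; i++) {
    if (trackList[i].kind === kind &&
        (!language || trackList[i].language === language)) {
      return trackList[i];
    }
  }
  return null; // no matching track available
}

// sample data standing in for what a multitrack resource might expose
var tracks = [
  { kind: "main",         language: "en", label: "Main audio" },
  { kind: "descriptions", language: "en", label: "Audio description" }
];
```

With a real video element, usage might look like `var desc = findTrack(video.textTracks, "descriptions", "en"); if (desc) desc.mode = "showing";` to activate a text description track.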

Interacting with Content

Vision-impaired users need to interact with described video in several ways.

·     Activate/Deactivate Descriptions. Where described video is provided through in-band tracks or external resources, the browsers can automatically activate or deactivate description tracks based on user needs specified in user preference settings in browsers. Explicit user control should also be available through interactive controls such as a menu of all available tracks and their activation status. Currently, browsers don’t provide such preference settings or menus for video descriptions. They can, however, be developed in JavaScript.

·     Navigate Within and into Media. Since audiovisual content is a major source of information for vision-impaired users, navigation within and into that content is very important. Sighted users often navigate through video by clicking on time offsets on a playback progress bar. This direct access functionality also needs to be available to vision-impaired users. Jumping straight into temporal offsets or into semantically meaningful sections of content helps the consumption of the content enormously. In addition, a more semantic means of navigating the content along structures such as chapters, scenes, or acts must also be available. Media fragment URIs (uniform resource identifiers) and WebVTT (Web Video Text Tracks) chapters provide for the direct access functionality.
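To illustrate the direct-access idea, the following sketch builds a Media Fragment URI of the form video.webm#t=start,end from a list of chapter start times. The chapterFragmentURI helper and the chapter data are hypothetical; in practice the chapters could come from a WebVTT chapter track.

```javascript
// sketch: build a Media Fragment URI that starts playback at a chapter
function chapterFragmentURI(src, chapters, title) {
  for (var i = 0; i < chapters.length; i++) {
    if (chapters[i].title === title) {
      // end the fragment where the next chapter starts, if there is one
      var end = (i + 1 < chapters.length) ? "," + chapters[i + 1].start : "";
      return src + "#t=" + chapters[i].start + end;
    }
  }
  return src; // unknown chapter: play from the beginning
}

// hypothetical chapter data for the Elephant's Dream example
var chapters = [
  { start: 0,  title: "Opening titles" },
  { start: 15, title: "Proog and Emo" },
  { start: 45, title: "The machine" }
];

var uri = chapterFragmentURI("video/ElephantDreams.webm", chapters,
                             "Proog and Emo");
// → "video/ElephantDreams.webm#t=15,45"
// uri can now be assigned to video.src to start playback at that chapter
```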

Hard-of-Hearing Users

For users who have trouble hearing, the content of the audio track needs to be made available in a form other than audio. Captions, transcripts, and sign language translations have traditionally been used as alternatives. In addition, improvements to the played-back audio can help hard-of-hearing users who are not completely deaf to grasp the content of the audio.

For captions, we distinguish between the following:

·     Traditional captions: blocks of text are synchronized with what is happening on screen and displayed time-synchronously with the video. Often they are overlaid at the bottom of the video viewport, sometimes placed elsewhere in the viewport to avoid overlapping other on-screen text, and sometimes placed underneath the viewport to avoid any overlap at all. Styling is usually minimal, just enough to keep the text readable: appropriate fonts and colors, and a means of separating the text from the video colors through, for example, text outlines or a text background. Some captioned videos introduce color coding for speakers, speaker labeling, and/or positioning of the text close to the speakers on screen to further improve cognition and reading speed. HTML5 has introduced text tracks of kind "captions" and WebVTT to render captions in browsers.

·     Enhanced captions: in the modern web environment, captions can be so much more than just text. Animated and formatted text can be displayed in captions. Icons can be used to convey meaning—for example, separate icons for different speakers or sound effects. Hyperlinks can be used to link on-screen URLs to actual web sites or to provide links to further information making it easier to use the audiovisual content as a starting point for navigation. Image overlays can be used in captions to allow displaying timed images with the audiovisual content. To enable this use, general HTML markup is desirable in captions. It is possible to do so using WebVTT with text tracks of the kind “metadata.”
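A minimal sketch of the enhanced-caption idea: the helper below selects the HTML payload to show in an overlay from the currently active cues, which is what a cuechange handler on a text track of kind "metadata" would work with. The overlayHTML name is our own, chosen for illustration.

```javascript
// sketch: pick the HTML to show in an overlay from a list of active cues
// (mirrors what a "cuechange" handler on a metadata track receives)
function overlayHTML(activeCues) {
  return activeCues.length ? activeCues[0].text : "";
}

// in a real page this would be wired up roughly as follows, assuming a
// <track kind="metadata"> and an overlay <div> positioned over the video:
//   track.mode = "hidden";  // fire cue events without native rendering
//   track.addEventListener("cuechange", function() {
//     overlay.innerHTML = overlayHTML(this.activeCues);
//   });
```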

For now, we’ll concentrate on traditional captions.

Such captions are always authored as text but are sometimes added directly to the video imagery. This technique is called burnt-in captions, or "open captions," because they are always active and open for everyone to see. Traditionally, this approach has been used to deliver captions on TV and in cinemas because it doesn't require any additional technology to be reproduced. It is, however, very inflexible. On the Web, this approach is discouraged, since it is easy to provide captions as text. Only legacy content, where video without the burnt-in captions is not available, should be published in this way. At best, video with burnt-in captions should be made available as a separate track in a multitrack media file, or as a separate stream in a MediaSource, such that users can choose between the video track with captions and the one without.

From a technical viewpoint, there are two ways of publishing captions.

·     In-band: burnt-in caption tracks or text captions are provided as a separate track in the media resource. This allows independent activation and deactivation of the captions. It requires web browsers to support handling of multitrack video.

·     External: text captions are provided as a separate resource and linked to the media resource through HTML markup. Similar to separate tracks, this allows independent activation and deactivation of the captions. It requires browsers to download, interpret, and synchronize the extra resource to the main resource during playback. This is supported in HTML5 through text tracks and WebVTT files.

If you have the choice, publish captions as separate external files because it is much simpler to edit them again later and do other text analysis on them than when they are mashed up with media data.
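As a sketch of the external approach, a captions file can be linked with a `<track>` element as follows; the file name captions.vtt and the cue timings are illustrative only.

```html
<video poster="img/ElephantDreams.png" controls>
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <track kind="captions" src="captions.vtt" srclang="en"
         label="English" default>
</video>
```

The referenced WebVTT file is plain text with timed cues, for example:

```
WEBVTT

00:00:15.000 --> 00:00:20.000
Proog: At the left we can see... At the right we can see the...
the head-snarlers.
```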

While we have covered the most common use cases, we must not forget that there are also cases for people with cognitive disabilities (e.g., dyslexia) or for learners who require any of these alternative content technologies.


Full-text transcripts of the audio track of audiovisual resources are another means of making this content accessible to hard-of-hearing users and, in fact, to anyone. It can be more efficient to read, or cross-read, a transcript of an audio or video resource than to sit through its full extent. One particularly good example is a site called Metavid, which has full transcripts of US Senate proceedings and is fully searchable.

Two types of transcripts are typically used.

·     Plain transcripts: these are the equivalent of captions but brought together in a single block of text. This block of text can be presented simply as text on the web page somewhere around the video or as a separate resource provided through a link near the video.

·     Interactive transcripts: these are also equivalent to captions but brought together in a single block of text with a tighter relationship between the text and video. The transcript continues to have time-synchronized blocks such that a click on a specific text cue will navigate the audiovisual resource to that time offset. Also, as the video reaches the next text cue, the transcript will automatically move the new text cue center stage, for example, by making sure it scrolls to a certain on-screen location and/or is highlighted.

Incidentally, the latter type of interactive transcript is also useful for vision-impaired users as a navigation aid when used in conjunction with a screen reader. It is, however, necessary to mute the audiovisual content while navigating through the interactive transcript; otherwise it will compete with the sound from the screen reader and make both unintelligible.

Sign Translation

For hard-of-hearing users, in particular deaf users, sign language is often the language in which they are most proficient, followed by the written language of the country in which they live. They often communicate much faster and more comprehensively in sign language, which, much like Mandarin and similar languages, typically uses a single sign for a semantic entity. Signs exist for individual letters, too, but spelling words out letter by letter is very slow and only used in exceptional circumstances. Sign language is the fastest and most expressive means of communication among hard-of-hearing users.

From a technical viewpoint, there are three ways of realizing sign translation.

·     Mixed-in: sign translation that is mixed into the main video track of the video can also be called burnt-in sign translation, or “open sign translation,” because it is always active and open for everyone to see. Typically, open sign translation is provided as a picture-in-picture (PIP) display, where a small part of the video viewport is used to burn in the sign translation. Traditionally, this approach has been used to deliver sign translation on TV and in cinemas because it doesn’t require any additional technology to be reproduced. This approach is, however, very inflexible since it forces all users to consume the sign translation without possibilities for personal choice, in particular without allowing the choice of using a different sign language (from a different country) for the sign translation.

On the Web, this approach is discouraged. Sign translation that is provided as a small PIP video is particularly hard to see in the small, embedded videos that are typical for Web video. Therefore only legacy content where video without the burnt-in sign translation is not available should be published in this way. Where possible, the sign translation should exist as separate content.

·     In-band: sign translation is provided as a separate track in the media resource. This allows independent activation and deactivation of the extra information. It requires web browsers to support handling of multitrack video.

·     External: sign translation is provided as a separate resource and linked to the media resource through HTML markup. Similar to separate tracks, this allows independent activation and deactivation of the extra information. It requires browsers to synchronize the playback of two video resources.

Clear Audio

Clear audio is not alternative content for hearing-impaired users but a more generally applicable feature that improves the usability of audio content. It is commonly accepted that speech is the most important part of an audio track, since it conveys the most information. In modern multitrack content, speech is sometimes provided as a separate track apart from the sound environment. A good example is karaoke music content, but separate speech tracks can also easily be provided for professionally produced video content such as movies, animations, or TV series.

Many users have problems understanding the speech in a mixed audio track. But, when the speech is provided in a separate track, it is possible to allow increasing the volume of the speech track independently of the rest of the audio tracks, thus rendering “clearer audio”—that is, more comprehensible speech.

Technically, this can only be realized if there is a separate speech track available, either as a separate in-band track or as a separate external resource.
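A sketch of how the volumes for such a "clear audio" control might be computed. The clearAudioVolumes name and the 0.5 factors are arbitrary choices for this illustration; the point is simply that the speech track's volume rises while the rest of the mix is ducked.

```javascript
// sketch: map a speech "boost" preference in [0,1] to volumes for the
// separate speech track and the rest of the audio mix
function clearAudioVolumes(boost) {
  return {
    speech: Math.min(1, 0.5 + 0.5 * boost), // emphasize speech
    rest:   1 - 0.5 * boost                 // duck the background
  };
}

// applying it to a multitrack setup might look like this, assuming the
// speech track is a separate <audio id="speech"> element kept in sync
// with the video:
//   var v = clearAudioVolumes(0.8);
//   document.getElementById("speech").volume = v.speech;
//   video.volume = v.rest;
```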

Deaf-Blind Users

It is very hard to provide alternative content for users who can neither see nor hear. The only means of consumption for them is basically Braille, which requires text-based alternative content.

Individual Consumption

If deaf-blind users consume the audiovisual content by themselves, it makes sense to provide a transcript that contains a description of what is happening both on screen and in the audio. It’s basically a combination of a text video description and an audio transcript. The technical realization of this is thus best as a combined transcript. Interestingly, Braille devices are very good at navigating hypertext, so some form of transcript enhanced with navigation markers is also useful.

Shared Viewing Environment

In a shared viewing environment where the deaf-blind user consumes the content together with a seeing and/or hearing person, the combination of text and audio descriptions needs to be provided synchronously with the video playback. A typical Braille reading speed is 60 words per minute. Compare that to the average adult reading speed of around 250 to 300 words per minute or even a usual speaking speed of 130–200 words per minute and you realize that it will be hard for a deaf-blind person to follow along with any normal audiovisual presentation. A summarized version may be necessary, which can still be provided in sync just as text descriptions are provided in sync and can be handed through to a Braille device. The technical realization of this is thus either as an interactive transcript or through a summarized text description.

Learning Support

Some users prefer to slow down the playback speed to help them perceive and understand audiovisual content; for others, the normal playback speed is too slow. Vision-impaired users, in particular, have learned to digest audio at phenomenal rates. For such users, it is very helpful to be able to slow down or speed up a video or audio resource's playback rate. Such speed changes require keeping the pitch of the audio constant to maintain intelligibility.
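A sketch of a speed control that steps through a fixed list of rates; the SPEEDS values and the nextSpeed name are arbitrary choices for this example. The resulting value would be assigned to the media element's playbackRate property, and browsers generally keep the audio pitch constant when playbackRate changes, which is what makes speed adjustment usable for speech.

```javascript
// sketch: step through a fixed set of playback speeds
var SPEEDS = [0.5, 0.75, 1.0, 1.25, 1.5, 2.0];

function nextSpeed(current, direction) {
  var i = SPEEDS.indexOf(current);
  if (i === -1) return 1.0; // unknown rate: reset to normal speed
  // move one step up or down, clamped to the ends of the list
  i = Math.min(SPEEDS.length - 1, Math.max(0, i + direction));
  return SPEEDS[i];
}

// with a real element:
//   video.playbackRate = nextSpeed(video.playbackRate, +1); // faster
//   video.playbackRate = nextSpeed(video.playbackRate, -1); // slower
```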

A feature that can be very helpful to those with learning disabilities is the ability to provide explanations. For example, whenever an uncommon word is used, it can be very helpful to pop up an explanation of the term (e.g., through a link to Wikipedia or to a dictionary). This is somewhat analogous to the aims of enhanced captions and can be provided in the same manner through allowing hyperlinks and/or overlays.

With learning material, we can also provide grammatical markup of the content in time-synchronicity. This is often used for linguistic research but can also help people with learning disabilities to understand the content better. Grammatical markup can be augmented onto captions or subtitles to provide a transcription of the grammatical role of the words in the given context. Alternatively, the grammatical roles can be provided just as markers for time segments, relying on the audio to provide the actual words.

Under the learning category we can also subsume the use case of music lyrics or karaoke. These provide, like captions, a time-synchronized display of the spoken (or sung) text for users to follow along. Here, they help users learn and understand the lyrics. Similar to captions, they can be technically realized through burning-in, in-band multitrack, or external tracks.

Foreign Users

Users who do not speak the language that is used in the audio track of audiovisual content are regarded as foreign users. Such users also require alternative content to allow them to comprehend the media content.

Scene Text Translations

The video track typically poses only a small challenge to foreign users. Most scene text is not important enough to be translated or can be comprehended from context. However, sometimes there is on-screen text such as titles that explain the location, for which a translation would be useful. It is recommended to include such text in the subtitles.

Audio Translations

There are two ways in which an audio track can be made accessible to a foreign user.

·     Dubbing: provide a supplementary audio track that can be used as a replacement for the original audio track. This supplementary audio track can be provided in-band with a multitrack audiovisual resource, or external as a linked resource, where playback needs to be synchronized.

·     (Enhanced) Subtitles: provide a text translation of what is being said in the audio track. This supplementary text track can be provided burnt-in, in-band, or as an external resource, just like captions. And just like captions, burnt-in subtitles are discouraged because of their inflexibility.
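As a sketch of the external approach for subtitles, `<track>` elements with kind="subtitles" and an appropriate @srclang can be offered per language; the file names below are illustrative only.

```html
<video controls>
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <track kind="subtitles" src="subtitles_de.vtt" srclang="de" label="Deutsch">
  <track kind="subtitles" src="subtitles_fr.vtt" srclang="fr" label="Français">
</video>
```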

Technology Summary

When analyzing the different types of technologies that are necessary to provide alternatives to the original content and satisfy special user requirements, we can see that they broadly fall into the following different classes:

·     Burnt-in: this type of alternative content is actually not provided as an alternative but as part of the main resource. Since there is no means to turn this off (other than through signal processing), no HTML5 specifications need to be developed to support them.

·     Page text: this type covers the transcriptions that can be consumed either in relation to the video or completely independent of it.

·     Synchronized text: this type covers text, in-band or external, that is displayed in sync with the content and includes text descriptions, captions, and subtitles.

·     Synchronized media: this type covers audio or video, in-band or external, that is displayed in sync with the content and includes audio descriptions, sign translation, and dubbing.

·     Navigation: this is mostly a requirement for vision-impaired users or mobility-impaired users but is generally useful to all users.

With a basic understanding of what we are dealing with and the needs to be satisfied, let’s take a look at some methods for making media accessible. We’ll start with the obvious: transcripts.


Transcripts

Transcripts provide a full-text rendering, either interactive or plain, of the audio track found in audiovisual resources. This is a great means of making audiovisual content accessible to hard-of-hearing users and, in fact, to anyone. It can be more efficient to read, or cross-read, a transcript of an audio or video resource than to sit through its full extent. One site that provides transcripts with its video content is shown in Figure 4-1.


Figure 4-1. A site that provides transcripts alongside its video content

In the following code listings, we've kept a numbering scheme such that you can easily identify each code example from its listing number. For example, Listings 4-1a and 4-1b are from example 1 in Chapter 4.

The code block in Listing 4-1a shows an example of how to link a plain transcript to a media element.

Listing 4-1a. Providing a Plain Transcript for a Video Element

<video poster="img/ElephantDreams.png" controls>
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
</video>
<a id="videoTranscript" href="ElephantDreams.html">
  Read the transcript for this video.</a>

In this example the transcript is a linked HTML document named ElephantDreams.html. When the page loads, the link appears under the video, allowing a (hearing-impaired) user to read about the content of the video. The transcript, Listing 4-1b, is a very basic HTML document.

Listing 4-1b. The Transcript Is a Very Basic HTML Document

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Media Accessibility Demo</title>
  </head>
  <body>
    <h1>Transcript:
      <a href="../media/video_elephant.ogv">Elephant’s Dream</a></h1>
    <p>Screen text: "The orange open movie project presents"</p>
    <p>[Introductory titles are showing on the background of a water pool
       with fishes swimming and mechanical objects lying on a stone floor.]</p>
    <p>"Elephant’s Dream"</p>
    <p>Proog: At the left we can see... At the right we can see the...the
       head-snarlers. Everything is safe. Perfectly safe.</p>
    <p>[Two people stand on a small bridge.]</p>
    <p>Proog: Emo? Emo! Watch out!</p>
  </body>
</html>

When you open the file in a browser (see Figure 4-2), you will see the HTML page of the video with the hyperlink underneath it. Click the link and the transcript HTML page opens.


Figure 4-2. Plain external transcript linked to a video element

The transcript shown in Figure 4-2 has a transcription of both the spoken text and of what is happening in the video. This makes sense, since the transcript is independent of the video and as such it must contain everything that happens in the video. It also represents both a text description and a transcript, making it suitable for deaf-blind users once rendered into Braille.

Interactive Transcripts

In the previous example the transcript was presented in the form of a separate HTML document that opened in its own window. In many respects this is a static method of providing a transcript. Interactive transcripts provide the experience in a whole different manner. They not only provide a transcription of the spoken text and what is happening in the video but they also move in time with the video and don’t require a separate window.

Currently there is no HTML5 specification that provides such an interactive transcript via markup. Therefore, interactivity has to be accomplished through the use of JavaScript and a series of HTML <div> elements to hold the text cues for the screen reader.

The HTML markup of an example can be seen in the following code in Listing 4-2a:

Listing 4-2a. The HTML Provides the Timing and the Transcript

<div id="videoBox">
  <video poster="img/ElephantDreams.png" controls>
    <source src="video/ElephantDreams.mp4"  type="video/mp4">
    <source src="video/ElephantDreams.webm" type="video/webm">
  </video>
</div>
<div id="speaking" aria-live="rude"></div>
<div id="transcriptBox">
  <h4>Interactive Transcript</h4>
  <p style="font-size:small;">Click on text to play video from there.</p>
  <div id="transcriptText">
    <p id="c1" class="cue" data-time="0.0" aria-live="rude" tabindex="1">
      [Screen text: "The orange open movie project presents"]
    </p>
    <p id="c2" class="cue" data-time="5.0" aria-live="rude" tabindex="1">
      [Introductory titles are showing on the background of a water pool
      with fishes swimming and mechanical objects lying on a stone floor.]
    </p>
    <p id="c3" class="cue" data-time="12.0" aria-live="rude" tabindex="1">
      [Screen text: "Elephant’s Dream"]
    </p>
    <p id="c4" class="cue" data-time="15.0" tabindex="1">
      Proog: At the left we can see... At the right we can see the... the
      head-snarlers. Everything is safe. Perfectly safe. Emo? Emo!
    </p>
  </div>
</div>

Next to the <video> element, we provide a <div> element with the id of speaking. This element receives the text cues that a screen reader should read out. It has an @aria-live attribute for this purpose, which tells the screen reader to read out whatever text has changed inside the element as soon as the change happens. This provides a simple means of rendering text descriptions for vision-impaired users.

Next, we provide a scrollable <div> named transcriptBox to display the transcript in. Each cue within transcriptBox is provided with a @data-time attribute, which contains its start time and a @tabindex to allow vision-impaired users to navigate through it by pressing the Tab key. A cue implicitly ends with the next cue.
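Since a cue implicitly ends where the next cue starts, the logic for finding the active cue can be isolated in a small helper. The following sketch (findCueIndex is our own name, not a browser API) maps a playback time to the index of the cue that should be highlighted, given the sorted start times used in Listing 4-2a:

```javascript
// Sketch: map a playback time to the index of the active cue.
// startTimes must be sorted ascending; returns -1 if the time is
// before the first cue. (findCueIndex is a hypothetical helper.)
function findCueIndex(startTimes, time) {
  var active = -1;
  for (var i = 0; i < startTimes.length; i++) {
    if (time >= startTimes[i]) {
      active = i;   // this cue has started...
    } else {
      break;        // ...and the next one has not
    }
  }
  return active;
}

// Example with the start times used above:
var starts = [0.0, 5.0, 12.0, 15.0];
console.log(findCueIndex(starts, 13.2)); // index 2: the "Elephant's Dream" cue
```

The same function can drive both the highlighting and the scrolling logic in the timeupdate handler.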

Figure 4-3 shows the result that we want to achieve.


Figure 4-3. Interactive transcript for a video element

The JavaScript that creates the interactivity and renders the text descriptions is shown in Listing 4-2b.

Listing 4-2b. JavaScript Provides the Interactivity for a Transcript

window.onload = function() {
  // get the video and transcript elements
  var video = document.getElementsByTagName("video")[0];
  var transcript = document.getElementById("transcriptBox");
  var trans_text = document.getElementById("transcriptText");
  var speaking = document.getElementById("speaking");
  var current = -1;

  // register events for the clicks on the text
  var cues = document.getElementsByClassName("cue");
  for (var i=0; i<cues.length; i++) {
    cues[i].addEventListener("click", function(evt) {
      var start = parseFloat(this.getAttribute("data-time"));
      video.currentTime = start;
      video.play();
    }, false);
  }

  // pause video as you mouse over transcript
  transcript.addEventListener("mouseover", function(evt) {
    video.pause();
  }, false);

  // scroll to text as video time changes
  video.addEventListener("timeupdate", function(evt) {
    if (video.paused || video.ended) {
      return;
    }
    // find and highlight the cue at the currently playing time offset
    for (var i=0; i<cues.length; i++) {
      var cueTime = parseFloat(cues[i].getAttribute("data-time"));
      var nextTime = (i+1 < cues.length) ?
          parseFloat(cues[i+1].getAttribute("data-time")) : video.duration;
      if (cues[i].className.indexOf("current") == -1 &&
          video.currentTime >= cueTime &&
          video.currentTime < nextTime) {
        trans_text.scrollTop =
          cues[i].offsetTop - trans_text.offsetTop;
        if (current >= 0) {
          cues[current].className = "cue";
        }
        cues[i].className += " current";
        current = i;
        if (cues[i].hasAttribute("aria-live")) {
          speaking.innerHTML = cues[i].innerHTML;
        }
      }
    }
  }, false);
};

As you can see, the JavaScript handles the following functions:

·     Register an onclick event handler on the cues, such that it is possible to use them to navigate around the video.

·     Register an onmouseover event handler on the transcription box, such that the video is paused as soon as you move the mouse into the transcription box for navigation.

·     Register an ontimeupdate event handler on the video, which checks the scrolling position of the text and scrolls it up as necessary, sets a background color on the currently active cue, and also checks for an @aria-live attribute on the cue, so that content which is not spoken in the video is read out by a screenreader.

The elements, as designed here, work both for vision- and hearing-impaired users. As you click the Play button on the video, the video plays back normally and the caption text that is part of the interactive transcript is displayed in a scrolling display on the right, highlighting the current cue. If you have a screenreader enabled, the markup in the transcript that has been marked with an @aria-live attribute is copied to the screenreader to be read out at the appropriate time. Click on a piece of text and the video moves to that position in the playback.

The <track> Element: Subtitles, Captions, and Text Descriptions

Now that you know how transcripts can be included with your video projects, let’s turn our attention to captions, subtitles, and descriptions, which are typically authored separately from your web page. As such, HTML5 has introduced special markup and APIs (application programming interfaces) to automatically synchronize these external files with the video’s timeline.

In this section we focus on the <track> element and its API. They have been introduced into HTML and let you associate a time-based text file with a media resource. This text file—usually a WebVTT or .vtt file—can be used in a number of ways including adding subtitles, captions, and text descriptions of the media content.

Image Note  It is worth mentioning that browsers may also support other file formats in the <track> element. For example, IE10 supports both WebVTT and TTML (Timed Text Markup Language). TTML is often used by the captioning industry to interchange captions between authoring systems. We won’t discuss TTML in more detail here, because it is only supported in IE and other browsers have explicitly stated that they are not interested in implementing support for it.

WebVTT is a new standard and is supported by all browsers implementing the <track> element. WebVTT provides a simple, extensible, and human-readable format on which to build text tracks.

We are going to get deeper into the details of the WebVTT format’s features in the next section. However, to make use of the <track> element you will need a basic .vtt file, so a basic understanding of the format is helpful here. WebVTT files are UTF-8 text files that consist simply of a “WEBVTT” file identifier followed by a series of so-called cues containing a start and end time and some cue text. Cues need to be separated from each other by empty lines. For example, a simple WebVTT file would be


WEBVTT

00:00:15.000 --> 00:00:17.951
At the left we can see...

00:00:18.166 --> 00:00:20.083
At the right we can see the...

The first line—WEBVTT—must be in all capital letters and is used by the browser to check if it really is a .vtt file. IE10 actually requires this header to be “WEBVTT FILE” and since other browsers ignore the extra text, you might as well always author files with that identifier.
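That header check is easy to replicate in script, for instance when validating files before upload. The following sketch (hasWebVTTHeader is a hypothetical helper of our own, not a browser API) accepts both the plain “WEBVTT” identifier and the “WEBVTT FILE” form that IE10 expects:

```javascript
// Sketch: check whether text begins with a valid WebVTT header line.
// The header must be "WEBVTT" alone, or "WEBVTT" followed by a space
// or tab and arbitrary text (such as "FILE").
function hasWebVTTHeader(text) {
  var firstLine = text.split(/\r?\n/)[0];
  return firstLine === "WEBVTT" ||
         firstLine.indexOf("WEBVTT ") === 0 ||
         firstLine.indexOf("WEBVTT\t") === 0;
}

console.log(hasWebVTTHeader("WEBVTT FILE\n\n00:00:15.000 --> ..."));  // true
console.log(hasWebVTTHeader("webvtt\n"));                             // false
```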

The time markers in the cues provide the start and end time of each cue, expressed as hh:mm:ss.mmm, and the cue text is the text that appears on the screen. In this case, the words “At the left we can see…” will be visible from 15 seconds to 17.951 seconds of the video’s playback timeline. Any word processor or editor that creates a plain text file can be used to create a .vtt file.
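Converting such a timestamp to seconds is simple arithmetic. The sketch below (parseVTTTime is our own name for illustration) handles the full hh:mm:ss.mmm form shown above:

```javascript
// Sketch: convert a WebVTT timestamp (hh:mm:ss.mmm) to seconds.
function parseVTTTime(stamp) {
  var parts = stamp.split(":");     // e.g. ["00", "00", "17.951"]
  var hours = parseInt(parts[0], 10);
  var minutes = parseInt(parts[1], 10);
  var seconds = parseFloat(parts[2]);
  return hours * 3600 + minutes * 60 + seconds;
}

console.log(parseVTTTime("00:00:17.951")); // 17.951
console.log(parseVTTTime("01:02:03.500")); // 3723.5
```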

Image Note  WebVTT is the modern version of what was formerly called WebSRT. For those of you who already have projects containing subtitles in SRT, you will find VTT follows a very similar approach, and no-frills SRT-to-VTT converters are available online.

With the .vtt file created, it needs to be tied to the <track> element. This element is placed inside either the <audio> or <video> elements and references external time-synchronized text resources—a .vtt file—that align with the <audio> or <video> element’s timeline. In <video> elements, captions and subtitles are rendered on top of the video viewport. Since <audio> elements have no viewport, <track> elements that are children of <audio> elements are not rendered and just made available to script.

Image Note  IE10 requires that .vtt files are served with a mime type of “text/vtt”; otherwise it will ignore them. So, make sure your Web server has this configuration (e.g., for Apache you need to add this to the mime.types file). In your browser page inspector, you can check the “content-type” HTTP header that the Web browser downloads for a .vtt file to confirm that your server is providing the correct mime type.
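For Apache, assuming you can edit the server configuration, the mapping is a single directive; place it in mime.types, httpd.conf, or an .htaccess file, depending on your setup:

```
AddType text/vtt .vtt
```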

Let’s look at the <track> element’s content attributes.


@src

Naturally, this attribute references an external text track file. Listing 4-3 is a simple code example of a <track> element with a reference to a WebVTT file.

Listing 4-3. Example of <track> Markup with a .vtt File

<video controls poster="img/ElephantDreams.png">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <track src="tracks/ElephantDreams_en.vtt">
</video>

The @src attribute only creates the reference to an external text track file. It does not activate it, but it allows browsers to make a listing of the referenced tracks available to the user. This is normally displayed via a menu in the video’s controls.

Image Note  If you are working along with us, it is important to run the <track> examples on a Web server and not locally. Documents loaded from file URLs have special security restrictions in Blink-based browsers to stop malicious scripts you may have saved to your desktop from doing bad things. For Chrome, you can also avoid the issue by running it with a command-line flag: chrome --disable-web-security.

Figure 4-4 shows the resulting display of Listing 4-3 in Safari (left) and Google Chrome (right).


Figure 4-4. Video element with a track child element in Safari and Google Chrome

Safari, as shown on the left in Figure 4-4, has a menu behind a speech bubble on the video controls. You activate the menu by clicking on the speech bubble. The track we defined is listed as “Unknown” in the menu. Activate that track by clicking on it and you can watch the rendered subtitles.

Google Chrome, on the right, shows a “CC” button through which you can activate captions and subtitles. If you click that button and watch the video, you will be able to see the subtitles that loaded from the .vtt file rendered on top of the video.

Opera looks identical to Google Chrome. In Firefox, no subtitle activation button is available as yet. We will explain how you can still activate a subtitle track via markup next. Alternatively, you can also do so from JavaScript, which we will also look at later in this chapter.

Internet Explorer (see Figure 4-5) is a combination of Safari and Chrome. It includes the CC button which, when clicked, shows you the name of the track. Click the name and the subtitles are rendered.


Figure 4-5. Video element with a track child element in Internet Explorer

Image Note  You may have noticed Safari and Internet Explorer provide the most useful visual activation and selection mechanisms for text tracks: a menu activated from the video controls. All browsers intend to implement this feature, but not all of them have reached that state. Google Chrome and Opera show a single “CC” button for now, which activates the most appropriate subtitle track (e.g., English if your browser language is set to English).


@default

The next attribute—@default—allows a Web page author to pick one text track and mark it as activated by default. It is a Boolean attribute: its mere presence activates the track, so no value needs to be given. Listing 4-4 provides an example:

Listing 4-4. Example of <track> Markup with a .vtt File, Activated with @default

<video controls poster="img/ElephantDreams.png">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <track src="tracks/ElephantDreams_en.vtt" default>
</video>

You can see in Figure 4-6 how the “CC” button in Opera on the left and the menu selection in Safari on the right are automatically turned on. Google Chrome, like Opera, automatically turns on the “CC” button.


Figure 4-6. Video element with @default activated track child element in Opera (left) and Safari (right)

Unlike in the previous example, you can now also play back the video in Firefox and see the subtitles displayed. By using the @default attribute, the subtitles contained in the .vtt file, as shown in Figure 4-7, are now activated. Just be aware that the subtitles will be hidden by the video controls if the user places the cursor at the bottom of the video.

Internet Explorer, as shown in Figure 4-7, not only activates captions but shows you the default track that is playing. It’s currently called “untitled,” so we need to give it a proper name.


Figure 4-7. Video element with @default activated track child element in Firefox and Internet Explorer


@label

We just learned that tracks not given a name are shown with a generic label of “Unknown” or “untitled” in the track selection menu. We can fix this by providing an explicit @label attribute.

Listing 4-5 provides an example.

Listing 4-5. Example of <track> Markup with a .vtt File and a @label

<video controls poster="img/ElephantDreams.png">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <track src="tracks/ElephantDreams_en.vtt" default label="English">
</video>

Figure 4-8 shows how the label renders in Safari. “English” is much more obvious and understandable than “Unknown.”


Figure 4-8. Video element with a named track child element using @label in Safari (left) and Internet Explorer (right)


@srclang

Now that the track has a label, the user can tell that it is an English track. This information should not be hidden from the browser either. If we let the browser know that we are dealing with an English track, the browser can automatically activate that track for users whose preferred subtitle language is English. The browser retrieves such user preferences from its own or the operating system’s settings. To enable browsers to pick the right tracks for the user’s preferences, we have the @srclang attribute, which is given an IETF (Internet Engineering Task Force) language code according to BCP 47 to distinguish different tracks from each other.

Image Note  Browsers haven’t yet extended their browser preferences to include preference settings about the activation of text tracks. However, some browsers use platform settings to deal with this, in particular Safari.

Also note that there are other valid uses for providing information about the track resource’s language in @srclang (e.g., Google indexing or automatic translation).

Listing 4-6 shows an example of how to use @srclang.

Listing 4-6. Example of <track> Markup with a .vtt File and a @srclang

<video controls poster="img/ElephantDreams.png">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <track src="tracks/ElephantDreams_en.vtt" srclang="en">
</video>

A key aspect of Listing 4-6 is that it doesn’t include the @label or @default attributes. The only attribute is @srclang. When rendered in Safari, the track, as shown in Figure 4-9, is still labeled “English” in the menu. This figure also shows the native OSX Accessibility Preference settings for captions. In this case captions appear as large text in Safari, and this choice also turns the “Auto (Recommended)” track selection on which, in turn, activates the English track.


Figure 4-9. Video element with a @srclang attribute and default activation on the platform


@kind

The @kind attribute specifies the type of text track you are dealing with. The available @kind attribute values are:

·     subtitles : transcription or translation of the dialogue, suitable for when the sound is available but not understood (e.g., the user does not understand the language of the media resource’s soundtrack). Such tracks are suitable for internationalization purposes.

·     captions : transcription or translation of the dialog, sound effects, relevant musical cues, and other relevant audio information, suitable for when the soundtrack is unavailable (e.g., the dialog is muted, drowned-out by ambient noise, or because the user is deaf). Such tracks are suitable for hard-of-hearing users.

·     descriptions: textual descriptions of the video component of the media resource, intended to be synthesized as audio. They are useful when the visual component is obscured, unavailable, or not usable (e.g., because the user is blind). Such tracks are suitable for vision-impaired users.

·     chapters: chapter titles are to be used for navigating the media resource. Such tracks are displayed as an interactive (potentially nested) list in the browser’s interface.

·     metadata : tracks intended for use from JavaScript. The browser does not render these tracks.

If no @kind attribute is specified, the value defaults to “subtitles,” which is what we experienced in the previous examples.

Tracks that are marked as subtitles or captions will be rendered, if activated, in the video viewport. Only one caption or subtitle track can be activated at any one point in time. This also means that only one of these tracks should be authored with a @default attribute—otherwise the browser does not have a clue which one gets activated by default.
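A page script can sanity-check this constraint before relying on browser behavior. The following sketch (countDefaults is our own helper name, not an API) counts the rendered track kinds that claim @default; it operates on plain objects so it can be tested standalone, but in a page you could pass the elements returned by document.querySelectorAll("track"):

```javascript
// Sketch: count how many subtitle/caption tracks claim @default.
// Works on any list of objects with "kind" and "default" properties.
function countDefaults(tracks) {
  var count = 0;
  for (var i = 0; i < tracks.length; i++) {
    var kind = tracks[i].kind || "subtitles"; // @kind defaults to subtitles
    if ((kind === "subtitles" || kind === "captions") && tracks[i].default) {
      count++;
    }
  }
  return count;
}

var sample = [
  { kind: "subtitles", default: true  },
  { kind: "captions",  default: false },
  { kind: "metadata",  default: true  }  // metadata is not rendered, ignored
];
console.log(countDefaults(sample)); // 1 - at most one default, markup is fine
```

A result greater than 1 indicates markup the browser cannot resolve unambiguously.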

Tracks marked as descriptions, if activated, will synthesize their cues into audio—possibly via the screen reader API. Since screen readers are also the intermediaries to Braille devices, this is sufficient to make the descriptions accessible to vision-impaired users. Only one descriptions track can be active at any one time.

Image Note  At the time of this writing, no browser supports such “rendering” of descriptions. There are, however, two Chrome extensions that render descriptions: one that uses Chrome’s text-to-speech API and one that uses a screenreader, if one is installed.

Tracks marked as chapters are provided for navigation purposes. It is expected this feature will be realized in browsers through a menu or other form of navigation markers on the media controls’ timeline. No browser, as of yet, natively supports chapter rendering.

Finally, tracks marked as metadata will not be rendered visually, but only exposed to JavaScript. A web developer can do anything with this metadata, which can consist of any text web page scripts can decode. This includes JSON, XML, or any other special-purpose markup as well as image URLs to provide thumbnails of the video for navigation or subtitles with hyperlinks such as those used in advertising.
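For example, if a metadata track carries JSON in its cue text, a cuechange handler can decode it with JSON.parse. The sketch below shows only the decoding step on a cue-like stand-in object; the thumbnail and chapter field names are invented for illustration:

```javascript
// Sketch: decode a JSON payload from a metadata cue's text.
// In a real page, "cue" would come from track.activeCues[0] inside
// an oncuechange handler; here a plain object stands in.
function decodeMetadataCue(cue) {
  try {
    return JSON.parse(cue.text);
  } catch (e) {
    return null; // not JSON - leave it to another handler
  }
}

var cue = { text: '{"thumbnail": "thumb1.png", "chapter": "Intro"}' };
var data = decodeMetadataCue(cue);
console.log(data.thumbnail); // "thumb1.png"
```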

Listing 4-7 is a code example containing each of these track types.

Listing 4-7. Example of <track> Markup with a Track of Each @kind

<video controls poster="img/ElephantDreams.png">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <track src="tracks/ElephantDreams_zh.vtt" srclang="zh" kind="subtitles">
  <track src="tracks/ElephantDreams_jp.vtt" srclang="ja" kind="captions">
  <track src="tracks/ElephantDreams_en.vtt"
         srclang="en" kind="metadata" label="Metadata">
  <track src="tracks/ElephantDreams_chapters_en.vtt"
         srclang="en" kind="chapters"  label="Chapters">
  <track src="tracks/ElephantDreams_audesc_en.vtt"
         srclang="en" kind="descriptions" label="Descriptions">
</video>

This example contains the following:

·     Chinese subtitles: srclang="zh" kind="subtitles"

·     Japanese captions: srclang="ja" kind="captions"

·     English metadata: srclang="en" kind="metadata" label="Metadata"

·     English Chapters: srclang="en" kind="chapters" label="Chapters"

·     English Descriptions: srclang="en" kind="descriptions" label="Descriptions"

When viewed in Safari (see Figure 4-10), all of the tracks are exposed. Selecting any of the chapters, descriptions, or metadata tracks doesn’t result in any rendering; it is surprising that Safari even lists them in the menu.


Figure 4-10. Video element in Safari with multiple tracks of different @kind

After selecting the Japanese caption track, we can see (Figure 4-11) the UTF-8 encoded characters rendered correctly on top of the video viewport.


Figure 4-11. Video element with Japanese caption track activated

Despite browsers being a bit behind in implementing the buttons and menus for controlling text tracks, third-party players have started taking advantage of the <track> element and its rendering of captions and subtitles.

For example, JWPlayer, which we explored in Chapter 3, supports captions, chapters, and thumbnails in a “metadata” track contained in a WebVTT file. As shown in Figure 4-12, it renders them with no frills. Chapters are rendered as little markers on the timeline, and when you hover over one you get the title of that chapter. You can also see that, when thumbnails are provided, they pop up as you hover over the JWPlayer timeline.


Figure 4-12. JWPlayer rendering captions, chapters tracks, and preview thumbnails via WebVTT

Image Note  Example WebVTT files for use with JWPlayer are available online.

The WebVTT markup used for the thumbnail timeline is as follows in Listing 4-8:

Listing 4-8. Example WebVTT File for a Track of Kind “Metadata” with Thumbnails


WEBVTT

00:00:00.000 --> 00:00:30.000
thumb1.png

00:01:00.000 --> 00:01:30.000
thumb3.png
The thumbnails were created with the following command-line ffmpeg command, one every 30 seconds:

$ ffmpeg -i video.mp4 -f image2 -vf fps=fps=1/30 thumb%d.png

In-band Text Tracks

WebVTT files don’t necessarily have to be linked externally through the <track> element. They can also be embedded directly into the video file; these are known as in-band tracks. Because MP4 and WebM are container formats, a WebVTT file can be added directly to the container, typically by being multiplexed into the file as a data track. This is a relatively new technique and browsers are just now starting to add in-band support. To learn more about this emerging technique for the various formats, the relevant specifications are the best starting point:

·     WebM has a specification for storing WebVTT in-band.

·     MPEG-4 has a specification for embedding WebVTT in-band.

·     MPEG DASH can deal with WebVTT.

·     So can Apple’s HLS.

At the moment there are no visual editors that will embed a WebVTT track into a media file. There are, however, a couple of command-line approaches to adding these tracks to an mp4 or webm file: you can use MP4Box to author WebVTT in MPEG-4 and ffmpeg to author WebVTT in WebM.

Here is an example of how to create an mp4 file with a WebVTT track using MP4Box:

$ mp4box -add Monty_subs_en.vtt:FMT=VTT:lang=en Monty_subtitles.mp4

This command adds the Monty_subs_en.vtt subtitle track to Monty_subtitles.mp4.

Following is an example of how to create a webm file with a WebVTT track using ffmpeg:

$ ffmpeg -i Monty.mp4 -i Monty_subs_en.vtt -metadata:s:s:0 kind="captions" \
         -scodec copy Monty_subtitles.webm

It tells ffmpeg to use Monty.mp4 as the input media file, tells it to use Monty_subs_en.vtt as the input file for WebVTT captions to be copied into the WebM file, and gives the subtitle track a kind of “captions.”

Though this is a relatively new technique, HTML5 has made it such that in-band text tracks are exposed in Web browsers identically to external tracks that are defined in <track>. This means that the same JavaScript API is available regardless of the text-track’s origin.
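Because in-band and external tracks surface through the same API, code that inventories a media element’s tracks never needs to know where they came from. The following sketch works on any TextTrackList-like collection; in a browser you would pass video.textTracks, while here an array of plain objects stands in:

```javascript
// Sketch: list kind/language/label for every text track of a video,
// regardless of whether the track is in-band or from a <track> element.
function listTracks(textTracks) {
  var result = [];
  for (var i = 0; i < textTracks.length; i++) {
    var t = textTracks[i];
    result.push(t.kind + " (" + t.language + "): " + t.label);
  }
  return result;
}

// Stand-in for what video.textTracks might contain:
var fakeList = [
  { kind: "subtitles", language: "en", label: "English" },
  { kind: "chapters",  language: "en", label: "Chapters" }
];
console.log(listTracks(fakeList)[0]); // "subtitles (en): English"
```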

Image Note  As a Web developer you can choose to publish your WebVTT files as independent files or make use of video files that have WebVTT in-band. Browser support, at the time of this writing, is not consistent. With this in mind we recommend using external text track files—.vtt files—not in-band tracks, until such time the browsers have a consistent implementation of in-band text tracks.

JavaScript API: Flexibility for Web Developers

As we pointed out in Chapter 3, JavaScript can be used to extend the functionality of the various elements of a web page. In this case JavaScript can be used to manipulate the text tracks used in a media source whether that text track is in-band or external to the media. This opens up a number of creative possibilities to web developers and designers looking to produce accessible video or audio content. In this section we review the JavaScript API as it pertains to external text tracks.

We start with the <track> element.

Track Element

The IDL (Interface Definition Language) interface of the track element looks as follows:

interface HTMLTrackElement : HTMLElement {
            attribute DOMString kind;
            attribute DOMString src;
            attribute DOMString srclang;
            attribute DOMString label;
            attribute boolean default;
   const unsigned short NONE = 0;
   const unsigned short LOADING = 1;
   const unsigned short LOADED = 2;
   const unsigned short ERROR = 3;
   readonly attribute unsigned short readyState;
   readonly attribute TextTrack track;
};

This IDL is the object that represents a <track> element and is available for every <track> element in the markup. The IDL attributes kind, src, srclang, label, and default contain the value of the content attributes of the same names as introduced earlier. As with the audio and video elements, the remaining DOM attributes reflect the current state of the track element.


@readyState

The @readyState IDL attribute is a read-only attribute that represents the current readiness state of the track element. The available states are as follows:

·     NONE(0): indicates that the text track’s cues have not been obtained.

·     LOADING(1): indicates that the text track is loading and there have been no fatal errors encountered so far. Further cues might still be added to the track by the parser.

·     LOADED(2): indicates that the text track has been loaded with no fatal errors.

·     ERROR(3): indicates that the text track was enabled, but when the user agent attempted to obtain it, this failed in some way (e.g., the URL could not be resolved, a network error occurred, or the text track format was unknown). Some or all of the cues are likely missing and will not be obtained.

The readiness state of a text track changes dynamically as the track is obtained.

As a JavaScript developer, it is useful to make sure that all of the text tracks you expect to be loaded actually did load and did not end in an ERROR state. If you are displaying your own menu of available subtitle tracks, this is particularly important, since you may only want to offer tracks for selection if they can actually be loaded.
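A custom subtitle menu should therefore filter out tracks that ended in the ERROR state. The sketch below (loadableTracks is our own name) does this over track-element-like stubs; in a page you would pass the result of document.getElementsByTagName("track"):

```javascript
// Sketch: keep only track elements that have not failed to load.
// readyState values: 0 NONE, 1 LOADING, 2 LOADED, 3 ERROR.
var TRACK_ERROR = 3;

function loadableTracks(trackElements) {
  var ok = [];
  for (var i = 0; i < trackElements.length; i++) {
    if (trackElements[i].readyState !== TRACK_ERROR) {
      ok.push(trackElements[i]);
    }
  }
  return ok;
}

// Stand-ins for HTMLTrackElement objects:
var stubs = [
  { label: "English",    readyState: 2 },
  { label: "Australian", readyState: 3 }, // failed: missing .vtt file
  { label: "Chapters",   readyState: 1 }
];
console.log(loadableTracks(stubs).length); // 2
```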


@track

As discussed earlier, the objects created for in-band text tracks and for external <track>-referenced text tracks are identical: they are instantiations of the TextTrack object. This attribute links to the TextTrack object of the respective <track> element.

Image Note  In the next examples we are going to be using an extract of a video called “A Digital Media Primer for Geeks” by Monty Montgomery, published under the Creative Commons Attribution NonCommercial ShareAlike License. We thank the foundation for making this video available—the full video and others in the series are well worth checking out.

To get a better feeling for how the attributes in the Track Element’s IDL work together, Listing 4-9 displays the value of all the IDL attributes of a track element on load and then the readyState just after playback starts.

Listing 4-9. IDL Attributes of the <track> Element

<video poster="img/Monty.jpg" controls width="50%">
  <source src="video/Monty.mp4"  type="video/mp4">
  <source src="video/Monty.webm" type="video/webm">
  <track label="English" src="tracks/Monty_subs_en.vtt" kind="subtitles"
         srclang="en" default>
</video>
<h3>Attribute values:</h3>
<p id="values"></p>
<script>
var video = document.getElementsByTagName('video')[0];
var track = document.getElementsByTagName('track')[0];
var values = document.getElementById('values');
values.innerHTML += "Kind: " + track.kind + "<br/>";
values.innerHTML += "Src: " + track.src + "<br/>";
values.innerHTML += "Srclang: " + track.srclang + "<br/>";
values.innerHTML += "Label: " + track.label + "<br/>";
values.innerHTML += "Default: " + track.default + "<br/>";
values.innerHTML += "ReadyState: " + track.readyState + "<br/>";
values.innerHTML += "Track: " + track.track + "<br/>";

function loaded() {
  values.innerHTML += "ReadyState: " + track.readyState + "<br/>";
}
video.addEventListener("loadedmetadata", loaded, false);
</script>

Figure 4-13 shows the result from Firefox.


Figure 4-13. IDL attributes of a <track> element that is activated by default

All the information regarding the <track> element’s attribute values, including kind, src, srclang, label, and default are represented. You can also see the readyState is, at first, LOADING(1) and when the video starts playing, it changes to LOADED(2).

Before we get to the content of the track attribute, let’s briefly list the events that may be fired at the <track> element.


An onload event is fired at the HTMLTrackElement when the resource referenced in the @src attribute is successfully loaded by the browser—the readyState then also changes to LOADED(2).


An onerror event is fired at the HTMLTrackElement when the resource referenced in the @src attribute fails to load. The readyState then also changes to ERROR(3).


An oncuechange event is fired at the HTMLTrackElement when a cue in that track becomes active or stops being active.

Listing 4-10 is a good example of a code block that captures these events.

Listing 4-10. Catching Events on the <track> Element

<video poster="img/Monty.jpg" controls width="50%" autoplay>
  <source src="video/Monty.mp4"  type="video/mp4">
  <source src="video/Monty.webm" type="video/webm">
  <track label="Australian" src="tracks/Monty_subs_au.vtt" kind="subtitles"
         srclang="en-au" default>
  <track label="English" src="tracks/Monty_subs_en.vtt" kind="subtitles"
         srclang="en">
</video>
<p id="values"></p>
<script>
var video = document.getElementsByTagName('video')[0];
var tracks = document.getElementsByTagName('track');
var values = document.getElementById('values');

function trackloaded(evt) {
  values.innerHTML += "Track loaded: " + + " track<br/>";
}
function trackerror(evt) {
  values.innerHTML += "Track error: " + + " track<br/>";
  tracks[1].track.mode = "showing";
}
function cuechange(evt) {
  values.innerHTML += "Cue change: " + + " track<br/>";
  video.pause();
}
for (var i=0; i < tracks.length; i++) {
  tracks[i].onload = trackloaded;
  tracks[i].onerror = trackerror;
  tracks[i].oncuechange = cuechange;
}
</script>

We’ve deliberately defined and activated by @default a first text track whose @src resource—Monty_subs_au.vtt—does not exist. The result is the triggering of that first error event mentioning the Australian track. In the error event callback we’re activating the second track—Monty_subs_en.vtt—which, in turn, activates the load callback. Then later, when the video playback reaches the first cue, the cuechange event is activated and pauses the video.

Running this in Google Chrome gives us the result shown in Figure 4-14.


Figure 4-14. Catching events on the <track> element

Image Note  There are bugs in the implementation of these events in browsers. For example, Firefox doesn’t seem to raise the load and cuechange events, and Safari doesn’t raise the cuechange event.

Now that we understand the events that can be fired at the HTMLTrackElement, we can turn our attention to the content of the @track attribute, which is a TextTrack object.

TextTrack Object

A TextTrack object is created for every text track that is associated with a media element. This object is created regardless of whether

·     it comes from an external file through the <track> element,

·     it comes through an in-band text track of a media resource; or

·     it is created completely in JavaScript via the addTextTrack() method of the HTMLMediaElement which we will get to later in this chapter.

A TextTrack object’s attribute values are thus sourced either from the HTMLTrackElement’s attribute values, from in-band values, or from the parameters of the addTextTrack() method.

Image Note  A TextTrack object originating from a <track> element is linked both from the HTMLTrackElement object and from the TextTrackList of the media element of which the <track> is a child. In-band tracks and script-created tracks only exist in the TextTrackList of the media element.

The IDL of the TextTrack object looks as follows:

enum TextTrackMode { "disabled",  "hidden",  "showing" };
enum TextTrackKind { "subtitles",  "captions",  "descriptions",  "chapters",  "metadata" };

interface TextTrack : EventTarget {
   readonly attribute TextTrackKind kind;
   readonly attribute DOMString label;
   readonly attribute DOMString language;
   readonly attribute DOMString id;
   readonly attribute DOMString inBandMetadataTrackDispatchType;
            attribute TextTrackMode mode;
   readonly attribute TextTrackCueList? cues;
   readonly attribute TextTrackCueList? activeCues;
   void addCue(TextTrackCue cue);
   void removeCue(TextTrackCue cue);
            attribute EventHandler oncuechange;
};

The first four attributes are as follows:

·     The @kind attribute is restricted by the TextTrackKind object to the legal values that we learned earlier in the <track> element.

·     The @label attribute contains the label string as provided from either the <track> element’s @label attribute, from a field of an in-band track, or from the label parameter of the addTextTrack() method of the HTMLMediaElement.

·     The @language attribute contains the language string either provided from the <track> element’s @srclang attribute, from a field of an in-band track, or from the language parameter of the addTextTrack() method of the HTMLMediaElement.

·     The @id attribute contains the identifier string as provided either from the <track> element’s @id attribute (every element has such an attribute), or from an identifier field of an in-band track.

The remaining attributes in the IDL need a bit more explanation.


@inBandMetadataTrackDispatchType

This is a string extracted from the media resource specifically for a text track of @kind "metadata." This string explains the exact format of the data in the cues, so adequate JavaScript functions can be set up to parse and display that data.

For example, text tracks with particular content formats could contain metadata for ad targeting, trivia game data during game shows, play states during sports games, or recipe information during cooking shows. As such, dedicated script modules could be bound to parsing such tracks using the value of this attribute.

How the data formats are identified is specified in the HTML specification. Since this attribute is very specific to particular kinds of applications and has a rather negligible impact on accessibility, further discussion of this attribute is beyond the scope of this book.


@mode

As defined by the TextTrackMode type, a TextTrack object can have three different modes.

·     Disabled: indicates the text track is not active. In this case, the browser has identified the existence of a <track> element, but it hasn’t downloaded the external track file or parsed it. No cues are active and no events are fired. <track>-defined text tracks that are not activated by default end up in this state initially.

·     Hidden: indicates that the text track’s cues have been or should be obtained, but they are not being shown. The browser is maintaining a list of which cues are active, and events are being fired accordingly. In-band text tracks and JavaScript-created text tracks end up in this state initially.

·     Showing: indicates that the text track’s cues have been or should be obtained and are being displayed if they are of a @kind that renders. The browser is maintaining a list of which cues are active, and events are being fired accordingly. <track>-defined text tracks that are activated with the @default attribute end up in this state initially.
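These modes can be driven from script, for example, to implement a subtitle language menu. The following helper is a sketch of our own (not part of the API): it enables the first subtitle or caption track matching a preferred language and disables the others. It works on any array-like list of track objects, such as a media element's textTracks.

```javascript
// Enable the first subtitle/caption track whose language matches `lang`
// and disable all other subtitle/caption tracks. `textTracks` can be
// video.textTracks or any array-like of { kind, language, mode } objects.
function showSubtitles(textTracks, lang) {
  var shown = null;
  for (var i = 0; i < textTracks.length; i++) {
    var track = textTracks[i];
    // leave chapters, metadata, etc. untouched
    if (track.kind !== "subtitles" && track.kind !== "captions") continue;
    if (shown === null && track.language === lang) {
      track.mode = "showing";   // render this track's cues
      shown = track;
    } else {
      track.mode = "disabled";  // no cues active, no events fired
    }
  }
  return shown; // null if no track matched
}
```

In a page you would call, for example, showSubtitles(video.textTracks, "en") from a menu's click handler.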


@cues

This is the list of loaded TextTrackCues once the TextTrack has become active (i.e., mode is hidden or showing). For continuously loading media files, this list may update continuously as the in-band text track of the media resource continues to be parsed.


@activeCues

This is the list of TextTrackCues on the TextTrack that are currently active. Active cues are those that start before the current playback position and end after the playback position.
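The bookkeeping behind @activeCues can be illustrated in a few lines of script. This is a sketch of the selection rule only, not how browsers implement it; it assumes plain cue objects with startTime and endTime attributes.

```javascript
// A cue is active when startTime <= currentTime < endTime.
function computeActiveCues(cues, currentTime) {
  return cues.filter(function (cue) {
    return cue.startTime <= currentTime && currentTime < cue.endTime;
  });
}
```

A cue whose endTime equals the current playback position is no longer active, which matches the definition above.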

Before we move on to the methods and events used by the TextTrack object, Listing 4-11 gives us an opportunity to inspect the <track> element’s IDL attributes.

Listing 4-11. IDL Attributes of <track> Element’s @track Attribute

<video poster="img/Monty.jpg" controls width="50%">
  <source src="video/Monty.mp4"  type="video/mp4">
  <source src="video/Monty.webm" type="video/webm">
  <track id="track1" label="English" src="tracks/Monty_subs_en.vtt"
         kind="subtitles" srclang="en" default>
</video>
<h3>TextTrack object:</h3>
<p id="values"><b>Before loading:</b><br/></p>
<script>
var video = document.getElementsByTagName('video')[0];
var track = document.getElementsByTagName('track')[0];
var values = document.getElementById('values');
values.innerHTML += JSON.stringify(track.track, undefined, 4) + "<br/>";
values.innerHTML += "track.cues length: " + track.track.cues.length
                    + "<br/>";

function loaded() {
    values.innerHTML += "<b>After loading:</b><br/>";
    values.innerHTML += "track.cues[0]: " + track.track.cues[0] + "<br/>";
    values.innerHTML += "track.cues length: " + track.track.cues.length;
}
video.addEventListener("loadeddata", loaded, false);
</script>

Figure 4-15 shows the value of the TextTrack object in the @track attribute before the <track> element is loaded, and the number of cues after it has loaded, in Google Chrome. You'll see that the length of @cues is 0 before loading and 51 afterward.


Figure 4-15. IDL attribute values of <track> element’s @track attribute shown in Opera

Firefox doesn't actually create the TextTrack object that early, which means the @track attribute is still an empty object prior to loading, but it reports the number of cues consistently.

addCue( )

This method adds a TextTrackCue object to the text track’s list of cues. That means that the object is added to @cues, and also to @activeCues if the media element’s current time is within that cue’s time interval. Note that if the given cue is already in another text track list of cues, then it is removed from that text track list of cues before it is added to this one.

removeCue( )

This method removes a TextTrackCue object from the text track’s list of cues.

oncuechange Event

The cuechange event is raised when one or more cues in the track have become active or stopped being active.

Listing 4-12 provides a JavaScript example of applying the addCue() and removeCue() methods on a TextTrack of a <track> element and capturing the resulting cuechange event.

Listing 4-12. Methods and Events of a <track> Element’s @track Attribute

var video = document.getElementsByTagName('video')[0];
var track = document.getElementsByTagName('track')[0];
var values = document.getElementById('values');

function loaded() {
    var cue = new VTTCue(0.00, 5.00, "This is a script created cue.");
    values.innerHTML += "Number of cues: " + track.track.cues.length
                     + "<br/>";
    track.track.addCue(cue);
    values.innerHTML += "<b>After adding cue:</b><br/>";
    values.innerHTML += "Number of cues: " + track.track.cues.length
                     + "<br/>";
}
video.addEventListener("loadedmetadata", loaded, false);

function playing() {
    values.innerHTML += "<b>After play start:</b><br/>";
    values.innerHTML += "Number of cues: " + track.track.cues.length
                     + "<br/>";
    values.innerHTML += "First cue: "
                     + JSON.stringify(track.track.cues[0].text) + "<br/>";
    function cuechanged() {
        track.track.removeCue(track.track.cues[0]);
        values.innerHTML += "<b>After removing cue:</b><br/>";
        values.innerHTML += "Number of cues: " + track.track.cues.length
                         + "<br/>";
        track.track.removeEventListener("cuechange", cuechanged, false);
    }
    track.track.addEventListener("cuechange", cuechanged, false);
}
video.addEventListener("play", playing, false);

After the video's metadata has loaded, we create a VTTCue (the VTTCue constructor creates a kind of TextTrackCue) and add it to the track. We start with 51 cues and end up with 52. After starting playback, we have all 52 cues loaded and register a cuechange event handler, in which cue 1 is removed to get back to 51 cues. Figure 4-16 shows the result in Google Chrome. Also note that the first cue in the list of 52 cues is the script-created cue.


Figure 4-16. Methods and events of a TextTrack object

Image Note  This example doesn’t work correctly in Firefox because Firefox doesn’t support the oncuechange event on the TextTrack object yet.


TextTrackCue Object

The cues in the @cues and @activeCues attributes of the TextTrack IDL have the following format:

interface TextTrackCue : EventTarget {
   readonly attribute TextTrack? track;
            attribute DOMString id;
            attribute double startTime;
            attribute double endTime;
            attribute boolean pauseOnExit;
            attribute EventHandler onenter;
            attribute EventHandler onexit;
};

These are the basic attributes of a cue. Specific cue formats such as VTTCue can further extend these attributes. Here’s a quick review of the TextTrackCue attributes.


@track

This is the TextTrack object to which this cue belongs, if any, or null otherwise.


@id

This is an identifying string for the cue.

@startTime, @endTime

These are the start and end times of the cue. They relate to the media element’s playback time and define the cue’s active time range.


@pauseOnExit

The @pauseOnExit flag is a Boolean that indicates whether playback of the media resource is to pause when the end of the cue's active time range is reached. It may, for example, be used to pause a video when reaching the end of a cue in order to introduce advertising.

onenter and onexit Events

The enter event is raised when the cue becomes active and the exit event is raised when it stops being active.


TextTrackCueList Object

The @cues and @activeCues attributes of the TextTrack IDL are TextTrackCueList objects of the following format:

interface TextTrackCueList {
   readonly attribute unsigned long length;
   getter TextTrackCue (unsigned long index);
   TextTrackCue? getCueById(DOMString id);
};

@length returns the length of the list.

The getter makes it possible to access a cue list element by index (e.g., cues[i]).

The getCueById() function allows retrieving a TextTrackCue by providing its id string.
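As a sketch of what getCueById() does, here is the equivalent lookup written against a plain array of cue objects (the function name mirrors the API; the implementation is our own):

```javascript
// Returns the first cue whose id matches, or null. Calling with the
// empty string also returns null, mirroring the API's behavior.
function getCueById(cues, id) {
  if (id === "") return null;
  for (var i = 0; i < cues.length; i++) {
    if (cues[i].id === id) return cues[i];
  }
  return null;
}
```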

Listing 4-13 demonstrates how one can go through the list of cues of a track and access the cue attributes.

Listing 4-13. Access the Attributes of All the Cues of a Text Track

<video poster="img/Monty.jpg" controls width="50%">
  <source src="video/Monty.mp4"  type="video/mp4">
  <source src="video/Monty.webm" type="video/webm">
  <track id="track1" label="English" src="tracks/Monty_subs_en.vtt"
         kind="subtitles" srclang="en" default>
</video>
<h3>TextTrack object:</h3>
<table>
  <thead>
    <tr>
      <td>Cue Number</td>
      <td>ID</td>
      <td>Start Time</td>
      <td>End Time</td>
      <td>Text</td>
    </tr>
  </thead>
  <tbody id="values">
  </tbody>
</table>
<script>
var video = document.getElementsByTagName('video')[0];
var track = document.getElementsByTagName('track')[0];
var values = document.getElementById('values');
var content;

function loaded() {
    for (var i=0; i < track.track.cues.length; i++) {
        content = "<tr>";
        content += "<td>" + i + "</td>";
        content += "<td>" + track.track.cues[i].id + "</td>";
        content += "<td>" + track.track.cues[i].startTime + "</td>";
        content += "<td>" + track.track.cues[i].endTime + "</td>";
        content += "<td>" + track.track.cues[i].text + "</td></tr>";
        values.innerHTML += content;
    }
}
video.addEventListener("loadedmetadata", loaded, false);
</script>

When you test this in the browser, as shown in Figure 4-17, you will see that this technique helps to quickly introspect the cues to ensure the order, timing, and text spelling are correct.


Figure 4-17. Listing all cues of a text track

Media Element

We’ve seen how we can access <track> elements, their list of cues, and the content of each of the cues. Now we will take a step away from <track> elements alone and return to the media element’s list of text tracks. This will include in-band text tracks and script-created tracks, too.


First we need to understand the TextTrackList object:

interface TextTrackList : EventTarget {
   readonly attribute unsigned long length;
   getter TextTrack (unsigned long index);
   TextTrack? getTrackById(DOMString id);
            attribute EventHandler onchange;
            attribute EventHandler onaddtrack;
            attribute EventHandler onremovetrack;
};

Similar to the TextTrackCueList object, a TextTrackList is a list of TextTrack objects.

The list’s length is given in the @length attribute.

Individual tracks can be accessed by index (e.g., track[i]).

The getTrackById() method allows retrieving a TextTrack by providing its id string.

In addition, a change event is raised whenever one or more tracks in the list have been enabled or disabled, an addtrack event is raised whenever a track has been added to the track list, and a removetrack event is raised whenever a track has been removed.

To get access to all the text tracks that are associated with an audio or video element, the IDL of the MediaElement is extended with the following attribute and method:

interface HTMLMediaElement : HTMLElement {
  readonly attribute TextTrackList textTracks;
  TextTrack addTextTrack(TextTrackKind kind, optional DOMString label = "",
                         optional DOMString language = "");
};


@textTracks

The @textTracks attribute of media elements is a TextTrackList object that contains the list of text tracks that are available for the media element.

addTextTrack( )

This new method for media elements, addTextTrack(kind, label, language), is used to create a new text track for a media element purely from JavaScript with the given kind, label, and language attribute settings. The new track, if valid, is immediately in the LOADED (2) @readyState and in "hidden" @mode, with an empty @cues TextTrackCueList.

We mentioned earlier that @textTracks contains all tracks that are associated with a media element, regardless of whether they were created by a <track> element, exposed from an in-band text track, or created by JavaScript through the addTextTrack() method. The tracks are actually always accessed in the following order:

1.    <track> created TextTrack objects, in the order that they are in the DOM.

2.    addTextTrack() created TextTrack objects, in the order they were added, oldest first.

3.    In-band text tracks, in the order given in the media resource.
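That ordering can be sketched as a stable sort over a merged list. In this illustration, the data model (a source tag plus each track's position within its own group) is our own, not part of the API:

```javascript
// <track>-created tracks come first (DOM order), then script-created
// tracks (oldest first), then in-band tracks (media resource order).
var SOURCE_RANK = { element: 0, script: 1, inband: 2 };

function orderTextTracks(tracks) {
  return tracks.slice().sort(function (a, b) {
    var rank = SOURCE_RANK[a.source] - SOURCE_RANK[b.source];
    // within a group, keep the group's own ordering
    return rank !== 0 ? rank : a.index - b.index;
  });
}
```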

In Listing 4-14 we have a subtitle track created via a <track> element and a script-created chapters track.

Listing 4-14. List and Access All the Text Tracks of a Video Element

<video poster="img/Monty.jpg" controls width="50%">
  <source src="video/Monty.mp4"  type="video/mp4">
  <source src="video/Monty.webm" type="video/webm">
  <track id="track1" label="English" src="tracks/Monty_subs_en.vtt"
         kind="subtitles" srclang="en" default>
</video>
<h3>TextTrack object:</h3>
<p id="values"></p>
<script>
var video = document.getElementsByTagName('video')[0];
var values = document.getElementById('values');
var new_track = video.addTextTrack("chapters", "English Chapters", "en");

var cue;
cue = new VTTCue(0.00, 7.298, "Opening Credits");
new_track.addCue(cue);
cue = new VTTCue(7.298, 204.142, "Introduction");
new_track.addCue(cue);

function loaded() {
    values.innerHTML += "Number of text tracks: "
                     + video.textTracks.length + "<br/>";
    for (var i=0; i < video.textTracks.length; i++) {
        values.innerHTML += "<b>Track[" + i + "]:</b><br/>";
        values.innerHTML += "Number of cues: "
                         + video.textTracks[i].cues.length + "<br/>";
        values.innerHTML += "First cue: "
                         + video.textTracks[i].cues[0].text + "<br/>";
    }
}
video.addEventListener("loadedmetadata", loaded, false);
</script>

When the video has finished loading, we display the number of text tracks and, for each of them, their number of cues and the text within the first cue. Figure 4-18 shows the result.


Figure 4-18. Listing all the text tracks of a video element

This concludes the description of the JavaScript objects and their APIs, which allow us to deal with general text tracks and retrieve relevant information or react to specific events.

WebVTT: Authoring Subtitles, Captions, Text Descriptions and Chapters

Though we provided a quick look at WebVTT and how it can be used earlier in this chapter, we are going to devote this section of the chapter to a deep-dive into this subject. This will include formatting the cues and captions and positioning them on the video.

As we pointed out, WebVTT is a file format specifically defined to allow authors to create text track cues independently of web pages and to distribute them in separate files. A web page author does not typically create video content; therefore, it would make no sense to require subtitles to be authored as part of web pages.

We have also seen that a simple WebVTT file is a text file consisting of a WEBVTT string signature followed by a list of cues separated by empty lines. Following is an example file:


WEBVTT

1 this is an identifier
00:00:08.124 --> 00:00:10.742
Workstations and high end personal computers have been able to

transcription-line
00:00:10.742 --> 00:00:14.749
manipulate digital audio pretty easily for about fifteen years now.

3
00:00:14.749 --> 00:00:17.470
It's only been about five years that a decent workstation's been able

As you can see, each cue starts with a string on its own line, which is the optional identifier. The next line contains the start and end times for the display of the cue, expressed in the form hh:mm:ss.mmm and separated by the "-->" string. Be aware that the hour and minute segments must each consist of two digits, such as 01 for one hour or one minute. The seconds segment must consist of two digits, followed by a period and three digits for the milliseconds.

The next line or lines of text are the actual cue content, which is the text to be rendered.
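Because the timestamp syntax is strict, it can help to generate and check it from script when producing WebVTT files. The following conversion helpers are a sketch of our own, not part of any API:

```javascript
// 8.124 -> "00:00:08.124"
function toTimestamp(seconds) {
  function pad(n, width) {
    var s = String(n);
    while (s.length < width) s = "0" + s;
    return s;
  }
  var ms = Math.round(seconds * 1000);
  var h  = Math.floor(ms / 3600000);
  var m  = Math.floor(ms / 60000) % 60;
  var s  = Math.floor(ms / 1000) % 60;
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "." + pad(ms % 1000, 3);
}

// "00:00:08.124" -> 8.124
function fromTimestamp(ts) {
  var parts = ts.split(":");          // ["00", "00", "08.124"]
  return parts[0] * 3600 + parts[1] * 60 + parseFloat(parts[2]);
}
```

For example, toTimestamp(8.124) returns "00:00:08.124", matching the first cue timing above.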

This file, which will be referenced by the <track> element, is nothing more than a simple text file with the .vtt extension in the file name. When the cues are displayed and are of kind "subtitles" or "captions," they appear in a black box over the bottom middle of the video viewport. Naturally, this raises an obvious question: can these cues be "jazzed up"? The answer is "Yes." Let's look at some ways of working with these cues.

Cue Styling

Cues in the WebVTT format can be styled using CSS, once they are available to a web page. Using the preceding example, you could use the ::cue pseudo-selector to:

·     style the first cue using

video::cue(#\31\ this\ is\ an\ identifier) { color: green; }

·     style the second cue using

video::cue(#transcription\-line) { color: red; }

·     style the third cue using

video::cue(#\33) { color: blue; }

·     style all cues using

video::cue { background-color: lime; }

Image Note  CSS allows less freedom in the selector strings than WebVTT, so you need to escape some characters in your identifiers, as shown in the preceding examples, to make this work. See the CSS specifications for more information.

The CSS properties that can be applied to the text in cues via ::cue include:

·     “color”

·     “opacity”

·     “visibility”

·     “text-decoration”

·     “text-shadow”

·     the properties corresponding to the “background” shorthand

·     the properties corresponding to the “outline” shorthand

·     the properties corresponding to the “font” shorthand, including

·        “line-height”

·        “white-space”

With ::cue() you can additionally style all the properties relating to the transition and animation features.

Cue Markup

Cues of the kind "metadata" can contain anything in the cue content. This includes things like JSON (JavaScript Object Notation), XML, or data URLs.

Cues of other kinds contain restricted cue text. Cue text contains plain text in UTF-8 plus a limited set of tags. The ampersand (&) and the less-than sign (<) have to be escaped because they represent the start of an escaped character sequence or a tag. The following escaped entities are used, just like they are in HTML: &amp; (&), &lt; (<), &gt; (>), &lrm; (left-to-right mark), &rlm; (right-to-left mark), and &nbsp; (non-breaking space). The left-to-right and right-to-left marks are non-printing characters that allow for changing the directionality of the text as part of internationalization and bidirectional text. This is critically important when marking up script in languages such as Hebrew or Arabic, which render words from right to left, or when marking up mixed-language text.
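When building cue text from arbitrary strings (say, user-contributed subtitles), those characters need to be escaped. A minimal helper of our own, covering only the characters named above:

```javascript
// Escape characters that are significant in WebVTT cue text.
function escapeCueText(text) {
  return text
    .replace(/&/g, "&amp;")   // must run first, before entities are added
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}
```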

Next, we’ll list the currently defined tags and give simple examples on how to address them with CSS from an HTML page once such a cue is included and displayed on a web page. The tags are:

·     Class span <c>: to mark up a section of text for styling, for example,

<c.myClass>Apply styling to this text</c>

This will allow using CSS selectors like the following:

::cue(.myClass) { font-size: 2em; }

You can use a .myClass class attribution on all tags.

·     Italics span <i>: to mark up a section of italicized text, for example,

<i>Apply italics to this text</i>

This also allows using CSS selectors like the following:

::cue(i) { color: green; }

·     Bold span <b>: to mark up a section of bold text, for example,

<b>Apply bold to this text</b>

This also allows using CSS selectors like the following:

::cue(b) { color: red; }

·     Underline span <u>: to mark up a section of underlined text, for example,

<u>Apply underlines to this text</u>

This also allows using CSS selectors like the following:

::cue(u) { color: blue; }

·     Ruby span <ruby>: to mark up a section of ruby annotations.

Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. Following is a markup example:

<ruby>漢字<rt>かんじ</rt></ruby>
This also allows using CSS selectors like the following:

::cue(ruby) { font-weight: bold; }
::cue(rt) { font-weight: normal; }

·     Voice span <v>: to mark up a section of text with a voice and speaker annotation, for example,

<v Fred>How are you?</v>

This also allows using CSS selectors like the following, once the cue is included in an HTML page:

::cue(v[voice="Fred"]) { font-style: italic; }

·     Language span <lang>: to mark up a section of text in a specific language, for example,

<lang de>Wie geht es Dir?</lang>

This also allows using CSS selectors like the following, once the cue is included in an HTML page:

::cue(lang[lang="de"]) { font-style: oblique; }
::cue(:lang(ru)) { color: lime; }

·     Timestamps <hh:mm:ss.mss>: to mark up a section of text with timestamps.

The beauty of timestamps is they give you the opportunity to style cues at precise points in time rather than accepting the white text on a black background we have used to this point in the chapter. An example of the use of a timestamp is shown in the following code block:

<00:01:00.000><c>Wie </c><00:01:00.200><c>geht </c><00:01:00.400><c>es </c><00:01:00.600><c>Dir? </c><00:01:00.800>

In this example the words “Wie,” “geht,” “es,” and “Dir” will appear onscreen at the times indicated.

This also allows using CSS selectors, as follows, once the cue is included in an HTML page:

::cue(:past) { color: lime; }
::cue(:future) { color: gray; }

You can use timestamps, for example, to mark up karaoke cues or for paint-on captions. What is a "paint-on" caption? Paint-on captions are individual words that are "painted on" the screen: the words that make up the caption appear one at a time, from left to right, and are usually verbatim.
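If you generate karaoke or paint-on cues from a word-timing list, the markup can be assembled mechanically. The following is a sketch of our own (including the <c> spans that current browsers require for styling between timestamps):

```javascript
// words: [{ text: "Wie", time: 60.0 }, ...]; endTime closes the last span.
function karaokeCueText(words, endTime) {
  // hh:mm:ss.mmm formatter for WebVTT timestamps
  function ts(seconds) {
    function pad(n, w) { return String(n).padStart(w, "0"); }
    var ms = Math.round(seconds * 1000);
    return pad(Math.floor(ms / 3600000), 2) + ":" +
           pad(Math.floor(ms / 60000) % 60, 2) + ":" +
           pad(Math.floor(ms / 1000) % 60, 2) + "." + pad(ms % 1000, 3);
  }
  return words.map(function (w) {
    return "<" + ts(w.time) + "><c>" + w.text + " </c>";
  }).join("") + "<" + ts(endTime) + ">";
}
```

Feeding it the word timings from the earlier German example reproduces the timestamped cue text shown above.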

Image Note  Until the relevant browser bugs are resolved, you need to use <c> tags to enclose the text between timestamps to make CSS selectors take effect.

The following is a rather interesting demonstration of the use of the tags and CSS, applied to a music video by Tay Zonday. This video, “Chocolate Rain,” went viral on YouTube a few years ago and is now licensed under Creative Commons.

We start with the WebVTT markup as shown in Listing 4-15a.

Listing 4-15a. WebVTT File for “Chocolate Rain”


WEBVTT

1 first cue
00:00:10.000 --> 00:00:21.710
<v Tay Zonday>Chocolate Rain</v>

00:00:12.210 --> 00:00:21.710
<b>Some </b><i>stay </i><u>dry </u>and others feel the pain

00:00:15.920 --> 00:00:21.170
<c.brown>Chocolate </c><u>Rain</u>

00:00:18.000 --> 00:00:21.170
<00:00:18.250><c>A </c><00:00:18.500><c>baby </c><00:00:19.000><c>born </c><00:00:19.250><c>will </c><00:00:19.500><c>die </c><00:00:19.750><c>before </c><00:00:20.500><c>the </c><00:00:20.750><c>sin</c>

and apply the appropriate CSS markup to the HTML in Listing 4-15b.

Listing 4-15b. The Cues Are Styled in the HTML page for “Chocolate Rain”

<style>
video::cue {color: lime;}
video::cue(#\31\ first\ cue) {background-color: blue;}
video::cue(v[voice="Tay Zonday"]) {color: red !important;}
video::cue(:past) {color: lime;}
video::cue(:future) {color: gray;}
video::cue(c.brown) {color:brown; background-color: white;}
</style>

<video poster="img/chocolate_rain.png" controls>
  <source src="video/chocolate_rain.mp4"  type="video/mp4">
  <source src="video/chocolate_rain.webm" type="video/webm">
  <track id="track1" label="English" src="tracks/chocolate_rain.vtt"
         kind="subtitles" srclang="en" default>
</video>

As you can see, the cues are created in the VTT document and styled in the HTML using inline CSS and a variety of styles from the cue markup. The styling could just as easily be contained in an external CSS style sheet. The result, as shown in Figure 4-19, is not exactly consistent between the browsers and is something you need to be aware of when undertaking a project of this sort.


Figure 4-19. Rendering of “Chocolate Rain” example in Google Chrome (left) and Safari (right)

Image Note  Google Chrome and Safari currently have the best styling support. Firefox doesn’t support the ::cue pseudo-selector and Internet Explorer already fails at parsing marked up cue text.

Cue Settings

Now that you know how to style cue content, let's deal with another question you may have: "Do they always have to be at the bottom of the screen?" The short answer is: "No." You can choose where to place them, and that is the subject of this, the final part of the section "WebVTT: Authoring Subtitles, Captions, Text Descriptions and Chapters."

This is accomplished through WebVTT "cue settings." These are directives added after the end time specification of the cue, on the same line, and consist of name-value pairs separated by a colon (:).

We’ll start with vertical cues.

Vertical Cues

Some languages render their script vertically rather than horizontally. This is especially true of many Asian languages. Mongolian, for example, is written vertically with lines added to the right. Most other vertical script is written with lines added to the left, such as traditional Chinese, Japanese, and Korean.

The WebVTT cue setting for vertical cues is as follows:


vertical:rl
vertical:lr

The first cue setting (vertical:rl) specifies vertical text growing right to left, and the second cue setting (vertical:lr) has the text growing left to right.

Listing 4-16 shows an example of Japanese text, and Figure 4-20 shows a rendering in Safari. Note how the <ruby> markup is not yet supported.

Listing 4-16. WebVTT File with Vertical Text Cues


As you can see in Figure 4-20, the vertical cues are added. There is one small problem: Chrome and Opera currently have rl and lr mixed up, and Firefox and Internet Explorer have yet to support vertical rendering. Only Safari gets this right.


Figure 4-20. Rendering of vertical text cues in Safari

Line Positioning

Cue lines are, by default, rendered in the bottom center of the video viewport. However, sometimes WebVTT authors will want to move the text to another position—for example, when burnt-in text is being shown on the video in that location or when most of the action is in that position, as is the case in a soccer match. In those situations you may decide to position the cues at the top of the video viewport or in any other position between the top and bottom of the viewport.

The top of the viewport is the right side for rl and the left side for lr vertical cues with the space between the left and right of the viewport being calculated in much the same manner as horizontal text.

A typical WebVTT cue setting for line positioning looks as follows:


line:0
line:-1

The first version specifies the first line at the top of the video viewport, with successive numbers continuing down from there (e.g., 4 is the fifth line from the top of the viewport). The second one specifies the first line at the bottom of the viewport, with decreasing numbers counting up from there (e.g., -5 is the fifth line from the bottom).

You can also specify percentage positioning from the top of the video viewport.


line:10%

If we assume the video is 720 pixels high, the caption would appear 72 pixels down from the top of the video viewport.

As you have seen, the line cue setting allows you three different ways of positioning a cue top and bottom: counting lines from the top, counting lines from the bottom, and percentage positioning from the top.

Cue Alignment

The text within a cue can be left, middle, or right aligned, within the cue box, through an align setting.


align:start
align:middle
align:end
align:left
align:right

The start and end settings cover the case where the alignment should follow the start/end of the text, independent of whether the text's directionality is left to right or right to left.

Text Positioning

Sometimes WebVTT authors will want to move the cue box away from the center position, for example, when the default position covers the speaker's face. In this case the cue should be moved to the left, to the right, or below the speaker.

The WebVTT cue setting for text positioning would be:


position:60%

This aligns a horizontal cue to a position that is 60% of the distance from the left edge of the video viewport.

Image Note  Be careful with text positioning because the final position of the cue is dependent on the alignment of the text. For example, if a cue is middle aligned, the origin point for the text position will be the center of the text block. For right-aligned text it will be the right edge of the block, and so on. If you get inconsistent results, the first place to look is the cue alignment property.

Cue Size

Being able to change the cue position in the viewport is a good feature, but there is also the risk that the caption may actually be too wide. This is where the size property—always expressed as a percentage—is useful. For example, positioning a cue on the left below a speaker would require you to also limit the cue width as shown here.

position:10% align:left size:40%
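When writing WebVTT files from script, the settings portion of the cue timing line can be assembled from an options object. A small sketch of our own (the option names simply mirror the setting names):

```javascript
// opts: { vertical, line, position, align, size } -- all optional.
// Returns a space-separated settings string, e.g. "position:10% align:left".
function cueSettings(opts) {
  var names = ["vertical", "line", "position", "align", "size"];
  return names
    .filter(function (name) { return opts[name] !== undefined; })
    .map(function (name) { return name + ":" + opts[name]; })
    .join(" ");
}
```

For example, cueSettings({ position: "10%", align: "left", size: "40%" }) produces the settings string just shown.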

To get a better understanding of the effect of all these cue settings, Listing 4-17 shows an example of cues that use the line, align, position, and size settings to accommodate changing cue positions and widths.

Listing 4-17. WebVTT File with Cues with Cue Settings


WEBVTT

1a
00:00:08.124 --> 00:00:10.742 line:0 position:10% align:left
Workstations

1b
00:00:08.124 --> 00:00:10.742 line:50% position:50%
and high end personal computers

1c
00:00:08.124 --> 00:00:10.742 align:right size:10% position:100%
have been able to

The first cue—1a—is rendered in the first line of the video viewport and left aligned at a 10% offset. The second cue—1b—is rendered right in the middle. The third cue—1c—is 10% of the viewport width, rendered right aligned at the right edge.

Figure 4-21 shows the result in Chrome (left) and Safari (right).


Figure 4-21. Rendering of cues with cue settings in Chrome and Safari

Chrome, Opera, and Firefox essentially render the cues in the same manner. Safari’s positioning has a somewhat different interpretation. IE doesn’t support any cue settings.

Other WebVTT Features

To this point we have outlined the most important WebVTT features. We’d also like to mention a few others before we finish.

·     Comments: you can author comments in a WebVTT file. A comment is basically a cue without an identifier or a timing line, whose text block starts with "NOTE."

·     Regions: this is a feature under discussion to allow more detailed positioning, a background color on the cue, and scrolling text (roll-up captions). It is not clear yet whether browsers will implement this part of the spec.

·     Nested cues: a track of @kind=’chapters’ allows definition of nested cues (i.e., cues that are fully contained in other cues). This is useful for a track, which distinguishes between chapters, sections, subsections, and so on, where each lower hierarchy is fully contained within the higher one. Thus, chapter tracks can be used for navigation at different resolutions, though it is difficult to imagine how that would be rendered in browsers.

To this point we have focused on a single video and shown you how to add transcripts, subtitles, captions, chapters, and text descriptions. As you have discovered, they are all key aspects in making video and audio accessible to various audiences. Still, we have all seen video on TV where, at a news conference, someone is just off to the side using sign language to translate what is being said to deaf users. Thus, there are instances where the video being streamed will also require the use of a separate video of a “signer.” This is where you will need to create a video with multiple synchronized audio and video tracks.

Multiple Audio and Video Tracks: Audio Descriptions and Sign Language Video

We have talked a lot about how to publish text alternatives for video, including transcripts, captions, subtitles, and text descriptions. However, vision-impaired video viewers are used to consuming audio descriptions with their videos, and many deaf users find it easier to read/watch sign language rather than text. Similarly, international users have become accustomed to dubbed audio tracks such as the clear audio tracks reviewed earlier. This presents us with a rather interesting challenge where a video doesn’t have just one video and one audio track but multiple video and audio tracks.

This challenge can be met in two ways. The first is to prepare separate audio and video files that are synchronized to each other. The second method is to produce a single multiplexed video file from which we retrieve the tracks relevant to the particular user. HTML5 provides both options. The first is supported via the MediaController API, the latter via multitrack media files.

Image Note  Media with multiple time-synchronized audio and video tracks is also common in professional video production where scenes are often recorded from multiple angles and with multiple microphones, or the director’s comments may be available.

Multitrack Media

When referencing a video containing multiple audio and video tracks in a <video> element, browsers only display one video track and render all enabled audio tracks. To get access to all the audio and video tracks that are associated with an <audio> or <video> element, the IDL of the HTMLMediaElement is extended with the following attributes:

interface HTMLMediaElement : HTMLElement {
  readonly attribute AudioTrackList audioTracks;
  readonly attribute VideoTrackList videoTracks;
};


The @audioTracks attribute of media elements is an AudioTrackList object that contains the list of audio tracks that are available for the media element together with their activation status.


The @videoTracks attribute of media elements is a VideoTrackList object that contains the list of video tracks that are available for the media element together with their activation status.

Audio and Video Tracks

The AudioTrackList object and the AudioTrack objects contained therein are defined as follows:

interface AudioTrackList : EventTarget {
   readonly attribute unsigned long length;
   getter AudioTrack (unsigned long index);
   AudioTrack? getTrackById(DOMString id);
            attribute EventHandler onchange;
            attribute EventHandler onaddtrack;
            attribute EventHandler onremovetrack;
};

interface AudioTrack {
   readonly attribute DOMString id;
   readonly attribute DOMString kind;
   readonly attribute DOMString label;
   readonly attribute DOMString language;
            attribute boolean enabled;
};
The VideoTrackList object and its VideoTrack objects are very similar:

interface VideoTrackList : EventTarget {
   readonly attribute unsigned long length;
   getter VideoTrack (unsigned long index);
   VideoTrack? getTrackById(DOMString id);
   readonly attribute long selectedIndex;
            attribute EventHandler onchange;
            attribute EventHandler onaddtrack;
            attribute EventHandler onremovetrack;
};

interface VideoTrack {
   readonly attribute DOMString id;
   readonly attribute DOMString kind;
   readonly attribute DOMString label;
   readonly attribute DOMString language;
            attribute boolean selected;
};
An AudioTrackList is a list of AudioTrack objects. The list’s length is provided by the @length attribute. Individual tracks can be accessed by their index number (e.g., track[i]), and the getTrackById() method allows retrieving an AudioTrack by providing its id string. In addition, a change event is raised whenever one or more tracks in the list have been enabled or disabled, an addtrack event is raised whenever a track has been added to the track list, and a removetrack event is raised whenever a track has been removed.

The VideoTrackList is identical, only applied to VideoTrack objects. It has one additional attribute: @selectedIndex, which specifies which track in the list is selected and rendered when used in a <video> element.

Both the AudioTrack and VideoTrack objects consist of the following attributes:

·     @id: an optional identifier string,

·     @kind: an optional category of the track,

·     @label: an optional human-readable string with a brief description of the content of the track,

·     @language: an optional IETF Language code according to BCP47 specifying the language used in the track, which could be a sign language code.

The AudioTrack object also has an @enabled attribute used to turn the audio track on or off. This, incidentally, fires an onchange event at the list containing the AudioTrack.

The VideoTrack object additionally has a @selected attribute through which a video track can be turned on. When the video track is turned on, it automatically turns off any other video tracks in the VideoTrackList and fires an onchange event at that list.
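Putting these attributes together, the following sketch shows how a script might locate a track of a given kind and language in a track list. The function only relies on @length, indexed access, and the per-track @kind and @language attributes, so it works for both AudioTrackList and VideoTrackList objects (the file layout and language codes in the commented usage are assumptions):

```javascript
// Find the index of the first track matching the given kind and,
// optionally, language. Works on an AudioTrackList/VideoTrackList
// as well as on any plain array of {kind, language} objects.
function findTrackIndex(trackList, kind, language) {
  for (var i = 0; i < trackList.length; i++) {
    var track = trackList[i];
    if (track.kind === kind &&
        (language === undefined || track.language === language)) {
      return i;
    }
  }
  return -1; // no matching track
}

// In a browser that supports multitrack media you might then enable
// the audio descriptions track and select the sign language video
// track (track kinds and language codes are hypothetical):
//
//   var a = findTrackIndex(video.audioTracks, "descriptions", "en");
//   if (a >= 0) video.audioTracks[a].enabled = true;
//   var v = findTrackIndex(video.videoTracks, "sign");
//   if (v >= 0) video.videoTracks[v].selected = true;
```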

The following @kind values are defined for audio tracks:

·     "main": the primary audio track,

·     "alternative": an alternative version to the main audio track (e.g., a clean audio version),

·     "descriptions": audio descriptions for the main video track,

·     "main-desc": the primary audio track mixed with audio descriptions,

·     "translation": a dubbed version of the main audio track,

·     "commentary": a director’s commentary on the main video and audio track.

The following @kind values are defined for video tracks:

·     "main": the primary video track,

·     "alternative": an alternative version to the main video track (e.g., a different angle),

·     "captions": the main video track with burnt-in captions,

·     "subtitles": the main video track with burnt-in subtitles,

·     "sign": a sign language interpretation of the main audio track,

·     "commentary": a director’s commentary on the main video and audio track.

Creating Multitrack Media Files

You can use MP4Box (part of the GPAC framework) to author multitrack MPEG-4 files. Following is an example:

$ MP4Box -new ElephantDreams.mux.mp4 -add ElephantDreams.mp4 \
         -add ElephantDreams.sasl.mp4 -add ElephantDreams.audesc.mp3

This command multiplexes ElephantDreams.sasl.mp4 and ElephantDreams.audesc.mp3 together with ElephantDreams.mp4 into a new file, ElephantDreams.mux.mp4, thus adding both a SASL (South African Sign Language) video track and an audio description track.

To check that it all worked, you can use

$ MP4Box -info ElephantDreams.mux.mp4

to confirm that the mux file has four tracks.

For WebM files, you would use mkvmerge (the MKVToolNix project also provides a GUI application). Following is an example:

$ mkvmerge -w -o ElephantDreams.mux.webm ElephantDreams.webm \
  ElephantDreams.sasl.webm ElephantDreams.audesc.ogg

This command adds the ElephantDreams.sasl.webm sign language file and the ElephantDreams.audesc.ogg audio description file to ElephantDreams.webm.

To check that it all worked you could use

$ mkvinfo ElephantDreams.mux.webm

which confirms the mux file has four tracks.

You can play back these files in VLC—it will show both video tracks and synchronize them. Unfortunately, VLC allows only one audio track to be active at one time, so you can only listen to the main audio track or the audio description track.

Now let’s put it all together in an example HTML file, shown in Listing 4-18.

Listing 4-18. Inspection of Multitrack Video Files

<video poster="img/ElephantDreams.png" controls width="50%">
  <source src="video/ElephantDreams.mux.webm" type="video/webm">
  <source src="video/ElephantDreams.mux.mp4"  type="video/mp4">
</video>
<h3>Attribute values:</h3>
<p id="values"></p>
<script>
  var video = document.getElementsByTagName("video")[0];
  var values = document.getElementById('values');

  function start() {
    if (video.videoTracks) {
      values.innerHTML += "videoTracks.length: "
                       + video.videoTracks.length + "<br/>";
      values.innerHTML += "audioTracks.length: "
                       + video.audioTracks.length;
    } else {
      values.innerHTML += "Browser does not support multitrack audio and video.";
    }
  }
  video.addEventListener("play", start, false);
</script>

We are trying to extract the content of the @videoTracks and @audioTracks attributes in Listing 4-18, so that we can manipulate which audio or video track is active. However, Figure 4-22 shows that we are out of luck: Safari only ever shows 0 video and audio tracks.


Figure 4-22. Rendering of @videoTracks and @audioTracks in Safari

Unfortunately, the other browsers are worse and don’t even support these attributes. For now, we suggest that you do not try to use multitrack media resources in HTML5. Browsers have largely decided that multitrack resources are not a good way to deal with multiple synchronized audio and video tracks, because they incur the cost of transmitting audio and video tracks of which only a small number is ever rendered to the user.

The preferred approach today is to use the new Media Source Extensions (MSE) to deliver multitrack media resources. With this approach, a manifest file is transmitted at the start of media playback, which describes which tracks are available for a resource. Then, only data from those tracks that are actually activated by the user is transmitted. Media Source Extensions are outside the scope of this book.

The HTML5 specification provides another approach to the synchronization of separate media files with each other, which we will explore next.

MediaController: Synchronizing Independent Media Elements

MediaController is an object that coordinates the playback of multiple media elements such as synchronizing a sign-language video to the main video. Every media element can be attached—or slaved—to a MediaController. When that happens, the MediaController modifies the playback rate and the volume of each of the media elements slaved to it, and ensures, when any of the media it controls stall, that the others are stopped at the same time. One other point to keep in mind is that when the MediaController is used, looping is disabled.

By default a media element has no MediaController. Thus a MediaController has to be created, either declaratively using the @mediagroup attribute or by explicitly setting the controller IDL attribute of the media element:

interface HTMLMediaElement : HTMLElement {
            attribute DOMString mediaGroup;
            attribute MediaController? controller;
};


The mediaGroup IDL attribute reflects the value of the @mediagroup content attribute. The @mediagroup attribute contains a string value. We can pick the name of the string at random—it just has to be the same between the media elements that we are trying to synchronize. All media elements that have a @mediagroup attribute with the same string value are slaved to the same MediaController.

Listing 4-19 shows an example of how all of this works.

Listing 4-19. Slaving a Main Video and a Sign Language Video Together

<video poster="img/ElephantDreams.png" controls width="50%" mediagroup="sync">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
</video>
<video poster="img/ElephantDreams.sasl" width="35%" mediagroup="sync">
  <source src="video/ElephantDreams.sasl.webm" type="video/webm">
  <source src="video/ElephantDreams.sasl.mp4"  type="video/mp4">
</video>
<h3>Attribute values:</h3>
<p id="values"></p>
<script>
  var video1 = document.getElementsByTagName("video")[0];
  var video2 = document.getElementsByTagName("video")[1];
  var values = document.getElementById('values');

  function start() {
    setTimeout(function() {
      values.innerHTML += "Video1: duration=" + video1.duration + "<br/>";
      values.innerHTML += "Video2: duration=" + video2.duration + "<br/>";
      values.innerHTML += "MediaGroup: " + video1.mediaGroup + "<br/>";
      values.innerHTML += "MediaController: duration="
                       + video1.controller.duration + "<br/>";
      values.innerHTML += "MediaController: paused="
                       + video1.controller.paused + "<br/>";
      values.innerHTML += "MediaController: muted="
                       + video1.controller.muted + "<br/>";
      values.innerHTML += "MediaController: currentTime="
                       + video1.controller.currentTime;
    }, 10000);
  }
  video1.addEventListener("play", start, false);
</script>

We’re synchronizing two video elements—one with ElephantDreams and one with a SASL signer for that same video—together using @mediagroup="sync". In the JavaScript, we let the videos play for 10 seconds and then display the value of the videos’ durations in comparison to their controller’s duration. You’ll notice in the rendering in Figure 4-23 that the controller’s duration is the maximum of its slaved media elements. We also print the controller’s paused, muted, and currentTime IDL attribute values.


Figure 4-23. Rendering of slaved media elements in Safari

Note that Safari is the only browser supporting the @mediagroup attribute and MediaController at this point in time.


The MediaController object contains the following attributes:

enum MediaControllerPlaybackState { "waiting", "playing", "ended" };

[Constructor]
interface MediaController : EventTarget {
   readonly attribute unsigned short readyState;
   readonly attribute TimeRanges buffered;
   readonly attribute TimeRanges seekable;
   readonly attribute unrestricted double duration;
            attribute double currentTime;
   readonly attribute boolean paused;
   readonly attribute MediaControllerPlaybackState playbackState;
   readonly attribute TimeRanges played;
   void pause();
   void unpause();
   void play();
            attribute double defaultPlaybackRate;
            attribute double playbackRate;
            attribute double volume;
            attribute boolean muted;
};

The states and attributes of the MediaController represent the accumulated states of its slaved media elements. The readyState and playbackState are the lowest values of all slaved media elements. The buffered, seekable, and played TimeRanges represent the intersection of the respective attributes of the slaved media elements. The duration is the maximum duration of all slaved media elements. The currentTime, paused, defaultPlaybackRate, playbackRate, volume, and muted values are imposed on all of the MediaController’s slaved media elements to keep them synchronized.
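To make the intersection rule concrete, the following sketch shows how the buffered ranges of two slaved elements could conceptually be combined. Ranges are represented here as plain arrays of [start, end] pairs; a real TimeRanges object would first have to be converted into this form, so this is an illustration of the rule rather than browser code:

```javascript
// Intersect two sorted, non-overlapping lists of [start, end] ranges,
// as a MediaController conceptually does for the buffered, seekable,
// and played ranges of its slaved media elements.
function intersectRanges(a, b) {
  var result = [];
  var i = 0, j = 0;
  while (i < a.length && j < b.length) {
    var start = Math.max(a[i][0], b[j][0]);
    var end = Math.min(a[i][1], b[j][1]);
    if (start < end) result.push([start, end]);
    // advance whichever range ends first
    if (a[i][1] < b[j][1]) i++; else j++;
  }
  return result;
}
```

For example, if one element has buffered [0, 10] and [20, 30] seconds while the other has buffered [5, 25], the controller would report [5, 10] and [20, 25] as buffered.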

A MediaController also fires the following events, which are somewhat similar to the ones found on the MediaElement:

·     emptied: raised when either all slaved media elements have ended or there are no longer any slaved media elements.

·     loadedmetadata: raised when all slaved media elements have reached at least the HAVE_METADATA readyState.

·     loadeddata: raised when all slaved media elements have reached at least the HAVE_CURRENT_DATA readyState.

·     canplay: raised when all slaved media elements have reached at least the HAVE_FUTURE_DATA readyState.

·     canplaythrough: raised when all slaved media elements have reached at least the HAVE_ENOUGH_DATA readyState.

·     playing: raised when all slaved media elements are newly playing.

·     ended: raised when all slaved media elements are newly ended.

·     waiting: raised when at least one slaved media element is newly waiting.

·     durationchange: raised when the duration of any slaved media element changes.

·     timeupdate: raised when the MediaController’s currentTime changes.

·     play: raised when the MediaController’s paused attribute changes to false.

·     pause: raised when all media elements move to paused.

·     ratechange: raised when the defaultPlaybackRate or playbackRate of the MediaController are newly changed.

·     volumechange: raised when the volume or muted attributes of the MediaController are newly changed.
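Since several of these events are defined in terms of the slaved elements’ readyState, a small helper that names the numeric readyState values (which the HTML spec defines as the constants 0 through 4) can make event-handler logging easier to read. This is just a convenience sketch:

```javascript
// Map a media element readyState number to its constant name,
// per the HTML spec: HAVE_NOTHING(0) through HAVE_ENOUGH_DATA(4).
function readyStateName(rs) {
  var names = ["HAVE_NOTHING", "HAVE_METADATA", "HAVE_CURRENT_DATA",
               "HAVE_FUTURE_DATA", "HAVE_ENOUGH_DATA"];
  return names[rs] || "UNKNOWN";
}

// e.g. inside a canplay handler on a controller:
//
//   controller.addEventListener("canplay", function() {
//     console.log("readyState: " + readyStateName(controller.readyState));
//   }, false);
```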

Listing 4-20 shows an example of a script-created MediaController.

Listing 4-20. Slaving a Main Video and an Audio Description Together Using MediaController

<video poster="img/ElephantDreams.png" controls width="50%">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
</video>
<h3>Attribute values:</h3>
<p id="values"></p>
<script>
var values = document.getElementById('values');
var video = document.getElementsByTagName("video")[0];
video.volume = 0.1;
var audio = new Audio();
if (audio.canPlayType('audio/mp3') == "maybe" ||
    audio.canPlayType('audio/mp3') == "probably") {
  audio.src = "video/ElephantDreams.audesc.mp3";
} else {
  audio.src = "video/ElephantDreams.audesc.ogg";
}
audio.volume = 1.0;

var controller = new MediaController();
video.controller = controller;
audio.controller = controller;

controller.addEventListener("timeupdate", function display() {
  if (controller.currentTime > 30) {
    // report once, then stop listening
    controller.removeEventListener("timeupdate", display, false);
    values.innerHTML += "MediaController: volume=" + controller.volume + "<br/>";
    values.innerHTML += "MediaController: audio.volume=" + audio.volume + "<br/>";
    values.innerHTML += "MediaController: video.volume=" + video.volume + "<br/>";
    values.innerHTML += "MediaController: currentTime="
                        + controller.currentTime + "<br/>";
    values.innerHTML += "MediaController: audio.currentTime="
                        + audio.currentTime + "<br/>";
    values.innerHTML += "MediaController: video.currentTime="
                        + video.currentTime;
  }
}, false);
</script>

The MediaController synchronizes an audio description with a main video, plays for about 30 seconds, and then displays some IDL attribute values.

Notice, in particular, how we have decided to set the volume of the audio and video objects before slaving them together. By doing this we accommodate the different recording volumes of the resources. Had the volume been set through the MediaController, the MediaController would have forced its volume onto all of its slaved elements.

Figure 4-24 shows the result.


Figure 4-24. MediaController object and its slaved elements in Safari

Notice the difference in playback position shown for all of the slaved audio and video media elements.

Since Safari is the only browser currently supporting @mediagroup and MediaController, you will have to use JavaScript to achieve the same functionality in the other browsers. Be careful when doing so, because keeping two videos in sync is not just a matter of starting their playback at the same time. They will decode at different rates and will eventually drift apart. Frequent resynchronization of their timelines is necessary.
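A JavaScript fallback could look roughly like the following sketch: on every timeupdate event of the master video, the slave element’s currentTime is snapped back in line whenever the drift exceeds a tolerance. The 0.3-second tolerance is an assumption you would tune for your content; the drift check is separated into a pure function so the policy is easy to test:

```javascript
// Returns true when slaveTime has drifted from masterTime by more
// than tolerance seconds and should be snapped back into sync.
function needsResync(masterTime, slaveTime, tolerance) {
  return Math.abs(masterTime - slaveTime) > tolerance;
}

// Browser wiring (master and slave are <video>/<audio> elements;
// the 0.3 s tolerance is a made-up starting point):
//
//   master.addEventListener("timeupdate", function() {
//     if (needsResync(master.currentTime, slave.currentTime, 0.3)) {
//       slave.currentTime = master.currentTime; // resynchronize
//     }
//   }, false);
```

Seeking the slave too often causes stutter, so the tolerance is a trade-off between tight synchronization and smooth playback.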

Navigation: Accessing Content

As described earlier, providing alt content alone is not sufficient to satisfy all accessibility needs.

A key challenge in creating video for vision-impaired users is making navigation within the video accessible. The media controls in browsers contain a timeline navigation bar that sighted users can click to jump directly to a time offset, avoiding having to watch and wait for a piece of interest to come up. The problem is that vision-impaired users cannot see the navigation bar.

In Chapter 2 we discussed the functionality of the default player interfaces and how browsers have made them keyboard accessible. Features like Opera’s CTRL-left/right arrow navigation by 1/10 of the video duration, or Firefox’s left/right arrow navigation in 10-second increments, give vision-impaired users a means to navigate more easily.

What is missing, though, is semantic navigation: the ability to jump directly to points of interest, particularly in long-form media files. Most content is structured. This book, with the chapters and sections that make up its structure, is an example of structured content. Similarly, long-form media files have a structure. For example, movies on DVD or Blu-ray come with chapters that allow direct access to meaningful time offsets. In fact, there are whole web sites dedicated to this subject.

We have already seen, earlier in this chapter, how the <track> element can expose text tracks of kind="chapters" and how we can author WebVTT files to provide those chapters. How can we make use of chapters and time offsets for semantic navigation that is also accessible to vision-impaired users?

Listing 4-21 provides an example that uses media fragment URIs to navigate chapters that have been provided via a WebVTT file.

Listing 4-21. Navigating Chapters Using Media Fragment URIs

<video poster="img/ElephantDreams.png" controls width="50%">
  <source src="video/ElephantDreams.webm" type="video/webm">
  <source src="video/ElephantDreams.mp4"  type="video/mp4">
  <track src="tracks/ElephantDreams_chapters_en.vtt" srclang="en"
         kind="chapters" default>
</video>
<h3>Navigate through the following chapters:</h3>
<ul id="chapters">
</ul>
<script>
var video = document.getElementsByTagName("video")[0];
var source;
var chapters = document.getElementById('chapters');

function showChapters() {
  source = video.currentSrc;
  var cues = video.textTracks[0].cues;
  for (var i=0; i<cues.length; i++) {
    var li = document.createElement("li");
    var link = document.createElement("a");
    link.href = "#t=" + cues[i].startTime + "," + cues[i].endTime;
    var cue = cues[i].getCueAsHTML();
    cue.textContent = parseInt(cues[i].startTime) + " sec : "
                    + cue.textContent;
    link.appendChild(cue);
    li.appendChild(link);
    chapters.appendChild(li);
  }
  video.removeEventListener("loadeddata", showChapters, false);
}
video.addEventListener("loadeddata", showChapters, false);

function updateFragment() {
  video.src = source + window.location.hash;
  video.load();
  video.play();
}
window.addEventListener("hashchange", updateFragment, false);
</script>

After the video has been loaded, we run the showChapters() function to go through the list of chapter cues in the .vtt file and add them to a <ul> list below the video. The list is using the start and end time of the cues to build media fragments: #t=[starttime],[endtime]. These media fragments are provided as URLs for each respective chapter: link.href = "#t=" + cues[i].startTime + "," + cues[i].endTime;.

As the link is being activated, the web page’s URL hash changes and activates the updateFragment() function, in which we change the URL to the video element to contain the media fragment: video.src = source + window.location.hash;. Then we reload the video and play it, which activates the change to the video URL and thus navigates the video.

The result can be seen in Figure 4-25 after navigating to the “Emo Creates” chapter.


Figure 4-25. Navigating chapters using media fragment URIs in Google Chrome

Since the anchor element <a> by definition provides keyboard focus for vision-impaired users, this will allow them to navigate the video’s chapters. Such direct semantic navigation is actually also really useful to other users, so it’s a win-win situation.
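If you need to read a time range back out of such a hash (e.g., to highlight the currently active chapter in the list), a small parser can help. This sketch handles only the simple #t=start,end form built in the listing above, not the full Media Fragments URI syntax:

```javascript
// Parse "#t=12.5,30" (or "?t=...") into {start, end}.
// Returns null when the hash is not a simple temporal fragment;
// end is null when only a start time is given, as in "#t=5".
function parseTimeFragment(hash) {
  var match = /^[#?]t=([\d.]+)(?:,([\d.]+))?$/.exec(hash);
  if (!match) return null;
  return {
    start: parseFloat(match[1]),
    end: match[2] !== undefined ? parseFloat(match[2]) : null
  };
}

// e.g. parseTimeFragment(window.location.hash) inside the
// hashchange handler of Listing 4-21.
```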


This has been a rather long chapter and we will bet you never considered the fact that there is so much to making media available to people who are either disabled or simply don’t speak your language.

The most important lesson in this chapter is that accessibility is not a simple topic, simply because there are so many differing accessibility needs. The easiest way to satisfy them is to provide textual transcripts. The problem is that transcripts provide the worst user experience and should only be used as a fallback mechanism.

The most important accessibility and internationalization needs for media are the following:

·     Captions or sign language video for hearing-impaired users,

·     Audio or text descriptions for vision-impaired users, and

·     Subtitles or dubbed audio tracks for international users.

You’ve seen how to use WebVTT for authoring text tracks that provide for all of the text-based accessibility needs. You’ve also seen how to use multitrack media or a MediaController to deal with sign language video, audio descriptions, or dubbed audio tracks.

As HTML5 matures and the browser manufacturers continue to keep pace, many of the limited features presented in this chapter will become commonplace.

Speaking of commonplace, smartphones and devices have rapidly moved from novelty to commonplace in a little less than five years. Along with this, HTML5 video has blossomed into an interactive and creative medium thanks to the HTML5 Canvas. It is the subject of the next chapter. We’ll see you there.