Project 4:

A Screen Reader for the Web

Due:  Thursday, October 5th  *in class*

In this project, you will build a screen reader for the web. Your screen reader will have basic functions, such as being able to read straight through the content of a page, and skip forward or backward between some element types.

The web makes this much easier than it would seem at first, but, as always, I encourage you to start this project early!

Resources

Videos:

Jeff Building Basic Functions of Screen Reader in Class

Code:

Code from Screen Reader functions Jeff Built In Class


Another Chrome Extension

The basis for this project will be a Chrome extension. To get started, copy the extension that you built last week into a new directory. Edit its manifest to give it a different name (I suggest something like “Screen Reader”), and then load it into the browser.

Reading the Contents of a Web Page

Reading the contents of a web page is very similar to reading the contents of a single element, which you worked on before. The main difference is that you’re not just reading one element, but multiple elements in sequence. At a high level, what you’ll do to make this work is figure out how to go through the contents of the page in linear order, and then just read them out as you go.

The Document Object Model (DOM) is a tree version of the web page. Reading the page means visiting all of the elements in the tree, deciding whether each one should be read out loud, and then doing that (or not). Technically, this is called traversing the tree.

The last decision to make is what order to visit the elements in the tree. We could do it randomly, but then the page contents would be presented randomly. In computer science, we talk about trees a lot because they’re a handy data structure for all sorts of purposes. HTML (or any variant of XML) is naturally converted into a tree structure, where tags are nested inside other tags. The diagram below shows one of these trees.

Consider the diagram above. On this page, you probably want to read the contents of the H1 first. To get to the text contents of the H1 tag, you have to go from the HTML tag, to the BODY tag, to the H1 tag, and all the way down to the text that is at the end (leaf).

After you read that, you’d go back up to the H1. Since there’s nothing else there, you’d go back to the BODY tag, find its next child (the DIV tag), visit its first child (the P tag), and then read the text contained within it. Then you’d go back up to the P, back up to the DIV, visit the DIV’s next child, which is the H2, and read the text within it. If you continue like this, you’ll end up reading the whole tree in a linear order that likely matches its presentation.

In technical terms, this is a depth-first traversal, which just means you keep drilling down before visiting other children of a node. This is in contrast to a breadth-first traversal, in which you would visit each child element before visiting any of those child elements’ children.

A subtle point is that you were technically doing a pre-order depth-first traversal, because you read whatever you were going to read for the node on the way down, not on the way back up. For instance, you read “heading level 1” before you read the heading’s contents.
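
The traversal described above can be sketched in a few lines. This is only a minimal sketch, not the code from class: visitPreOrder and its visit callback are names made up for illustration, and the sketch assumes only that each node has a childNodes list, as DOM nodes do.

```javascript
// Hypothetical helper: pre-order, depth-first walk over a DOM-like tree.
// Assumes each node exposes a `childNodes` list (real DOM nodes do).
function visitPreOrder(node, visit) {
  visit(node); // handle the node on the way down (pre-order)
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    // Drill all the way down this child before moving to its next sibling.
    visitPreOrder(children[i], visit);
  }
}
```

In a page, you might try it from the console with visitPreOrder(document.body, function (n) { console.log(n.nodeName); }) to see the order in which nodes would be visited.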

User Interaction

When the user presses ctrl+j, the extension should start reading from the top of the page. When they press ctrl+k, the extension should reverse direction. When the user presses the Escape key, the extension should go silent. You can look at this Stack Overflow article on detecting the control key.

Further, when the user presses ctrl+h the extension should jump to and read the next heading, when they press shift+ctrl+h, it should jump to and read the previous heading. If there is no heading in that direction, it should read “no more headings”.
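
One way to keep the key handling readable is to translate each keydown event into a command name first, and act on the command separately. The sketch below is an assumption about how you might organize this, not a required design: commandForKey and the command names are made up, while key, ctrlKey, and shiftKey are standard fields of a keyboard event.

```javascript
// Hypothetical mapping from a keydown event to a command name.
// Uses the standard KeyboardEvent fields `key`, `ctrlKey`, and `shiftKey`.
function commandForKey(event) {
  var key = String(event.key).toLowerCase(); // shift+ctrl+h reports "H"
  if (key === 'escape') {
    return 'STOP';
  }
  if (!event.ctrlKey) {
    return null; // ordinary typing is not a command
  }
  if (key === 'j') {
    return 'READ_FROM_TOP';
  }
  if (key === 'k') {
    return 'REVERSE';
  }
  if (key === 'h') {
    return event.shiftKey ? 'PREV_HEADING' : 'NEXT_HEADING';
  }
  return null;
}
```

You would call this from a keydown listener, for example document.addEventListener('keydown', function (e) { var cmd = commandForKey(e); /* act on cmd */ }).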

In addition, pressing the tab key should cause the extension to jump to the next “focusable” element, and shift-tab should cause the extension to jump to the previous “focusable” element. The easiest way to do this is probably to figure out which HTML tags are focusable, and look for those. For the purposes of this assignment, you do not need to detect elements that have an assigned tab index or that have otherwise been made focusable programmatically.
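
As a starting point, here is a rough sketch of deciding focusability from the tag name alone. The list of tags is an assumption and deliberately short (links with an href, buttons, and form controls); isFocusableByTag and FOCUSABLE_TAGS are made-up names, and the check ignores tab index, disabled controls, and hidden inputs.

```javascript
// Assumption: a deliberately short list of tags that browsers focus by default.
var FOCUSABLE_TAGS = ['BUTTON', 'INPUT', 'SELECT', 'TEXTAREA'];

// Hypothetical helper: true if `el` looks focusable based on its tag alone.
// Ignores tabindex, disabled controls, and hidden inputs.
function isFocusableByTag(el) {
  var tag = el.tagName.toUpperCase();
  if (tag === 'A') {
    // A link is focusable by default only when it has an href.
    return el.getAttribute('href') !== null;
  }
  return FOCUSABLE_TAGS.indexOf(tag) !== -1;
}
```

With jQuery, a selector such as $('a[href], button, input, select, textarea') would collect roughly the same set of elements in one step.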

Finally, as the user types into a form field (<input type="text"> or <textarea>), the extension should read aloud each character as it is typed.

You will want to think carefully about when to allow sounds to queue up in the synthesizer queue so that they will be eventually read, and when you should explicitly remove items from that queue. For instance, once a user has typed a second key, it probably doesn’t make sense to still read the last key. Once the user has moved on to the next heading, it probably doesn’t make sense to keep reading the last one.
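
When the newest sound should replace whatever is waiting, one option is to empty the queue right before speaking. The sketch below assumes synth is window.speechSynthesis (or an object with the same cancel() and speak() methods); speakInterrupting is a made-up name.

```javascript
// Hypothetical helper: drop anything still queued, then speak `utterance`.
// `synth` is expected to be window.speechSynthesis, or a stand-in with
// the same cancel() and speak() methods.
function speakInterrupting(synth, utterance) {
  synth.cancel();         // remove every queued utterance, including the current one
  synth.speak(utterance); // the new utterance is now the only one in the queue
}
```

For keystrokes, you might call speakInterrupting(speechSynthesis, new SpeechSynthesisUtterance(key)), while straight-through reading would keep using speak() alone so that sounds play in order.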

Finding the Next Element to Read

jQuery has some handy functions that allow you to move around the DOM. Some of these include .next() to go to the next sibling, .prev() to go to the prior sibling, and .parent() to go to the parent of the element.
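
To show how these moves combine, here is a sketch that finds the element following a given one in document order. It uses the plain DOM properties firstElementChild, nextElementSibling, and parentElement rather than jQuery so that it runs on its own; nextInDocumentOrder is a made-up name, and this is just one way to do it.

```javascript
// Hypothetical helper: the element that follows `el` in document order,
// or null when `el` is the last element on the page.
function nextInDocumentOrder(el) {
  if (el.firstElementChild) {
    return el.firstElementChild; // go down first
  }
  while (el) {
    if (el.nextElementSibling) {
      return el.nextElementSibling; // then across
    }
    el = el.parentElement; // otherwise climb up and try again
  }
  return null;
}
```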


Many elements don’t result in anything being read, e.g., <p>, <div>, etc.
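
One possible test for whether an element has anything of its own to read is to look for text directly inside it. This is a sketch under that assumption, not the check we used in class: hasOwnText is a made-up name, and it ignores things like image alt text that you will probably also want to handle.

```javascript
// Hypothetical check: does this element have text of its own to speak?
// Looks only at direct text-node children (nodeType 3), so an element
// that merely wraps other elements yields false.
function hasOwnText(el) {
  var kids = el.childNodes || [];
  for (var i = 0; i < kids.length; i++) {
    if (kids[i].nodeType === 3 && /\S/.test(kids[i].nodeValue)) {
      return true; // found non-whitespace text directly inside el
    }
  }
  return false;
}
```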

Video of Jeff demonstrating how to navigate through the DOM.

An alternative is to have jQuery do the heavy lifting, which is what I recommend you do. In this example, we’re able to find all the elements following (or preceding) a given element, based on this StackOverflow answer:

var start = '#start';
var all = 'p';
var index;

var afters = $(all).add(start).each(function (i) {
    if ($(this).is(start)) {
        index = i;
        return false; // quit looping early
    }
}).slice(index + 1);


One problem with this is that the performance is likely to be pretty poor on large web sites. To improve performance, you might only perform this expensive operation every once in a while, but then you run the risk of it not exactly matching the DOM of the live web page. What you have created in practice is an off-screen model! Sure, it’s easier to work with than the DOM, but that comes at the price of maybe not exactly matching the underlying content.
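
To make the trade-off concrete, here is a sketch of rebuilding the list only when it is older than some limit. Everything here is an assumption for illustration: makeCachedList, maxAgeMs, and the idea of using a time limit at all. In the extension, build would be something like function () { return $("*"); }.

```javascript
// Hypothetical cache: rebuild the element list only when the saved copy
// is older than `maxAgeMs`. `now` defaults to Date.now and can be
// replaced with a fake clock when testing.
function makeCachedList(build, maxAgeMs, now) {
  now = now || Date.now;
  var list = null;
  var builtAt = 0;
  return function () {
    if (list === null || now() - builtAt > maxAgeMs) {
      list = build();   // the expensive step
      builtAt = now();
    }
    return list;        // may be stale by up to maxAgeMs
  };
}
```

The larger you make maxAgeMs, the less often you pay for the rebuild, and the longer your off-screen model can disagree with the live page.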

Here is the code we developed in class for doing this:
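function readFromBeginning() {
  // Snapshot every element on the page, in document order.
  allelements = $("*");
  currentelem_index = 0;

  currentstate = "READING";

  findTheNextOne();
}


function findTheNextOne() {
  var currentelem;

  // Move forward until we reach an element that has something to speak.
  do {
    if (currentelem_index >= allelements.length) {
      return; // ran off the end of the page, so there is nothing left to read
    }
    currentelem = allelements[currentelem_index];
    currentelem_index++;
  } while (!doesItSpeak(currentelem));

  speakMe(currentelem);
}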


This approach works by first generating a list of all the elements in the page (the line with $("*")) and then just moving forward and backward in that array. So, if you want to find the next element that you should read, you just go forward in the array until you find an element that has something to speak. If you want to go backward, you just go backwards through that array until you find something that can speak.

So, for instance, if you save the result of $("*"), which is an array-like list of all the elements on the page in document order, to a variable, you can then go forward and back through the page by going forward or backward through that list.

var allelements = $("*")
undefined
allelements
(268) [html, head, meta, title, script, script, style, style, body.c19, div, ul, a, a, div.maincontent, p#h.hxvgegbtvmf.c7.title, span.c1.c18, p#h.wqy2hux8v33s.c7.title, span.c1, p.c5, span.c0, p.c16, span.c0, br, p.c16, span, span.c14, a.c11, span.c0, p.c5, span.c0, p.c16, span, span.c14, a.c11, span, p.c5, span.c0, p.c16, span, img, span, img, span, img, h2#h.kjx1b8tjwivx.c17, span.c18.c22, p.c16, span.c15, p.c5, span.c18.c6.c20, p.c5, span.c18.c20.c6, h3#h.yqyl8lch51a8.c9, span.c6, span.c10, p.c2, span.c0, p.c2.c8, span.c0, p.c2, span, span.c14, a.c11, span.c0, p.c2.c8, span.c0, p.c2, span, span.c14, a.c11, span, p.c5, span.c0, p.c5, span.c0, p.c5, span.c0, h3#h.obq06nuegrfa.c9, span.c6, span.c10, p.c2, span.c0, p.c2.c8, span.c0, p.c2, span.c14, a.c11, span.c0, p.c2.c8, span.c0, p.c2, span.c14, a.c11, span, p.c2.c8, span.c0, p.c2, span.c14, a.c11, span.c0, …]
allelements[4]
<script async src="https://www.google-analytics.com/analytics.js"></script>
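
The same idea works for jumping between headings: walk the saved list in one direction until something matches. The sketch below is only one way to write it; findFrom and isHeading are made-up names, and the list can be any array-like collection of elements, such as the result of $("*").

```javascript
// Hypothetical helper: starting just past index `from`, step through
// `list` in direction `step` (+1 or -1) and return the index of the
// first item that satisfies `test`, or -1 if there is none that way.
function findFrom(list, from, step, test) {
  for (var i = from + step; i >= 0 && i < list.length; i += step) {
    if (test(list[i])) {
      return i;
    }
  }
  return -1;
}

// True for H1 through H6 elements.
function isHeading(el) {
  return /^H[1-6]$/i.test(el.tagName);
}
```

A result of -1 from findFrom(allelements, i, 1, isHeading) is the cue to say “no more headings”.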

Keeping Track of State

While there are almost certainly more elegant ways to keep track of the state of the application, I think the easiest way is to keep track of state, like the current node being read and what the application is doing (reading forward or backward, etc.), in a couple of global variables.

That way you’ll know where you are in the page, and you’ll know what you should do next (if anything), once, say, an event fires to say that a sound has stopped playing.

In the code I developed, I decided there were probably five different states that the screen reader could be in:
var currentstate = "PAUSED";  // READING, READINGBACKWARD, READONEFORWARD, READONEBACK

There’s nothing special about this variable or setting it. It’s just a way for you to keep track of what the screen reader should be doing right now, and to coordinate the various components in your code so they’re doing the right thing given that state.
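
As an illustration of how the state can coordinate your code, here is a sketch of a function that decides how far to move once a sound finishes. It rests on an assumption about what the state names mean, namely that READONEFORWARD and READONEBACK read a single element and then stop; stepAfterSpeech is a made-up name.

```javascript
// Hypothetical helper: given the current state, how far should the
// position move once an utterance ends? 0 means stop reading.
// Assumption: the READONE... states read a single element, then stop.
function stepAfterSpeech(state) {
  switch (state) {
    case 'READING':
      return 1;  // keep going forward
    case 'READINGBACKWARD':
      return -1; // keep going backward
    default:
      return 0;  // PAUSED, READONEFORWARD, READONEBACK
  }
}
```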

When to Continue Reading

One of the challenges of this assignment is that you’ll need to write your program so it can operate asynchronously. That is, you’ll do things like wait for a keydown event, and then start reading. Your extension will read one element’s contents, and you’ll only start reading the next once it’s done.

To do that, you can use the same basic structure you used in the last assignment to generate speech, but you’ll call it slightly differently in order to set an event for when the speech is done playing. When it’s done, you’ll need to tell it to find and then play the next element’s sound.

var u = new SpeechSynthesisUtterance();

u.onend = function(event) {

  alert('Finished in ' + event.elapsedTime + ' seconds.');
}

speechSynthesis.speak(u);
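
Building on the onend handler above, the following sketch reads a list of strings one after another, starting each only when the previous one has finished. The names are made up, and the two extra parameters are assumptions that keep the sketch self-contained: synth stands in for window.speechSynthesis, and makeUtterance for function (t) { return new SpeechSynthesisUtterance(t); }.

```javascript
// Hypothetical sketch: speak each string in `texts` in order, starting
// the next one from the previous utterance's onend handler.
function speakInSequence(texts, synth, makeUtterance) {
  var i = 0;

  function speakNext() {
    if (i >= texts.length) {
      return; // nothing left to read
    }
    var u = makeUtterance(texts[i]);
    i++;
    u.onend = speakNext; // continue only after this one finishes
    synth.speak(u);
  }

  speakNext();
}
```

In your extension, the step inside speakNext would instead find the next element to read, based on the current state.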

Grading

The main components of our grading will be:

Using the appropriate keyboard commands, does the following happen:
1.  Reads forward through the page

2.  Reads backward through the page

3.  Jumps forward by heading

4.  Jumps backward by heading

5.  Jumps forward by focusable element

6.  Jumps backward by focusable element

7.  Reads characters as they are typed into form fields

8.  Clears the queue of sounds when appropriate so they don’t “back up”


This page and contents are copyright Jeffrey P. Bigham except where noted.