Programming Approaches for Speech-Enabled Applications

You can add a speech interface to almost any existing application. Here's how to start.


By Robert Delwood

Lead API documentation writer

With the advent and proliferation of more powerful computers, speech technology has become more affordable and accessible. No longer restricted to specialized or esoteric applications, speech technology can now reach business and home users in mainstream applications such as word processors, spreadsheets, e-mail packages, and games. However, its introduction as an interface requires consideration during the feature-design and software-development stages.

Each input device has its strengths and weaknesses, and those characteristics need to be optimized. Take the case of the keyboard and mouse. The keyboard predates the mouse and can still perform many of the same functions, however awkwardly. The mouse is more adept at tasks such as pointing, selecting, and dragging, but is inefficient for textual input. For a while the two devices competed for the same roles. Over the last two decades they have evolved together, and application designers have used each device's unique characteristics to make better products.

The example of the mouse versus the keyboard is appropriate here. As a new technology, speech recognition has to find its niche and its role in the user interface. Speech technology is designed for voice input and output and does both very well, and when appropriate, designers are encouraged to use speech in their applications. Clearly, some uses for speech are better, or at least more obvious, than others. Word processors and e-mail applications can readily take advantage of both dictation and text-to-speech capabilities. Games may be better suited to speech recognition for command and control features. In contrast, Web browsers require additional design consideration if speech enabling is contemplated. For instance, Web pages have fields where the user can enter information, but the fields are often arranged in a layout that is visually pleasing rather than systematic. Pages usually have a URL line, but they can also have search boxes, comment areas, forms, and check boxes, as well as links. Deciding how the user assigns speech to a specific box or area can be awkward. Likewise, reading information back from a Web page can be equally awkward for the same reasons. For interfaces to be successful, designers must present and use them in a consistent and straightforward way.

Speech Input Modes

There are two basic speech input modes. The first is command and control. This uses speech to issue commands. Typically, the commands are brief sentences or phrases. A good example is using spoken commands as a shortcut to the menu and menu items. The user needs only to say "file open," for instance, to access the open file dialog box. Command and control is the simpler of the two speech input modes in terms of programming. It is simpler, in part, because the range of recognizable words is limited. Using the menu example, the word list may be only as long as the number of menu items the user is allowed to speak. Existing applications may be retrofitted for command and control in a few lines of code and perhaps the addition of a word list.
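
To give a sense of how little code this can take, here is a minimal C++ sketch of a SAPI 5 command and control loop. The grammar file name commands.xml is a placeholder for the application's word list, and error handling is omitted for brevity.

    // Minimal SAPI 5 command and control loop (C++ with ATL smart pointers).
    // "commands.xml" is a hypothetical grammar file holding the word list.
    #include <stdio.h>
    #include <atlbase.h>
    #include <sapi.h>
    #include <sphelper.h>

    int wmain()
    {
        ::CoInitialize(NULL);
        {
            CComPtr<ISpRecognizer>  cpRecognizer;
            CComPtr<ISpRecoContext> cpContext;
            CComPtr<ISpRecoGrammar> cpGrammar;

            // Use the shared desktop recognizer.
            cpRecognizer.CoCreateInstance(CLSID_SpSharedRecognizer);
            cpRecognizer->CreateRecoContext(&cpContext);

            // Ask to be notified only when a phrase is recognized.
            cpContext->SetNotifyWin32Event();
            cpContext->SetInterest(SPFEI(SPEI_RECOGNITION), SPFEI(SPEI_RECOGNITION));

            // Load the command word list and activate its top-level rules.
            cpContext->CreateGrammar(1, &cpGrammar);
            cpGrammar->LoadCmdFromFile(L"commands.xml", SPLO_STATIC);
            cpGrammar->SetRuleState(NULL, NULL, SPRS_ACTIVE);

            while (cpContext->WaitForNotifyEvent(INFINITE) == S_OK)
            {
                CSpEvent event;
                while (event.GetFrom(cpContext) == S_OK)
                {
                    if (event.eEventId == SPEI_RECOGNITION)
                    {
                        // Retrieve the recognized phrase as text.
                        CSpDynamicString dstrText;
                        event.RecoResult()->GetText(SP_GETWHOLEPHRASE,
                            SP_GETWHOLEPHRASE, TRUE, &dstrText, NULL);
                        wprintf(L"Command heard: %s\n", (WCHAR *)dstrText);
                    }
                }
            }
        }
        ::CoUninitialize();
        return 0;
    }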

Dictation is the second speech input mode. In this mode, users speak freely to the computer, which translates the speech into text. In contrast to command and control, the word list is, by definition, greatly expanded to the size of a dictionary. Dictation makes no attempt to recognize the words as commands. Programming for dictation is more complex than for command and control, and this applies not only to new applications but to retrofitting existing ones as well.
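
Continuing the sketch above, switching from command and control to dictation changes only the grammar setup; the rest of the event loop stays the same. Again, this is illustrative rather than production code.

    // Dictation setup: replaces the LoadCmdFromFile and SetRuleState
    // calls in the previous sketch.
    cpContext->CreateGrammar(0, &cpGrammar);
    cpGrammar->LoadDictation(NULL, SPLO_STATIC);   // NULL selects the default dictation topic
    cpGrammar->SetDictationState(SPRS_ACTIVE);     // begin translating speech to text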

Dictation becomes practical in several situations, and it approaches the ideal of a speech system. Users may have a large amount of text to add in one session, dictating letters or even books; recognition accuracy improves with user experience and voice training with the microphone. Users who are not proficient typists may find dictation an efficient way to enter information. Alternatively, users may not be able to operate a keyboard or a mouse because of physical limitations.

It is possible to combine some of the options above. As an example, a page layout or CAD application depends on the mouse to create a box and place it correctly and accurately within the design. The software team may decide to add a speech feature to access a dialog box that controls the dimensions of a box. This would be a command and control function, since users would access specific menu items and would use only a few words to do so. The user would speak the command, "dimensions box," then the numeric dimensions of the object, and then say "okay" to accept the box. In this way, speech complements the mouse: the user accomplishes one task (placing and sizing a box) without having to interrupt mouse positioning. The entire operation is completed more quickly with speech, and users do not have to move the mouse or reposition their hands. This combination also keeps a consistent user interface. The user is simply accessing the application's existing menu items and using speech as a shortcut to them. Adding speech does not introduce any new or hidden features, and the user may still perform the task manually. By combining input methods, users can concentrate on their task because they spend less time and effort on the mechanics of the change itself. The application requires only minor code changes to accommodate speech.
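
A SAPI 5 XML grammar for this interaction might look something like the following sketch. The rule names, digit list, and property values are invented for illustration, not taken from any shipping application.

    <!-- Hypothetical SAPI 5 command grammar for the "dimensions box" example. -->
    <GRAMMAR LANGID="409">
      <!-- Opens the dimensions dialog box. -->
      <RULE NAME="DimensionsBox" TOPLEVEL="ACTIVE">
        <P>dimensions box</P>
      </RULE>
      <!-- Spoken digits for the numeric dimensions. -->
      <RULE NAME="Digit" TOPLEVEL="ACTIVE">
        <L PROPNAME="Digit">
          <P VAL="1">one</P> <P VAL="2">two</P> <P VAL="3">three</P>
          <P VAL="4">four</P> <P VAL="5">five</P> <P VAL="6">six</P>
          <P VAL="7">seven</P> <P VAL="8">eight</P> <P VAL="9">nine</P>
          <P VAL="0">zero</P>
        </L>
      </RULE>
      <!-- Accepts the box. -->
      <RULE NAME="Accept" TOPLEVEL="ACTIVE">
        <P>okay</P>
      </RULE>
    </GRAMMAR>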

Using Speech Interfaces Effectively

For desktop systems, it is important to remember that some tasks are easier with speech and others are not. Early speech applications tried replacing the interface entirely, or at least a large portion of it, with a voice system. Many of these systems failed because they were too complex or counterintuitive. Recent applications have been more successful using speech to complement existing interfaces.

Pick the level of speech interaction that is right for your project. Speech-enabled applications fall along a spectrum: conventional, keyboard-only interaction is at one end, and science-fiction-level interaction is at the other. That is, the captain of the spaceship has only to speak the command, and the computer interprets it instantly and with perfect accuracy, without regard to other meanings, confirmation, inflection, accent, or background noise. Perhaps that is the goal of speech recognition in general, but in designing applications, consider the level of involvement your user needs. For desktop applications, the keyboard is still an inherent part of the computer system. Asking users to enter information from the keyboard is not a new concept; in fact, it is the paradigm speech designers must compete against. Therefore, it may be acceptable to let users enter some text and use speech in supplementary roles such as command and control or navigation. Further along the spectrum, it may be better to reverse the roles and make speech the primary input method, reserving the keyboard or mouse to supplement voice operations: if a word is not readily recognized, the user can type the correct word. At the far end of the spectrum, voice is the only practical input method. Speech applications intended for automobiles cannot rely on the driver to push a button, and smart phone Internet devices with no keyboard must rely exclusively on speech for all aspects of their operation.

Speech often works best when it is integrated with other user interface methods; it can complement other input methods rather than compete against them. For example, action games require quick responses, and moving a hand from a joystick or keyboard is often detrimental. Here, voice commands are useful for some options, such as firing weapons, while the keyboard remains available for other operations. In other applications, such as spreadsheets or chat rooms, speech might let users enter textual information quickly while they use the keyboard to navigate through the document or application.

Do not force a fit. If the proposed use of speech is not appropriate, rethink the approach. The user experience requires a logical and intuitive interface. Making a task more complex just to accommodate speech, or using speech in cases where it just does not make sense, is bound to confuse users and detract from the application. This includes forcing a voice equivalent for other input methods. Unless there is a compelling reason, leave out awkward voice interfaces.

Use speech to simplify complex or tedious sequences, not to complicate them. Today, applications must break tasks into separate steps, generally restricting each entry to one fact or piece of information. For example, to order airplane tickets, Web sites have separate boxes for the departure and arrival cities, date, time, flight, airline, and so on. A natural language approach lets users speak a sentence and the application parse out the information. In the travel example, a user can say, "I'd like to book a flight from Seattle to Boston at one p.m. on the fourth and come back in the morning of the fifteenth." One sentence conveys all the information.
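
One possible way to approach this with SAPI 5 is to tag the pieces of the sentence with semantic properties in the grammar, which the application then reads from the recognition result rather than parsing the raw text. The fragment below is a sketch only: the city list and property names are invented, and a real grammar would also cover dates and times.

    <!-- Hypothetical grammar fragment tagging parts of a booking sentence. -->
    <GRAMMAR LANGID="409">
      <RULE NAME="City">
        <L>
          <P>Seattle</P>
          <P>Boston</P>
          <P>Chicago</P>
        </L>
      </RULE>
      <RULE NAME="BookFlight" TOPLEVEL="ACTIVE">
        <O>I would like to</O>
        <P>book a flight from</P>
        <RULEREF NAME="City" PROPNAME="FromCity"/>
        <P>to</P>
        <RULEREF NAME="City" PROPNAME="ToCity"/>
      </RULE>
    </GRAMMAR>

The application would then read the FromCity and ToCity properties from the property list attached to the recognition result.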

Speech can also take advantage of information the user intrinsically knows about but that is not presently on the screen. In many cases, the screen represents only a small part of the overall information. For example, Web pages usually have material off the screen; the user knows a "submit" button is available even though it is not visible. A speech-enabled application may allow users to say "submit" rather than having to scroll down the page to the actual button and click it. A mapping application may let users say a city name to center it on the screen, avoiding the time-consuming (and often disorienting) process of manually scrolling around a potentially large area.

If the user does not or cannot use a mouse or keyboard, speech may be the most effective option available. Visually impaired users may not be able to see the screen to scroll, for example. Disabled users may not be able to manually operate a mouse or keyboard. In both cases (and certainly these are not the only ones) speech may be the best, if not the only, method to operate a computer.

Consider the user's environment. For speech recognition to work accurately, the environment must be suitable; a relatively quiet one, such as a business office, is optimal. SAPI 5.0 recognizes background noise and filters it out. Even occasional loud noises will not significantly change the accuracy, although frequent noises will slow the processing rate. Therefore, a perfectly quiet location yields only marginally better recognition results than a normal business office. By contrast, a speech-enabled application in an airport or factory may yield inferior results. Also, since the user will be speaking aloud, there is an issue of privacy: the user may disturb others nearby, or the information being spoken may be confidential.

Adding Speech to Applications

Adding speech to applications is not a difficult task. As mentioned earlier, many applications may be retrofitted for speech; that is, speech may be added to existing packages. These changes need not be extensive, and in some cases they require no modifications to existing code. In general, there are three approaches to adding speech: without code changes, with code changes, and from the ground up.

The least intrusive method is without code changes: legacy software incorporates speech without having to change any of its code. This approach takes advantage of external hooks already present in the software, usually intended for COM, automation, or keyboard interfaces. However, an external application or executable is needed; it has the responsibility of handling speech and exposing the features in the appropriate format for the hooks. As an example, the Microsoft SAPI 5.0 SDK demonstrates how to add speech to Age of Empires II (AoE II). The AoE II program itself is never modified; expecting users to run a patch would be prohibitive. Rather, the demonstration uses a separate executable, AOESAPI.exe. After handling all the speech and recognition, it sends output to AoE II with the Win32 call SendInput(), simulating a keystroke. In this way, developers can create speech interfaces for many games and other applications that use the keyboard.
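
The keystroke half of that bridge is small. The following sketch shows the general SendInput() pattern; the key mapping is invented here for illustration, not taken from AOESAPI.exe.

    // Convert a recognized voice command into a simulated keystroke,
    // the same Win32 mechanism the AoE II demonstration uses.
    #include <windows.h>

    void SendKeystroke(WORD virtualKey)
    {
        INPUT input[2] = {};

        input[0].type   = INPUT_KEYBOARD;        // key down
        input[0].ki.wVk = virtualKey;

        input[1].type   = INPUT_KEYBOARD;        // key up
        input[1].ki.wVk = virtualKey;
        input[1].ki.dwFlags = KEYEVENTF_KEYUP;

        SendInput(2, input, sizeof(INPUT));
    }

    // For example, after recognizing the phrase "attack":
    //     SendKeystroke('A');   // 'A' is an invented hotkey mapping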

Existing applications may also be modified directly to accept speech. This requires changes to the application's code base and is therefore more complex. Before doing this, look for existing commands where speech adds value. This may be as direct as adding speech commands that access menus and menu items, which adds only a small amount of code. The interface also remains the same and does not risk confusing the user.
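
As a sketch of this approach, a Win32 application that already routes menu selections through WM_COMMAND could map each recognized phrase to the matching menu identifier. The rule name and command ID below are hypothetical.

    // Map a recognized grammar rule name onto the existing menu plumbing.
    // The rule name "FileOpen" and the command ID are placeholders.
    #include <windows.h>
    #include <string>

    const UINT ID_FILE_OPEN = 40001;   // placeholder menu resource ID

    void DispatchVoiceCommand(HWND hwnd, const std::wstring &ruleName)
    {
        if (ruleName == L"FileOpen")
        {
            // Identical to the user clicking File > Open.
            ::PostMessageW(hwnd, WM_COMMAND, MAKEWPARAM(ID_FILE_OPEN, 0), 0);
        }
    }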

Finally, applications can be created from the ground up. This is the most radical approach but also the most effective for incorporating the newest speech technology. Here, designers attempt applications that are radically, or at least vastly, different from those currently available. For existing paradigms (word processors, for example), designers may want to incorporate speech in ways so fundamental or integral that modifying existing code is not an option. New kinds of applications, including voice telephony, smart phone Web browsers, handheld computers, and other new devices, will also require a ground-up approach.