Devices incorporating some form of speech recognition software continue to creep their way into more and more aspects of our lives. From corporate call lines to ATMs, cars, and smartphones, this technology continues to grow and improve. However, while speech recognition devices have come a long way since their advent in the early 1950s, many users remain frustrated with their inaccuracies. [1] Even Apple's Siri seems to frustrate more users than it pleases. So what is it that makes speech recognition so difficult for computers, and how does it work in the first place? Unsurprisingly, speech recognition is an incredibly complicated process, and for this reason only a few of its fundamentals and challenges will be touched on in this paper.
The speech recognition process begins with the spoken sound waves being converted to electrical signals via a microphone. An analog-to-digital converter then digitizes the analog signal into a series of 1s and 0s. This stream of data is then filtered to reject the irrelevant components of the signal and normalized to account for variations in volume. The signal is then divided into multiple signals, each containing a specific frequency band known to carry information of interest. These frequency bands are then compared with stored templates in an attempt to find a match to a specific phoneme (i.e. an elemental unit of spoken words). Once a match is found, the difficult process of word assembly begins.
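To make this front end more concrete, the sketch below digitizes a synthetic signal, normalizes it for volume, and splits it into a few frequency bands whose energies could then be compared against stored phoneme templates. The band edges, frame length, and test signal are illustrative assumptions, not the values used by any particular recognizer.

```python
# A minimal sketch of the front end described above (illustrative values only).
import numpy as np
from scipy.signal import butter, lfilter

fs = 16000                                  # sample rate after A/D conversion (Hz)
t = np.arange(0, 0.5, 1 / fs)
signal = 0.3 * np.sin(2 * np.pi * 300 * t) + 0.1 * np.sin(2 * np.pi * 2200 * t)

# Normalize to account for variations in volume.
signal = signal / np.max(np.abs(signal))

# Divide the signal into a few frequency bands of interest (edges are made up).
bands = [(100, 900), (900, 2500), (2500, 4000)]   # Hz
band_energies = []
for low, high in bands:
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = lfilter(b, a, signal)
    band_energies.append(np.sum(filtered ** 2))

print(band_energies)   # relative energy per band: raw material for template matching
```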
The dominant approach used in today's speech recognition software involves comparing the spoken words or portions of words with elaborate statistical models developed from massive amounts of previously recorded speech data. One of the most common models used today is the Hidden Markov Model. [1] In this model, a word or phrase is represented by a chain with many branches. Each link in the chain is a phoneme that is assigned a probability based on the stored speech database and/or user training exercises. The software determines the path through the chain (i.e. what words were spoken) by comparing the stream of digitized phonemes with the statistical model. [1]
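A toy illustration of this path-finding step is sketched below using the Viterbi algorithm, the standard way of picking the most likely route through a Hidden Markov Model. The phoneme labels, probabilities, and observation symbols are invented for illustration; a real recognizer would learn them from its speech database.

```python
# Toy Viterbi decode over a hypothetical phoneme chain for the word "cat".
import numpy as np

states = ["k", "ae", "t"]                     # hypothetical phonemes
start_p = np.array([0.8, 0.1, 0.1])
trans_p = np.array([[0.3, 0.6, 0.1],          # P(next phoneme | current phoneme)
                    [0.1, 0.3, 0.6],
                    [0.1, 0.1, 0.8]])
emit_p = np.array([[0.7, 0.2, 0.1],           # P(acoustic symbol | phoneme)
                   [0.1, 0.7, 0.2],
                   [0.1, 0.2, 0.7]])

obs = [0, 1, 1, 2]                            # stream of digitized acoustic symbols

# Forward pass: keep the best score and back-pointer for each phoneme.
v = start_p * emit_p[:, obs[0]]
back = []
for o in obs[1:]:
    scores = v[:, None] * trans_p * emit_p[None, :, o]
    back.append(np.argmax(scores, axis=0))
    v = np.max(scores, axis=0)

# Trace back the most probable path through the chain.
path = [int(np.argmax(v))]
for ptr in reversed(back):
    path.insert(0, int(ptr[path[0]]))
print([states[i] for i in path])              # -> ['k', 'ae', 'ae', 't']
```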
While the process described above may sound "simple enough," many serious obstacles must be overcome along the way. A few of these obstacles are described below.
It is no secret that in spoken language, much of what is communicated is never actually said. It therefore follows that accurately transcribing a spoken phrase requires much more knowledge than could ever be extracted from the acoustic waves alone. Even possessing an all-encompassing dictionary of the language in use remains insufficient for a computer to truly understand what words are contained in an audio signal. So how do computers do it? The answer lies in the context of the spoken word. Without context, a computer cannot solve the homonym problem (i.e. determine whether the speaker meant "there" or "their"). While this problem has been solved in modern systems, it remains a remarkable feat, as it requires an understanding of how every word in a sentence can possibly relate to the other words in that sentence. [1]
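One simple way context can break the "there"/"their" tie is with word-pair statistics gathered from text. The sketch below uses invented bigram counts purely for illustration; a real system would estimate such statistics from a very large corpus and combine them with the acoustic scores.

```python
# Toy homonym disambiguation: the acoustics are identical, so neighbors decide.
bigram_counts = {
    ("over", "there"): 120,   # invented counts of word pairs in training text
    ("over", "their"): 2,
    ("their", "car"): 95,
    ("there", "car"): 1,
}

def context_score(prev_word, candidate, next_word):
    """Score a candidate transcription by how well it fits its neighbors."""
    left = bigram_counts.get((prev_word, candidate), 0)
    right = bigram_counts.get((candidate, next_word), 0)
    return (left + 1) * (right + 1)   # add-one smoothing keeps unseen pairs nonzero

# "washed ___ car": the word to the right makes "their" far more likely.
for candidate in ("there", "their"):
    print(candidate, context_score("washed", candidate, "car"))
```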
One of the most daunting problems facing speech recognition software engineers is the fact that everyone says words a little bit differently. In fact, significant differences in a single spoken word exist even from one utterance to the next for a given speaker. Even after transforming recordings of spoken words into the ever so powerful frequency domain, this "noise" persists. While this may not be much of a surprise, quantifying these variations is quite a challenge. However, if our auditory system has no trouble overcoming these differences, then surely some similarities must persist in these signals. Identifying these similarities, known as invariants, is one key to success for any speech recognition system. While the frequency spectrum of a spoken word is not unique, certain resonant frequencies do always appear for a given phoneme. These frequencies are often short-lived, lasting only tens of milliseconds; however, their presence and their relationship to subsequent "notable" frequencies can be used to assemble this auditory jigsaw puzzle into words and phrases. [1]
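The sketch below gives a rough sense of hunting for these short-lived "notable" frequencies: the signal is sliced into roughly 25 ms frames and the strongest spectral peaks in each frame are reported. The synthetic vowel-like signal, added noise, and peak-picking threshold are assumptions made for illustration, not the method of any real recognizer.

```python
# Minimal sketch: find dominant spectral peaks (resonance candidates) per frame.
import numpy as np
from scipy.signal import find_peaks

fs = 16000
t = np.arange(0, 0.3, 1 / fs)
# Crude stand-in for a vowel: energy concentrated near two resonances.
signal = np.sin(2 * np.pi * 700 * t) + 0.6 * np.sin(2 * np.pi * 1200 * t)
signal += 0.05 * np.random.randn(len(t))          # speaker-to-speaker "noise"

frame_len = int(0.025 * fs)                       # ~25 ms frames
for start in range(0, len(signal) - frame_len, frame_len):
    frame = signal[start:start + frame_len] * np.hamming(frame_len)
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(frame_len, 1 / fs)
    peaks, _ = find_peaks(spectrum, height=0.3 * spectrum.max())
    print(np.round(freqs[peaks]))                 # the resonances persist frame to frame
```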
Speech recognition software has become an increasingly popular feature in a variety of devices ranging from cars to smartphones. This software relies on a variety of complicated signal processing techniques and elaborate statistical models to convert sounds into words. While these statistical models are not perfectly accurate and are criticized heavily by some users, the success rate of speech recognition software continues to improve, and systems incorporating this technology will only become more prevalent.
© Chris Goldenstein. The author grants permission to copy, distribute and display this work in unaltered form, with attribution to the author, for noncommercial purposes only. All other rights, including commercial rights, are reserved to the author.
[1] R. Kurzweil, "When Will HAL Understand What We Are Saying?," in HAL's Legacy: 2001's Computer as Dream and Reality, ed. by D. G. Stork (MIT Press, 1998).