Siri Personal Assistant

Justin Lee
June 11, 2012

Submitted as coursework for PH250, Stanford University, Spring 2012

The Vision

Siri is a personal assistant application for iOS, the iPhone operating system. It gained mass popularity through its incorporation into the iPhone 4S, which was released in October 2011. Siri uses natural language processing and components of artificial intelligence to assist the phone owner by answering questions and performing actions, either by using tools available on the device or by invoking a preexisting web service. The venture behind it was founded in December 2007, and the application was developed by, and named after, the Stanford Research Institute (SRI), based in Menlo Park. Apple, a company that had long envisioned a competent, human-like talking virtual assistant integrated into its mobile operating system, acquired Siri in April 2010 for a speculated price of $200 million. [1]

How Siri Works

Siri's method of service delivery is similar to that of many other speech recognition applications. The first step is to translate the spoken words into recognizable text that can be parsed. The second step is to derive meaning from the words identified in that text.

First, the user speaks into the smartphone's microphone, which collects and digitizes the speech. Next, Siri sends the digitized data to a remote speech recognition server via the Internet; note that Siri does not function without Internet access. The remote server processes the speech against a statistical model to estimate, based on the sounds spoken and their order, which letters and words most probably make up the digitized speech. The voice recognition works by sending the speech to hardware processors that break the audio sound waves down into "most likely words" using a stored database of previously processed speech and its corresponding text. [2] Simultaneously, the digitized speech is compared locally against a simplified version of the statistical model. Both the server and the local recognizer determine which text is most likely to correspond to the speech that the user offered.
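The idea of choosing the "most likely words" from both the sounds and their order can be illustrated with a toy sketch. This is not Siri's actual recognizer: the tables ACOUSTIC and BIGRAM, their probabilities, and the function names are all hypothetical values invented for illustration. The sketch scores every candidate word sequence by how well each word fits the sound and how plausible the word order is, then keeps the best one.

```python
import math
from itertools import product

# Hypothetical acoustic scores: how well each candidate word matches
# each spoken segment (made-up numbers, not real recognizer output).
ACOUSTIC = [
    {"tell": 0.6, "sell": 0.4},
    {"william": 0.7, "willing": 0.3},
]

# Hypothetical word-order model: probability of a word given the
# previous word; "<s>" marks the start of the utterance.
BIGRAM = {
    ("<s>", "tell"): 0.5,
    ("<s>", "sell"): 0.5,
    ("tell", "william"): 0.4,
    ("tell", "willing"): 0.01,
    ("sell", "william"): 0.05,
    ("sell", "willing"): 0.01,
}


def score(words):
    """Log-probability of a word sequence: sound fit plus word-order fit."""
    total = 0.0
    previous = "<s>"
    for segment, word in zip(ACOUSTIC, words):
        total += math.log(segment[word]) + math.log(BIGRAM[(previous, word)])
        previous = word
    return total


def best_transcription():
    """Enumerate every candidate sequence and return the highest-scoring one."""
    candidates = product(*(segment.keys() for segment in ACOUSTIC))
    return max(candidates, key=score)


print(best_transcription())
```

A real recognizer cannot enumerate every sequence as this sketch does; the exhaustive search here is only practical because the example has four candidates.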

Second, Siri tries to derive meaning from the generated text. As soon as the result is available, the server returns a text version of the digitized speech to Siri, which displays it for the user to see. From there, Siri creates a list of the most probable commands or actions correlated with the keywords identified in the text. In other words, Siri has a premade list of the actions or queries most likely intended given a series of keywords. For example, if the words "tell" and "William" were found in the transcription, there would be a high association with either the creation of an email or, more likely, a text message to a person or contact called "William". Siri would then verify that "William" exists as a contact and place the rest of the transcribed command into the body of a text message. If at any point Siri's confidence in the user's intent falls below a certain probability threshold, it asks the user for clarification. [3]
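The keyword-to-action step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Siri's implementation: the table INTENT_KEYWORDS, its weights, the CONTACTS set, the CONFIDENCE_BAR value, and the function interpret are all hypothetical. The sketch scores each candidate action by its strongest matching keyword, checks that the recipient exists, and asks for clarification when confidence is too low.

```python
# Hypothetical keyword weights: how strongly each keyword suggests an action.
INTENT_KEYWORDS = {
    "send_text": {"tell": 0.6, "text": 0.9, "message": 0.8},
    "send_email": {"tell": 0.3, "email": 0.9, "mail": 0.8},
    "set_reminder": {"remind": 0.9, "remember": 0.6},
}

# Hypothetical address book and confidence threshold.
CONTACTS = {"william"}
CONFIDENCE_BAR = 0.5


def interpret(transcript):
    """Map a transcription to the most probable action, or ask for clarification."""
    words = transcript.lower().split()

    # Score each action by its strongest matching keyword in the transcript.
    scores = {
        intent: max((weights.get(word, 0.0) for word in words), default=0.0)
        for intent, weights in INTENT_KEYWORDS.items()
    }
    intent, confidence = max(scores.items(), key=lambda item: item[1])

    # Below the bar, do not guess: ask the user instead.
    if confidence < CONFIDENCE_BAR:
        return {"action": "ask_clarification"}

    if intent in ("send_text", "send_email"):
        # Verify that the named recipient exists among the contacts.
        recipient = next((word for word in words if word in CONTACTS), None)
        if recipient is None:
            return {"action": "ask_clarification"}
        # Everything that is neither a keyword nor the recipient becomes the body.
        body = " ".join(
            word for word in words
            if word not in INTENT_KEYWORDS[intent] and word != recipient
        )
        return {"action": intent, "to": recipient, "body": body}

    return {"action": intent, "body": transcript}


print(interpret("Tell William I am running late"))
print(interpret("What is the weather"))
```

In this sketch "tell" is weighted more heavily toward a text message than an email, mirroring the example in the text; a production system would presumably learn such weights from data rather than hard-code them.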

The Siri Personality

Technologically, Siri is similar to other preexisting language processing software, such as Dragon Go by Nuance or Watson by IBM. One aspect that distinguishes it is that Siri integrates successfully with a wide array of applications, both native iOS applications and external Web applications. But perhaps more stirring than the technology is the attempt to pass Siri off as something more than a machine, perhaps something believably human.

Not only does Siri answer practical questions, but she also has a believable human personality, which is a characteristic her counterparts lack. A large component of this humanness is humor. For instance, Yael Baker, a public relations and media consultant in New York, says that Siri allowed her to dictate text messages while driving and reminded her not to leave the house without keys or coffee. In response to Baker's daily marriage proposals, Siri dryly replies, "That's sweet but let's just be friends" and "Thanks, Yael, but I'm just here to serve you." Martin Lindstrom, a branding consultant, suggests that humans are good at identifying the potential for human dimensions in anything so as to bond with it. [4] Siri's ability to lay the foundation for emotional ties is a significant advantage, and it may explain why this software could succeed in gaining broad acceptance as a friendly operating system interface where other applications have not.

First Steps Into a Larger World

Effective and persuasive natural language processing and speech recognition are among the elusive and timeless holy grails of computer science. While Siri is far from the 100% seamless, all-powerful interface we might expect from a science fiction film, it represents a tremendous leap forward in a field that has so far seen little practical success. By demonstrating the practical applications of a speech recognition system that is well integrated with Apple's iPhone user interface, Siri is both a useful and a persuasive advancement of natural language processing and artificial intelligence technology.

© Justin Lee. The author grants permission to copy, distribute and display this work in unaltered form, with attribution to the author, for noncommercial purposes only. All other rights, including commercial rights, are reserved to the author.

References

[1] B. Miser, Using iPhone's Siri Voice Command (Que, 2011).

[2] N. Singer, "The Human Voice, as Game Changer," New York Times, 31 Mar 12.

[3] R. Pieraccini and L. Rabiner, The Voice in the Machine: Building Computers That Understand Speech (MIT Press, 2012), pp. 285-287.

[4] A. Considine, "Now Your Phone Talks Back and Humors You," New York Times, 4 Nov 11.