An interesting entry detailing exactly how Apple’s “Hey, Siri” feature works appeared in the Apple Machine Learning Journal yesterday. It definitely is NOT light reading (especially as late in the evening as I read it), but it is a fascinating look under the hood of a feature that many people use quite often. The article also details machine learning techniques that may be used in the future to improve this feature.
As for me, I rarely used “Hey, Siri” when it was released with the iPhone 6, but I find myself taking advantage of it much more often since the release of the original Apple Watch. I also use it anytime I fire up the HomePod to listen to music at home, and I use it with my iPhone when I am in the car, as well. When and where Apple customers choose to use “Hey, Siri” will vary based on personal preferences, but its convenience means that most of us will use it in some way or another. This makes it a crucial part both of how Siri works and of how effective we perceive Siri to be.
I think it is interesting both how in-depth the authors went into the way “Hey, Siri” works right now, and how forthcoming Apple was in talking about the challenges that this feature poses. First off, the authors (listed as the Siri Team) give us a detailed and very technical description of how the feature currently works. Anyone who has set up an iPhone in the last three and a half years is familiar with the “Hey, Siri” training process. This process has the user give five responses with slight differences in the trigger phrase. What we know now is that Apple also adds 35 more accepted “Hey, Siri” instances to your voice profile, all of which are stored on the device in question. These records are mathematically scored based on their quality and accuracy, and subsequent “Hey, Siri” tries must meet an average score based on this pool of results to be accepted.
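The scored-pool idea the article describes can be sketched roughly like this. To be clear, this is an illustrative toy, not Apple’s actual algorithm: the scores, the pool size of five shown here, and the 20% acceptance margin are all made-up values standing in for whatever scoring function and threshold Apple really uses.

```python
# Toy sketch of a profile built from stored, scored trigger recordings,
# with new attempts accepted against a threshold derived from the pool.
# All numbers here are hypothetical, purely for illustration.

def build_threshold(reference_scores):
    """Derive an acceptance threshold from the pool of stored examples.

    `reference_scores` stands in for the quality/accuracy scores of the
    ~40 stored "Hey, Siri" recordings the article describes.
    """
    mean = sum(reference_scores) / len(reference_scores)
    # Hypothetical margin: accept attempts scoring within 20% of the mean.
    return 0.8 * mean

def accepts(attempt_score, threshold):
    """Return True if a new attempt's score clears the profile threshold."""
    return attempt_score >= threshold

threshold = build_threshold([0.91, 0.88, 0.95, 0.90, 0.87])
print(accepts(0.85, threshold))  # a reasonably close attempt
print(accepts(0.40, threshold))  # a poor match
```

The point of the sketch is just the shape of the mechanism: the device never ships your recordings anywhere; it keeps a local pool, summarizes it into a score, and compares each new attempt against that summary.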
Photo Source: Apple
In Apple’s privacy-centric model, this method of on-device storage of voice profile data makes sense. This persistent local storage also ensures that we don’t have to retrain our device for “Hey, Siri” whenever Siri is updated on the server side. Because the local records remain, they can be used after the update to rebuild our voice profile with any changes to the service.
A disadvantage of the current setup is that Apple’s focus on user privacy means that it is much easier for us to end up with an inconsistent experience with “Hey, Siri.” Since these voice records aren’t synced between devices, it is possible for one Apple device to perform differently than another for the same person. This is especially true if one of the devices has a better microphone. As much as I appreciate Apple’s focus on user data privacy, I would also like to see a little more flexibility added to make Siri better.
I also found it interesting how candid the authors of this article are about the challenges that “Hey, Siri” poses. The biggest issues that affect accuracy seem to come from attempts made in environments with lots of background noise, which can vary pretty wildly. I find this to be a problem sometimes while driving a work van on a windy day. A loud outdoor environment, a busy open office area, or a large room with lots of echo can all pose different challenges to an on-board microphone. Based on my reading of the piece, it seems that the current training and profile building model struggles to adapt to all of the potential use environments, which leads to errors.
Those errors are broken down into three main categories: False Accepts, which are instances where the iPhone misinterprets another phrase as “Hey, Siri”; False Rejects, which are instances where saying “Hey, Siri” isn’t properly understood; and Imposter Accepts, which are instances where another person is able to trigger “Hey, Siri” on your Apple device. After a discussion about what these errors are and how they occur, the article then goes into great detail about how the “Hey, Siri” profile process can be better handled by using Deep Neural Networks. This should allow Apple to do away with the explicit training process and account for far more usage environments and situations. Our vocal characteristics would simply be added into this much wider and more flexible model, rather than relying solely on a group of 40 recorded responses sitting on our device.
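The three error categories come down to two questions about each trigger attempt: was the phrase actually said, and was it the owner saying it? A small sketch makes the breakdown concrete; the `Trial` type and field names here are my own, purely illustrative, and not taken from the article.

```python
# Illustrative breakdown of the three "Hey, Siri" error categories.
# The data structure and labels are hypothetical, not Apple's code.

from dataclasses import dataclass

@dataclass
class Trial:
    said_trigger: bool   # did the speaker actually say "Hey, Siri"?
    is_owner: bool       # was the speaker the device owner?
    accepted: bool       # did the detector fire?

def classify(t: Trial) -> str:
    if t.accepted and not t.said_trigger:
        return "False Accept"      # another phrase misheard as the trigger
    if not t.accepted and t.said_trigger and t.is_owner:
        return "False Reject"      # the owner's real attempt was missed
    if t.accepted and t.said_trigger and not t.is_owner:
        return "Imposter Accept"   # someone else triggered the device
    return "Correct"

print(classify(Trial(said_trigger=False, is_owner=True, accepted=True)))   # False Accept
print(classify(Trial(said_trigger=True, is_owner=True, accepted=False)))   # False Reject
print(classify(Trial(said_trigger=True, is_owner=False, accepted=True)))   # Imposter Accept
```

Seeing the categories laid out this way also makes clear why they trade off against each other: loosening the detector to cut False Rejects tends to raise False Accepts and Imposter Accepts, which is part of why a richer model helps.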
The authors predict that moving “Hey, Siri” to a machine learning model can cut False Accepts by 50%, False Rejects by 40%, and Imposter Accepts by a whopping 75%. That is a sizable improvement, and it raises the question: why hasn’t this already been done? We’ve heard enough about the lack of direction and past mismanagement of Siri over the last year to have a good guess at that answer. However, the fact that we are seeing all of this information laid out in a very public way this close to WWDC is no accident. I think this article points to work that Apple has been doing to improve its products, especially Siri, using AI and machine learning over the last year. I think that this is an interesting prelude to what we can expect to hear a lot about at the WWDC Keynote in just a month and a half.