Noise-canceling headphones have gotten very good at creating an auditory blank slate. But allowing certain sounds from a wearer’s environment through the erasure still challenges researchers. The latest edition of Apple’s AirPods Pro, for instance, automatically adjusts sound levels for wearers, sensing when they’re in conversation, but the user has little control over whom to listen to or when this happens.
A University of Washington team has developed an artificial intelligence system that lets a user wearing headphones look at a person speaking for three to five seconds to “enroll” them. The system, called “Target Speech Hearing,” then cancels all other sounds in the environment and plays just the enrolled speaker’s voice in real time, even as the listener moves around in noisy places and no longer faces the speaker.
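The enroll-then-filter loop can be sketched in a few lines. This is a minimal illustrative sketch, not the team’s implementation: the real system learns the speaker’s voice with a neural network, while the `TargetSpeechFilter` class below (a hypothetical name) stands in with a crude spectral mask, and the sample rate and chunk size are assumptions.

```python
import numpy as np

FS = 16_000    # sample rate in Hz (assumption)
CHUNK = 256    # samples per real-time processing hop (assumption)

class TargetSpeechFilter:
    """Toy stand-in for the learned target-speech model: enroll() captures a
    crude spectral signature from a short clip of the target voice, and
    process() applies it as a frequency-domain mask to each incoming chunk.
    The real system instead trains a neural network on the enrolled voice."""

    def __init__(self):
        self.mask = None

    def enroll(self, clip):
        # Average magnitude spectrum over the clip's chunks, normalized to [0, 1].
        frames = clip[: len(clip) // CHUNK * CHUNK].reshape(-1, CHUNK)
        spec = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
        self.mask = spec / (spec.max() + 1e-9)

    def process(self, chunk):
        # Attenuate frequencies that are weak in the enrolled voice's signature.
        assert self.mask is not None, "enroll a speaker first"
        return np.fft.irfft(np.fft.rfft(chunk) * self.mask, n=len(chunk))

# Enroll on ~3 seconds of "speech", then filter a stream chunk by chunk.
rng = np.random.default_rng(0)
model = TargetSpeechFilter()
model.enroll(rng.standard_normal(3 * FS))        # stand-in for the 3-5 s look
stream = rng.standard_normal(4 * CHUNK)          # stand-in for live audio
out = np.concatenate([model.process(stream[i : i + CHUNK])
                      for i in range(0, len(stream), CHUNK)])
print(out.shape)  # (1024,)
```

Processing hop by hop, rather than on the whole recording, is what makes the real-time playback described above possible.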
The team presented its findings May 14 in Honolulu at the ACM CHI Conference on Human Factors in Computing Systems. The code is available for others to build on. The system is not commercially available.
“We tend to think of AI now as web-based chatbots that answer questions,” said the senior author, a UW professor in the Paul G. Allen School of Computer Science & Engineering. “But in this project, we develop AI to modify the auditory perception of anyone wearing headphones, given their preferences. With our devices you can now hear a single speaker clearly even if you are in a noisy environment with lots of other people talking.”
To use the system, a person wearing off-the-shelf headphones fitted with microphones taps a button while directing their head at someone talking. The sound waves from that speaker’s voice then should reach the microphones on both sides of the headset simultaneously; there’s a 16-degree margin of error. The headphones send that signal to an on-board embedded computer, where the team’s machine learning software learns the desired speaker’s vocal patterns. The system latches onto that speaker’s voice and continues to play it back to the listener, even as the pair moves around. The system’s ability to focus on the enrolled voice improves as the speaker keeps talking, giving the system more training data.
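The directional check behind enrollment can be illustrated with a short sketch: a speaker straight ahead produces a near-zero time difference of arrival (TDoA) between the two headset microphones, and the 16-degree margin bounds how large that difference may be. Everything here except the 16-degree figure is an assumption (the mic spacing, sample rate, and function names are hypothetical).

```python
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
MIC_SPACING = 0.18       # m between the two headset mics (assumption)
FS = 16_000              # Hz sample rate (assumption)
MARGIN_DEG = 16.0        # margin of error reported for the system

def estimate_tdoa(left, right):
    """Estimate the time difference of arrival in seconds from the peak of
    the cross-correlation between the two mic signals."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)   # lag in samples
    return lag / FS

def within_enrollment_cone(tdoa, margin_deg=MARGIN_DEG):
    """A source dead ahead gives tdoa ~ 0; a source at angle theta gives
    |tdoa| ~ (d / c) * sin(theta), so the cone bound follows directly."""
    max_tdoa = (MIC_SPACING / SPEED_OF_SOUND) * np.sin(np.radians(margin_deg))
    return abs(tdoa) <= max_tdoa

# Synthetic check with noise standing in for speech.
rng = np.random.default_rng(0)
sig = rng.standard_normal(4000)
print(within_enrollment_cone(estimate_tdoa(sig, sig)))      # frontal: True
shifted = np.roll(sig, 40)   # ~2.5 ms inter-mic delay: far off-axis
print(within_enrollment_cone(estimate_tdoa(sig, shifted)))  # off-axis: False
```

With these assumed numbers the 16-degree cone allows a delay of only about 0.15 ms (roughly two samples at 16 kHz), which is why the wearer must face the speaker during enrollment.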
The team tested its system on 21 subjects, who rated the clarity of the enrolled speaker’s voice nearly twice as high as the unfiltered audio on average.
This work builds on the team’s previous “semantic hearing” research, which allowed users to select specific sound classes, such as birds or voices, that they wanted to hear and canceled other sounds in the environment.
Currently the TSH system can enroll only one speaker at a time, and it’s only able to enroll a speaker when there is not another loud voice coming from the same direction as the target speaker’s voice. If a user isn’t happy with the sound quality, they can run another enrollment on the speaker to improve the clarity.
The team is working to expand the system to earbuds and hearing aids in the future.
Additional co-authors on the paper were UW doctoral students in the Allen School and the director of research at AssemblyAI. This research was funded by a Moore Inventor Fellow award, among other sources.
For more information, contact tsh@cs.washington.edu.