A team led by researchers at the University of Washington has developed deep-learning algorithms that let users pick which sounds filter through their headphones in real time. Pictured is co-author Malek Itani demonstrating the system. Photo: University of Washington

Most anyone who's used noise-canceling headphones knows that hearing the right noise at the right time can be vital. Someone might want to erase car horns when working indoors, but not when walking along busy streets. Yet people can't choose what sounds their headphones cancel.

Now, a team led by researchers at the University of Washington has developed deep-learning algorithms that let users pick which sounds filter through their headphones in real time. The team is calling the system "semantic hearing." Headphones stream captured audio to a connected smartphone, which cancels all environmental sounds. Either through voice commands or a smartphone app, headphone wearers can select which sounds they want to include from 20 classes, such as sirens, baby cries, speech, vacuum cleaners and bird chirps. Only the selected sounds are played through the headphones.
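In spirit, the selection step works like a switchboard: once the system has separated the incoming audio into per-class streams, only the streams the wearer has enabled are mixed back into the output. The sketch below illustrates that idea; the function and stream names are assumptions for illustration, not the authors' actual API.

```python
import numpy as np

def mix_selected(separated: dict[str, np.ndarray], selected: set[str]) -> np.ndarray:
    """Sum the waveforms of the wearer-selected sound classes; mute the rest.

    `separated` maps a class label (e.g. "siren") to its separated waveform.
    This is a hypothetical sketch of the mixing step only, not the
    neural separation itself.
    """
    length = len(next(iter(separated.values())))
    out = np.zeros(length, dtype=np.float32)
    for label, waveform in separated.items():
        if label in selected:
            out += waveform
    # Clamp in case several selected streams overlap loudly.
    return np.clip(out, -1.0, 1.0)

# Example: keep sirens and speech, drop traffic noise.
streams = {
    "siren": np.full(8, 0.2, dtype=np.float32),
    "speech": np.full(8, 0.3, dtype=np.float32),
    "traffic": np.full(8, 0.9, dtype=np.float32),
}
mixed = mix_selected(streams, {"siren", "speech"})
```

The hard part, of course, is producing the separated streams in the first place; the mixing itself is cheap.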

The team presented its findings Nov. 1 in San Francisco. In the future, the researchers plan to release a commercial version of the system.

"Understanding what a bird sounds like and extracting it from all other sounds in an environment requires real-time intelligence that today's noise-canceling headphones haven't achieved," said the senior author, a UW professor in the Paul G. Allen School of Computer Science & Engineering. "The challenge is that the sounds headphone wearers hear need to sync with their visual senses. You can't be hearing someone's voice two seconds after they talk to you. This means the neural algorithms must process sounds in under a hundredth of a second."
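A "hundredth of a second" is a tight budget: it bounds how much audio the model can even buffer before it must produce output. A back-of-envelope calculation (illustrative numbers, not from the paper) makes the constraint concrete:

```python
# If end-to-end audio must reach the ear within ~10 ms, the chunk of audio
# the network consumes can be at most that long -- and inference, transfer
# to the phone, and playback must all fit inside the same window.
SAMPLE_RATE_HZ = 44_100   # a common headphone audio rate (assumption)
BUDGET_S = 0.010          # "under a hundredth of a second"

max_chunk_samples = int(SAMPLE_RATE_HZ * BUDGET_S)
print(max_chunk_samples)  # 441 samples per chunk at most
```

At 44.1 kHz that is only a few hundred samples per step, which is why the model has to run on the phone itself rather than round-tripping audio to a cloud server.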

Because of this time crunch, the semantic hearing system must process sounds on a device such as a connected smartphone, instead of on more robust cloud servers. Additionally, because sounds from different directions arrive in people鈥檚 ears at different times, the system must preserve these delays and other spatial cues so people can still meaningfully perceive sounds in their environment.
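One way to preserve those spatial cues (a minimal sketch of the general idea, not the authors' method) is to apply identical gain to both ear channels, so the interaural time and level differences between them are left untouched:

```python
import numpy as np

def apply_binaural_mask(left: np.ndarray, right: np.ndarray,
                        mask: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Attenuate unwanted sound equally in both ears.

    Because the same per-sample mask is applied to both channels, the
    delay and loudness differences between the ears -- the cues the brain
    uses to localize sound -- survive the filtering.
    """
    return left * mask, right * mask

# Example: the right ear hears the same click two samples after the left,
# an interaural time difference that encodes direction.
left = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
right = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
mask = np.full(5, 0.5)
l_out, r_out = apply_binaural_mask(left, right, mask)
# The two-sample lag between the peaks is unchanged after masking.
```

Any processing that delayed or reshaped one channel differently from the other would smear these cues and make sounds seem to come from the wrong place.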


Tested in environments such as offices, streets and parks, the system was able to extract sirens, bird chirps, alarms and other target sounds, while removing all other real-world noise. When 22 participants rated the system鈥檚 audio output for the target sound, they said that on average the quality improved compared to the original recording.

In some cases, the system struggled to distinguish between sounds that share many properties, such as vocal music and human speech. The researchers note that training the models on more real-world data might improve these outcomes.

Additional co-authors on the paper were two UW doctoral students in the Allen School; a researcher who completed this work as a doctoral student in the Allen School and is now at Carnegie Mellon University; and the director of research at AssemblyAI.

For more information, contact semantichearing@cs.washington.edu.