Reviewed by Lexie Corner, May 24, 2024
An artificial intelligence system developed by a team at the University of Washington lets headphone wearers "enroll" a speaker by looking at them for three to five seconds. Known as “Target Speech Hearing” (TSH), the system cancels all other surrounding sounds and plays only the enrolled speaker's voice in real time, even as the listener moves around in noisy areas and no longer faces the speaker.
Noise-canceling headphones have mastered the ability to create an auditory blank slate. However, researchers are still working out how to let selected sounds from the wearer's surroundings pass through that erasure. The most recent version of Apple's AirPods Pro, for example, automatically adjusts sound levels in certain situations, such as when the wearer is in a conversation, but the user has little control over which sounds come through or when that happens.
The team presented its findings at the ACM CHI Conference on Human Factors in Computing Systems in Honolulu on May 14. The code for the proof-of-concept device is available for others to build on; the system is not yet commercially available.
“We tend to think of AI now as web-based chatbots that answer questions. But in this project, we develop AI to modify the auditory perception of anyone wearing headphones, given their preferences. With our devices, you can now hear a single speaker clearly even if you are in a noisy environment with lots of other people talking.”
Shyam Gollakota, Senior Author and Professor, Paul G. Allen School of Computer Science & Engineering, University of Washington
To use the system, a person wearing store-bought headphones equipped with microphones points their head toward a speaker and presses a button. The sound waves of that speaker's voice should then reach the microphones on both sides of the headset simultaneously, within a 16-degree margin of error. The headphones send that signal to an on-board embedded computer, where the team's machine learning software learns the desired speaker's vocal patterns.
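The enrollment step can be pictured in code. The sketch below is a minimal illustration under assumptions, not the team's implementation: it checks that the dominant voice arrives at both ears at nearly the same time (a rough stand-in for the 16-degree tolerance) and then distills a crude spectral "voice signature," where the real system learns vocal patterns with a neural network. The function names, the 150-microsecond threshold, and the signature computation are all hypothetical.

```python
# A minimal sketch of the enrollment step. The encoder here is a crude
# spectral average, NOT the team's actual neural network.
import numpy as np

SAMPLE_RATE = 16_000  # assumed sampling rate for the illustration

def time_difference_of_arrival(left: np.ndarray, right: np.ndarray) -> float:
    """Estimate the inter-ear delay (seconds) via cross-correlation.
    A voice directly ahead reaches both microphones at nearly the same
    time, so its delay is close to zero."""
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    return lag / SAMPLE_RATE

def enroll_speaker(left: np.ndarray, right: np.ndarray,
                   max_delay_s: float = 150e-6) -> np.ndarray:
    """Enroll the speaker the wearer is looking at.
    Rejects the sample if the dominant voice is off-axis (the ~150 us
    default roughly matches what a ~16-degree off-axis source produces
    for a typical head width); otherwise returns a placeholder
    "voice signature" vector. Expects a few seconds of audio."""
    if abs(time_difference_of_arrival(left, right)) > max_delay_s:
        raise ValueError("Speaker is not in front of the wearer; try again.")
    mono = 0.5 * (left + right)
    # Stand-in signature: log-magnitude spectrum averaged over short frames.
    frames = np.lib.stride_tricks.sliding_window_view(mono, 512)[::256]
    spectra = np.abs(np.fft.rfft(frames * np.hanning(512), axis=1))
    return np.log1p(spectra).mean(axis=0)
```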
The system latches onto that speaker's voice and keeps playing it back to the listener, even as the two move around. And its capacity to focus on the enrolled voice improves as the speaker continues speaking, giving the system more training data.
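As a rough illustration of that feedback loop, the sketch below (continuing the hypothetical enroll_speaker example above) passes audio through only when a chunk's spectrum matches the enrolled signature, and folds each matching chunk back into the signature so the match sharpens as the speaker keeps talking. The cosine-similarity test and the update rate are stand-ins; the team's actual separator is a neural network.

```python
# Toy playback loop: play chunks that match the enrolled voice, silence
# the rest, and refine the signature with every matching chunk.
def stream_target_voice(chunks, signature, threshold=0.9, lr=0.05):
    """Yield only audio chunks whose spectrum matches the enrolled voice.
    Each chunk is assumed to hold at least 512 samples."""
    for chunk in chunks:
        frames = np.lib.stride_tricks.sliding_window_view(chunk, 512)[::256]
        spectra = np.abs(np.fft.rfft(frames * np.hanning(512), axis=1))
        feat = np.log1p(spectra).mean(axis=0)
        sim = feat @ signature / (np.linalg.norm(feat) * np.linalg.norm(signature))
        if sim >= threshold:
            # Matching speech doubles as fresh training data: nudge the
            # signature toward the new sample so focus improves over time.
            signature = (1 - lr) * signature + lr * feat
            yield chunk                  # play the enrolled voice
        else:
            yield np.zeros_like(chunk)   # cancel everything else
```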
The team tested the system on 21 subjects, who on average rated the clarity of the enrolled speaker's voice nearly twice as high as that of the unfiltered audio.
This work builds on the team’s previous “semantic hearing” research, which enabled users to select specific sound classes, such as birds or voices, to hear while canceling out other environmental sounds.
The TSH system can currently enroll only one speaker at a time, and only when no other loud voice is coming from the same direction as the target speaker's. If users are not satisfied with the sound quality, they can run a second enrollment on the speaker to improve clarity.
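In terms of the hypothetical sketches above, a second enrollment could, for instance, simply be blended into the existing signature; the team's actual re-enrollment procedure may differ.

```python
# Hypothetical re-enrollment: average a fresh signature into the old one.
def re_enroll(signature: np.ndarray,
              left: np.ndarray, right: np.ndarray) -> np.ndarray:
    return 0.5 * (signature + enroll_speaker(left, right))
```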
In the future, the team hopes to expand the system to include earbuds and hearing aids.
Takuya Yoshioka, director of research at AssemblyAI, and doctoral students Bandhav Veluri, Malek Itani, and Tuochao Chen of the University of Washington's Allen School were co-authors on the paper. The study was funded by the Thomas J. Cable Endowed Professorship, the Moore Inventor Fellow award, and the UW CoMotion Innovation Gap Fund.
AI headphones filter out noise so you hear one voice in a crowd. Video Credit: University of Washington