
University of Washington researchers developed the first system that incorporates tiny cameras into off-the-shelf wireless earbuds to allow users to talk with an AI model about the scene in front of them. For instance, a user might turn to a Korean food package and say, “Hey Vue, translate this for me.” They’d then hear an AI voice say, “The visible text translates to ‘Cold Noodles’ in English.”
The prototype system, called VueBuds, takes low-resolution, black-and-white images, which it transmits over Bluetooth to a phone or other nearby device. A small artificial intelligence model on the device then answers questions about the images within about a second. For privacy, all of the processing happens on the device, a small light turns on when the system is recording, and users can immediately delete images.
The team will present its research April 14 at the Association for Computing Machinery Conference on Human Factors in Computing Systems in Barcelona.
“We haven’t seen most people adopt smart glasses or VR headsets, in part because a lot of people don’t like wearing glasses, and they often come with privacy concerns, such as recording high-resolution video and processing it in the cloud,” said the senior author, a UW professor in the Paul G. Allen School of Computer Science & Engineering. “But almost everyone wears earbuds already, so we wanted to see if we could put visual intelligence into tiny, low-power earbuds, and also address privacy concerns in the process.”
Cameras use far more power than the microphones already in earbuds, so using the same sort of high-resolution cameras as those in smart glasses wouldn’t work. Also, Bluetooth can’t stream large amounts of data continuously, so the system can’t run continuous video.
The team found that using a low-power camera, roughly the size of a grain of rice, to shoot low-resolution, black-and-white still images limited battery drain and allowed for Bluetooth transmission while preserving performance.
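As a back-of-the-envelope illustration of why occasional stills work where continuous video doesn’t, the airtime for one frame over a constrained link is straightforward to estimate. The resolution and link-rate figures below are illustrative assumptions, not published VueBuds specifications:

```python
def frame_transfer_time_s(width: int, height: int, bits_per_pixel: int,
                          link_kbps: float) -> float:
    """Rough airtime to send one uncompressed frame over a wireless link.

    Ignores packet overhead and retransmissions, so real transfers take
    longer; the point is the order of magnitude.
    """
    bits = width * height * bits_per_pixel
    return bits / (link_kbps * 1000)

# Hypothetical numbers: a 240x240, 8-bit grayscale still over a link with
# ~500 kbps of usable throughput takes about 0.92 seconds per frame...
still = frame_transfer_time_s(240, 240, 8, 500)
# ...while 30 fps video at the same settings would need roughly 30x that
# throughput, far beyond what such a link can sustain.
```

Dropping to grayscale halves or quarters the payload compared with typical color formats, which is part of why the low-resolution, black-and-white design fits the Bluetooth budget.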
There was also the matter of placement.
“One big question we had was: Will your face obscure the view too much? Can earbud cameras capture the user’s view of the world reliably?” said the lead author, who completed this work as a UW doctoral student in the Allen School.
The team found that angling each camera 5-10 degrees outward provides a 98-108 degree field of view. While this creates a small blind spot when objects are held closer than 20 centimeters from the user, people rarely hold things that close to examine them, making it a non-issue for typical interactions.
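The geometry behind those numbers can be sketched with a simple planar model. This is our illustration, not the authors’ analysis: the 88-degree lens is back-solved from the reported 98-108 degree range, and the 16-centimeter ear separation is an assumed value.

```python
import math

def combined_fov_deg(cam_fov_deg: float, toe_out_deg: float) -> float:
    """Union of two identical camera views, one per ear, each rotated
    outward by toe_out_deg: the outer edges swing apart by twice the
    toe-out angle (assuming the views still overlap in the middle)."""
    return cam_fov_deg + 2 * toe_out_deg

def center_blind_distance_cm(ear_sep_cm: float, cam_fov_deg: float,
                             toe_out_deg: float) -> float:
    """Distance ahead of the face at which the two inner view edges
    cross; a centered object closer than this sits in the blind spot."""
    inner_edge = math.radians(cam_fov_deg / 2 - toe_out_deg)
    return (ear_sep_cm / 2) / math.tan(inner_edge)
```

With an assumed 88-degree lens, `combined_fov_deg(88, 5)` gives 98 and `combined_fov_deg(88, 10)` gives 108, matching the reported range. The blind-spot model is idealized (it ignores occlusion by the face), so it underestimates the article’s roughly 20-centimeter figure.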
Researchers also discovered that while the vision language model was largely able to make sense of the images from each earbud, having to process images from both earbuds slowed it down. So they had the system “stitch” the two images into one, identifying overlapping imagery and combining it. This allows the system to respond in one second, quick enough to feel like real-time for users, rather than the two seconds it takes with separate images.
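A minimal sketch of the stitching idea, assuming a purely horizontal offset between the two grayscale views and using plain NumPy; the function name and approach are ours, and the paper’s actual stitching method may differ:

```python
import numpy as np

def stitch_pair(left: np.ndarray, right: np.ndarray,
                max_shift: int = 40) -> np.ndarray:
    """Merge two same-height grayscale views into one wide frame by
    finding the horizontal overlap with the strongest correlation."""
    h, w = left.shape
    best_shift, best_score = 0, -np.inf
    # Candidate overlaps: the rightmost `shift` columns of the left view
    # against the leftmost `shift` columns of the right view.
    for shift in range(1, max_shift + 1):
        a = left[:, w - shift:].astype(float)
        b = right[:, :shift].astype(float)
        # Normalized correlation as the overlap score.
        score = np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std() + 1e-9)
        if score > best_score:
            best_score, best_shift = score, shift
    # Average the shared strip, then concatenate the non-overlapping parts.
    overlap = (left[:, w - best_shift:].astype(float) +
               right[:, :best_shift].astype(float)) / 2
    return np.hstack([left[:, :w - best_shift],
                      overlap.astype(left.dtype),
                      right[:, best_shift:]])
```

Searching candidate overlaps by correlation and averaging the shared strip is the simplest version of this idea; a production system would also need to handle vertical offset, rotation, and exposure differences between the two cameras.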
The team then had 74 participants compare recorded outputs from VueBuds with outputs from Ray-Ban Meta Glasses in a series of tests. Despite VueBuds using low-resolution images with greater privacy controls, while the Ray-Bans took high-resolution images processed in the cloud, the two systems performed comparably overall. Participants preferred VueBuds’ translations, while the Ray-Bans did better at counting objects.
Sixteen participants also wore VueBuds and tested the system’s ability to translate and answer basic questions about objects. VueBuds achieved 83-84% accuracy when translating or identifying objects and 93% when identifying the author and title of a book.
This study was designed to gauge the feasibility of integrating cameras in wireless earbuds. Since the system takes only grayscale images, it can’t answer questions that involve color in the scene.
The team wants to add color to the system, though color cameras require more power, and to train specialized AI models for specific use cases, such as translation.
“This study lets us glimpse what’s possible just using a general-purpose language model and our wireless earbuds with cameras,” Kim said. “But we’d like to study the system more rigorously for applications like reading a book (for people who have low vision or are blind, for instance) or translating text for travelers.”
Co-authors include a UW master’s student in the Allen School and several UW students in electrical and computer engineering.
For more information, contact vuebuds@cs.washington.edu.