Title
Can facial animations support speech comprehension?
Conference name
Media for All 10 Conference
City
Country
Belgium
Modalities
Date
06/07/2023-07/07/2023
Abstract
For people with hearing loss, it is often difficult to understand speech, especially in noisy environments. For this reason, they often rely in part on lip-reading and other visual cues on the speaker’s face. In the presence of substantial background noise, even people without hearing loss may have difficulty understanding speech from the sound signal alone and often rely on visual cues as well. But in many situations the speaker’s face is not visible: think of radio, videos with voice-over, and phone calls. This presents a challenge for speech comprehension and, more generally, for the accessibility of certain types of media.
We have explored the respective benefits of two methods for converting speech into facial animations with the aim of supporting speech comprehension when the speaker’s face is not directly visible. One method, implemented in NVIDIA’s Audio2Face application, is based on the audio signal alone. The application takes a speech fragment as input and yields a corresponding animation of a virtual human face, i.e., an avatar, as output. We call this the AUDIO-based method. The second method makes use of computer vision technology. It consists of capturing the speaker’s face with a depth camera and using these data to obtain a corresponding facial animation. We implemented this method using the depth camera on an iPhone and the LiveLink application by Unreal Engine, which allows one to create animated MetaHuman avatars from iPhone depth-camera data. We call this the VISION-based method.
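The two pipelines can be summarised schematically as follows. This is a minimal, hypothetical sketch: neither function wraps a real API (Audio2Face and LiveLink are operated through their own tools, NVIDIA Omniverse and Unreal Engine), so the names and types below are placeholders for the pipeline stages, not actual library calls.

```python
# Hypothetical sketch of the two pipeline stages described above. Neither
# function wraps a real API: Audio2Face and LiveLink are driven through their
# own tools (NVIDIA Omniverse and Unreal Engine), so these names are
# placeholders for the pipeline stages, not actual library calls.

from dataclasses import dataclass
from pathlib import Path


@dataclass
class FacialAnimation:
    """Per-frame blendshape weights driving an avatar's face."""
    frames: list[dict[str, float]]  # e.g. {"jawOpen": 0.42, ...} for each frame
    fps: float


def audio_based_animation(speech_wav: Path) -> FacialAnimation:
    """AUDIO-based method: infer facial motion from the speech signal alone,
    without ever seeing the speaker (the role Audio2Face plays)."""
    raise NotImplementedError("placeholder for the audio-driven pipeline")


def vision_based_animation(depth_capture: Path) -> FacialAnimation:
    """VISION-based method: retarget a depth-camera recording of the real
    speaker's face onto an avatar (the role of the iPhone + LiveLink setup)."""
    raise NotImplementedError("placeholder for the vision-driven pipeline")
```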
Participants in our study heard sentences of the form ‘The train to [destination] is leaving at [time]’, for different destinations and departure times. During the pronunciation of the departure time, substantial background noise was always present. After hearing each fragment, participants were asked at what time the train was departing, which allowed us to check whether they had understood the speech signal despite the background noise. There were three conditions: (a) without visual support, (b) with an AUDIO-based facial animation, and (c) with a VISION-based facial animation. The experiment was carried out online, in Dutch. In total, 38 people with different levels of hearing loss participated.
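For concreteness, a trial list for this design could be built as in the sketch below. The destinations, departure times, and shuffling scheme are illustrative assumptions, not the study’s actual stimulus set; the real sentences were in Dutch.

```python
# A minimal sketch of how such a trial list could be constructed. The
# destinations, departure times, and counterbalancing shown here are
# illustrative assumptions, not the study's actual stimulus set.

import itertools
import random

DESTINATIONS = ["Amsterdam", "Brussels", "Ghent"]   # placeholder values
DEPARTURE_TIMES = ["9:20", "14:35", "17:45"]        # placeholder values
CONDITIONS = ["no_support", "audio_based", "vision_based"]


def build_trials(seed: int = 0) -> list[dict]:
    """Cross destinations, times, and conditions into a shuffled trial list."""
    rng = random.Random(seed)
    trials = [
        {
            # Sentence template from the abstract; the actual stimuli were Dutch.
            "sentence": f"The train to {dest} is leaving at {time}.",
            "masked_answer": time,   # background noise always covers the time
            "condition": cond,       # (a) none, (b) AUDIO-based, (c) VISION-based
        }
        for dest, time, cond in itertools.product(
            DESTINATIONS, DEPARTURE_TIMES, CONDITIONS
        )
    ]
    rng.shuffle(trials)
    return trials


def score_response(trial: dict, response: str) -> bool:
    """A response is correct if it matches the masked departure time."""
    return response.strip() == trial["masked_answer"]
```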
We found that fragments with VISION-based facial animations had significantly better comprehension rates than fragments without visual support, while fragments with AUDIO-based facial animations had significantly worse comprehension rates than fragments without visual support. These effects were stronger for people with higher levels of hearing loss. We conclude from these results that facial animations of sufficiently high quality can support speech comprehension but, equally importantly, that animations of insufficient quality can impair it. The AUDIO-based method, which is clearly the most scalable, does not yet seem to reach the required quality. However, we expect that further developments in this relatively young area of research will lead to improvements and may, in the near future, enable scalable, automated visual support that aids speech comprehension and increases media accessibility.
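The abstract reports significant condition differences and a stronger effect for participants with more hearing loss, but does not name the statistical model used. The sketch below shows one plausible way to test such effects on trial-level data; the file name and column names are assumptions, not the authors’ actual analysis pipeline.

```python
# One plausible analysis (not necessarily the authors'): a logistic regression
# of trial-level correctness on condition, hearing-loss level, and their
# interaction, with standard errors clustered by participant to account for
# repeated measures. File and column names below are assumptions.

import pandas as pd
import statsmodels.formula.api as smf

# Expected long format: one row per trial, with columns
#   participant  (id), condition ('no_support' | 'audio_based' | 'vision_based'),
#   hearing_loss (numeric severity), correct (0 or 1)
df = pd.read_csv("responses.csv")  # hypothetical file name

model = smf.logit(
    "correct ~ C(condition, Treatment(reference='no_support')) * hearing_loss",
    data=df,
).fit(cov_type="cluster", cov_kwds={"groups": df["participant"]})

# Condition terms give the AUDIO-based and VISION-based contrasts against the
# no-support baseline; interaction terms capture the hearing-loss dependence.
print(model.summary())
```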