Title
From caption to description. Improving the accessibility of audiovisual storytelling for different audiences through human-machine integrated workflows for video-to-text translation
Conference name
EST Congress 2022
City
Oslo
Country
Norway
Date
22/06/2022-25/06/2022
Abstract
In today’s increasingly multimodal world, a crucial prerequisite for participating meaningfully in society is the ability to access and assimilate audiovisual stories. This requires a robust approach to meeting access needs, including those of visually impaired people, individuals with diverse sensory and cognitive abilities, and people with different levels of literacy or language skills, mainly by translating audiovisual stories into, or complementing them with, subtitles, audio descriptions, easy-to-understand texts or similar. It also requires a radical review of workflows and training needs across the audiovisual industry. However, human resources are not always available to cover all contexts, especially the exponentially growing visual content on social media and the web. Automated approaches to describing visual content (‘video captioning’) continue to improve, thanks to more widely available video training data and to advances in computer vision and deep learning that have progressed object and character tracking in video scenes, action and scene detection, and so forth. Yet extracting a multimodal understanding from an audiovisually told story, and casting it into words, remains a significant challenge for a machine (Krishna et al., 2017; Aafaq et al., 2019; Braun & Starr, 2021). To increase the capacity for creating meaningful (and personalised/customised) access to audiovisual content for diverse audiences, several challenges have to be addressed, namely (1) improving our understanding of how people with diverse abilities understand and translate audiovisual stories; (2) designing human-machine integrated workflows that complement human capability in video-to-text translation (instead of replacing it), with due regard for the wellbeing of the human experts; and (3) responsibly creating unbiased datasets and developing explainable algorithms to enhance (semi-)automated storytelling. This presentation addresses the first two challenges and highlights implications for the third. Based on the recently completed MeMAD project, we first consider approaches to human multimodal storytelling and the state of the art of automated ‘video captioning’. Drawing on a user experience study that tested the MeMAD video description prototype (Braun et al., 2021), which enables professional video describers to post-edit machine-generated video captions, we then discuss the prerequisites for a successful, access-focused human-machine workflow for video-to-text translation. Finally, we consider the questions and opportunities that these emerging human-machine workflows raise for the training of language professionals.
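To make the second challenge more concrete, the sketch below illustrates one possible shape of a machine-first, human-final description workflow of the kind the abstract describes. It is a minimal, hypothetical Python illustration, not MeMAD code: the names (CaptionDraft, post_edit_workflow) and the captioner and editor callables are assumptions standing in for an automated video-captioning model and a professional describer's post-editing step.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CaptionDraft:
    """A machine-generated description of a video segment, awaiting human review."""
    segment_id: str
    start_s: float
    end_s: float
    machine_text: str
    edited_text: str | None = None

    @property
    def final_text(self) -> str:
        # The human edit, where available, overrides the machine draft.
        return self.edited_text if self.edited_text is not None else self.machine_text


def post_edit_workflow(
    segments: list[tuple[str, float, float]],
    captioner: Callable[[str], str],
    editor: Callable[[CaptionDraft], str],
) -> list[CaptionDraft]:
    """Run a machine-first, human-final description pass over video segments.

    `captioner` stands in for any automated video-captioning model;
    `editor` stands in for the professional describer's post-editing step,
    keeping the human expert in control of the published text.
    """
    drafts = []
    for segment_id, start_s, end_s in segments:
        draft = CaptionDraft(segment_id, start_s, end_s,
                             machine_text=captioner(segment_id))
        draft.edited_text = editor(draft)  # human revises or confirms the draft
        drafts.append(draft)
    return drafts


if __name__ == "__main__":
    # Toy stand-ins: a fixed "model" output and an editor who sharpens it.
    result = post_edit_workflow(
        segments=[("scene-01", 0.0, 4.2)],
        captioner=lambda seg: "a person walks in a room",
        editor=lambda d: d.machine_text.replace("a person", "A woman in a red coat"),
    )
    for d in result:
        print(f"[{d.start_s:.1f}-{d.end_s:.1f}s] {d.final_text}")
```

The design choice embodied here mirrors the abstract's framing: the machine drafts, the human decides, so automation complements rather than replaces the describer's expertise.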