Title
"Sense and sensibility". Examining bias in human-generated audio description and machine-derived video captions and the implications for video description in a digital world
Conference name
Media for All 10 Conference
City
Country
Belgium
Modalities
Date
06/07/2023-07/07/2023
Abstract
Video captioning workstreams, aimed at improving the accessibility of moving images for audiences with additional physical and cognitive needs, currently sit at the intersection of human endeavour and digital innovation. Human-derived audio description (AD) is generally accurate, narratively salient, and creatively engaging, but prohibitive production costs mean it is not viable to describe large volumes of material on the scale currently available to sighted audiences. (Semi-)automated video captioning is being developed as an alternative, parallel offering.
However, one element that has been largely overlooked in the conversation surrounding video caption automation is a risk-benefit analysis of underlying bias, both human and computational. From the automation perspective, generating AI-assisted video captions requires pre-training the machine on image-based training data. Flickr and Tumblr are frequently used sources of imagery, resulting in situational and generational bias, e.g. references to technology, high-adrenaline sports, etc. (Braun & Starr, 2019). As a result, machine-generated captions contain errors due to the over-representation of contemporary objects and topics. They also exhibit poor grammar, reflecting the methods used to recruit training-data captioners (i.e., amateur crowdworkers, piecemeal payment, inadequate lexical skills, lax adherence to captioning rules). Gender bias is evident in machine-based captions, with an over-representation of male-gendered words and a rigid adherence to outdated social conventions (men wear trousers and have short hair; women have long(er) hair and wear skirts). Machine-induced bias is further compounded by a lack of training in detecting narrative saliency, an inability to detect nuanced actions, and a failure to use audio cues to validate or correct visually biased captions. Computational methods also lack the resources to apply life knowledge and common sense when labelling iconic visual representations (e.g., a wedding, the farmers’ market, graduation celebrations).
By contrast, professional audio describers, typically drawn from language/communications backgrounds, generally find video captioning an intuitive task, moving readily between decoding nuanced human behaviours, contextualised storytelling, and information prioritisation to produce a multi-dimensional AD script. Nevertheless, models of discourse/narrative processing have highlighted different types of bias in human storytelling. In particular, these models point to differences in world knowledge, experience and contexts of reception to explain why individual recipients form different understandings of the same material. Like any translation activity, AD also creates a double comprehension ‘filter’: the description is shaped by the interpretation and agency of the audio describer, whose understanding of the audiovisual source material rests on their own cognitive environment, and it is subsequently ‘consumed’ by the target audience, adding a further layer of subjectivity when meaning is derived.
Our empirically derived conceptual presentation examines these issues of bias in machine-generated and human video description, asking how machine description could become more humanlike without succumbing to the pitfalls of personal bias and social convention that compromise objectivity. We will consider solutions that draw on the best of both worlds and envisage how these would play out in a world that is increasingly replacing human endeavour with AI-based solutions that lack transparency and often fail to satisfy ethical and quality standards.