Comparing human and automated approaches to video description
Conference: Media for All 8
The recent proliferation of (audio)visual content on the Internet, including growing volumes of user-generated content, intersects with Europe-wide legislative efforts to make such content more accessible to diverse audiences. As far as access to visual content is concerned, audio description (AD) is an established method for making content accessible to audiences with visual impairment. However, AD is expensive to produce and its coverage remains limited. This applies particularly to the often ephemeral user-generated (audio)visual content on social media, but the Internet more broadly also remains insufficiently accessible to people with sight loss, despite its central role in everyday life.

Advances in computer vision, machine learning and AI have led to increasingly accurate automatic image description. Although current systems focus largely on still images, attempts at automating moving image description have also begun to emerge (Huang et al., 2015; Rohrbach et al., 2017). One obvious question arising from these developments is how machine-generated descriptions compare with their human-made counterparts; initial examination reveals stark differences between the two methods. A more immediate question is where human endeavour might prove most fruitful in the development of effective approaches to automating moving image description.

This presentation reports on an initial study comparing human and machine-generated descriptions of moving images, aimed at identifying the key characteristics and patterns of each method. The study draws on corpus-based and discourse-based approaches to analyse, among other features, lexical choices, focalisation and consistency of description. In particular, we will discuss human techniques and strategies that can inform and guide the automation of description. The broader aim of this work is to advance current understanding of multimodal content description and to contribute to enhancing content description services and technologies.

This presentation is supported by an EU H2020 grant (MeMAD: Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy).
Submitted by Irene Tor on Fri, 05/07/2019 - 09:45