Title
A modular automated audio description script generator
Conference name
Languages & the Media 2024
Country
Hungary
Date
13/11/2024-15/11/2024
Abstract
Audio description is more than just a technical requirement: it plays an invaluable role in enabling blind or partially sighted individuals to engage with visual media in a meaningful way. It opens up a world of experiences by transforming visuals into words, allowing everyone to participate in the stories, emotions, and information conveyed on screen. Despite its importance, the process of creating these descriptions has traditionally been slow, labour-intensive, and dependent on a deep understanding of both the media content and the audience's specific needs.

Recognising that, even with a rise in the production of audio-described materials, a significant amount of content still remains inaccessible, I was inspired to explore how technology could help bridge this gap. My goal was to develop an automated system that could transform the way audio descriptions are created, using readily available, open-source tools and AI-based solutions. I wanted to see if we could turn what is often a manual, time-consuming task into an efficient, scalable process that could make audio descriptions more widely available.

By integrating advanced audio and video processing techniques with AI-powered image captioning models, this approach not only drastically reduces the time and effort needed to produce audio descriptions but also opens up content to blind or partially sighted audiences that would otherwise not be audio described at all. The core idea of the design was to create a tool that makes audio description possible where none currently exists.

The workflow addresses some of the most significant challenges in the audio description process: pinpointing the best moments for inserting descriptions, generating scene descriptions, and aligning these seamlessly with the audio timeline. By harnessing some of the more recent advances in AI, I attempted to create an automated solution that generates an audio description which, although far from perfect, aims to provide information where there would normally be none.

The focus fell on the most difficult aspect of the audio description creation process: the creation of the audio description script itself, and the task of interpreting and transforming the visual information into text.

Process Overview

The automated workflow begins by extracting the audio from the media asset using FFmpeg, converting it into a WAV file for subsequent processing. Using Automatic Speech Recognition (ASR), the system attempts to identify sections of the audio where no speech is present, creating a draft subtitle file with empty cues timed to these gaps. These non-speech intervals mark the ideal candidate spots for inserting audio description lines.
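The extraction step can be sketched as a small Python wrapper around FFmpeg. The exact flags used in the original script are not documented here, so the mono/16 kHz settings below are assumptions (chosen because they match the input format most ASR models expect); the file paths are illustrative.

```python
import subprocess

def build_extract_cmd(media_path: str, wav_path: str) -> list:
    """Build an FFmpeg command that extracts a mono 16 kHz WAV track
    from a media asset, a common input format for ASR models."""
    return [
        "ffmpeg", "-y",       # -y: overwrite output without prompting
        "-i", media_path,     # input media asset
        "-vn",                # drop the video stream
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        wav_path,
    ]

def extract_audio(media_path: str, wav_path: str) -> None:
    """Run FFmpeg and raise if extraction fails."""
    subprocess.run(build_extract_cmd(media_path, wav_path), check=True)
```

The resulting WAV file is then passed to the ASR stage, which returns timed speech segments; everything between those segments becomes a candidate slot for a description line.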

For media files that already have an accompanying subtitle track, the workflow can generate the draft audio description script directly from the existing file. It does this by inverting the subtitle timings to pinpoint the silences between cues, thereby identifying the spots where descriptions could be inserted.
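Inverting the subtitle timings amounts to walking the sorted cue list and collecting the spans between cues. A minimal sketch, assuming cues arrive as (start, end) pairs in milliseconds and that gaps shorter than a threshold are too brief to host a spoken description (the 1.5 s default is an illustrative choice, not the value from the original script):

```python
def find_gaps(cues, total_ms, min_gap_ms=1500):
    """Invert subtitle cue timings: return (start_ms, end_ms) spans
    where no cue is active and a description line could be inserted.
    `cues` is a list of (start_ms, end_ms) pairs sorted by start time."""
    gaps, cursor = [], 0
    for start, end in cues:
        if start - cursor >= min_gap_ms:   # gap long enough to use
            gaps.append((cursor, start))
        cursor = max(cursor, end)          # tolerate overlapping cues
    if total_ms - cursor >= min_gap_ms:    # trailing silence
        gaps.append((cursor, total_ms))
    return gaps
```

For example, two cues at 1-3 s and 10-12 s in a 20 s clip yield usable gaps at 3-10 s and 12-20 s, while the sub-threshold 0-1 s gap is discarded.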

Scene Analysis and Frame Extraction

Following the audio analysis, the video content is analysed and divided into scenes. A script compares consecutive video frames for similarity to identify distinct scenes, which are then catalogued in a scene list. This list guides the extraction of representative frames, which will be used to generate detailed visual descriptions.
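The frame-comparison idea can be illustrated with a deliberately simplified sketch: treat each frame as a flat list of grayscale pixel values and start a new scene whenever the mean pixel difference between consecutive frames exceeds a threshold. Real implementations (e.g. histogram comparison or a library such as PySceneDetect) are more robust; the threshold value here is an assumption for illustration.

```python
def frame_diff(a, b):
    """Mean absolute pixel difference between two equal-length
    grayscale frames (flattened lists of 0-255 values)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def detect_scene_cuts(frames, threshold=30.0):
    """Return the frame indices that start a new scene; frame 0
    always opens the first scene."""
    cuts = [0]
    for i in range(1, len(frames)):
        if frame_diff(frames[i - 1], frames[i]) > threshold:
            cuts.append(i)
    return cuts
```

The returned indices form the scene list that drives frame extraction in the next step.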

Two extraction strategies are employed: a basic extraction method that processes frames without considering speech intervals and an advanced method that synchronises frame extraction with the draft script, ensuring alignment with the non-speech segments of the audio track. The latter approach is more precise, extracting only those frames that coincide with suitable gaps in the dialogue, thus optimising the description process.
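The advanced strategy needs a rule for choosing which frame represents a given non-speech gap. One simple heuristic, used here purely as an illustration (the original script's selection rule is not specified), is to grab the frame at the midpoint of each gap, so the caption reflects what is on screen while the description plays:

```python
def frames_for_gaps(gaps_ms, fps=25.0):
    """For each non-speech gap (start_ms, end_ms), return the frame
    index at the gap's midpoint, assuming a constant frame rate."""
    picks = []
    for start, end in gaps_ms:
        midpoint_s = (start + end) / 2 / 1000.0
        picks.append(int(midpoint_s * fps))
    return picks
```

At 25 fps, gaps of 0-2 s and 4-6 s map to frames 25 and 125 respectively; only these frames are sent on to the captioning stage.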

Generating and Inserting Image Descriptions

The core of this workflow involves generating image descriptions for each extracted frame. Utilising APIs hosted on platforms like Hugging Face, the system employs various image captioning models, such as ViT with GPT-2 and BLIP by Salesforce, to create initial descriptions. These models analyse the visual content in each frame and generate detailed captions.
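Calling a hosted captioning model can be as simple as POSTing the raw frame bytes to the Hugging Face Inference API, which returns a JSON list containing a `generated_text` field. This sketch uses only the standard library; the model identifier shown is one of the public BLIP checkpoints, and the token is the caller's own API key.

```python
import json
import urllib.request

HF_ENDPOINT = "https://api-inference.huggingface.co/models/{model}"

def caption_url(model_id: str) -> str:
    """Endpoint URL for a hosted image-captioning model."""
    return HF_ENDPOINT.format(model=model_id)

def caption_frame(image_bytes: bytes, model_id: str, token: str) -> str:
    """POST a frame to the Hugging Face Inference API and return the
    generated caption, e.g. for model_id
    'Salesforce/blip-image-captioning-base'."""
    req = urllib.request.Request(
        caption_url(model_id),
        data=image_bytes,
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)[0]["generated_text"]
```

Swapping models is then just a matter of changing `model_id`, which is what makes the multi-model aggregation described next practical.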

Two methods are available for generating these descriptions. The simple method relies on a single model, while the advanced method aggregates outputs from multiple models to form a more comprehensive description. By analysing the most frequent phrases across different models' outputs, the advanced approach attempts to create a balanced and accurate portrayal of each scene.
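One way to realise the "most frequent phrases" idea is simple n-gram voting: keep the word pairs that several models independently produced, and treat those as the details the models agree on. The original aggregation logic is not specified, so this is an illustrative sketch of the technique, not the script's actual code:

```python
from collections import Counter

def frequent_phrases(captions, n=2, min_count=2):
    """Return word n-grams that occur in at least `min_count` of the
    model outputs; each n-gram is counted once per caption so a single
    repetitive model cannot outvote the others."""
    seen = Counter()
    for caption in captions:
        words = caption.lower().split()
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        seen.update(grams)
    return {" ".join(g) for g, c in seen.items() if c >= min_count}
```

Given the outputs "a man riding a horse", "a man on a horse in a field", and "horse in a field", the phrases "a man" and "a horse" survive the vote while single-model details like "man riding" are dropped, giving the advanced method its more balanced portrayal of the scene.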

Since the workflow produces multiple candidate descriptions, I also added an option that reviews the results, rephrases them, and corrects any logical inconsistencies using OpenAI's GPT-3.5 or GPT-4, or Anthropic's Claude. By experimenting with prompts that direct the LLM to act as a quality checker, it was possible to introduce a crude QC step into the workflow.
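The QC step boils down to prompt construction: the candidate descriptions are merged into a user message and paired with a system prompt that casts the LLM as a quality checker. Both the OpenAI and Anthropic chat APIs accept this role/content message shape; the prompt wording below is illustrative, not the one from the original workflow.

```python
QC_SYSTEM_PROMPT = (
    "You are a quality checker for audio description scripts. "
    "Rewrite the candidate descriptions as a single concise, "
    "present-tense sentence and resolve any logical inconsistencies "
    "between them. Do not invent details."
)

def build_qc_messages(candidate_descriptions):
    """Build a chat-style message list asking an LLM to act as a
    quality checker over several candidate descriptions."""
    merged = " / ".join(candidate_descriptions)
    return [
        {"role": "system", "content": QC_SYSTEM_PROMPT},
        {"role": "user", "content": f"Candidate descriptions: {merged}"},
    ]
```

The model's single-sentence reply then replaces the raw multi-model output before it is inserted into the script.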

Final Integration into the Audio Description Script

The final step in generating the audio description script is the integration of the image descriptions, as audio description lines, into the draft script. A script automates this step, inserting each description into the appropriate non-speech slot identified earlier. The final output is a structured subtitle file aligned with the media's timeline, from which the audio description lines can be rendered using speech synthesis. These rendered lines can then be stitched together into a mono WAV file or mixed into the original media.
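Pairing each non-speech slot with its description and serialising the result as SRT-style cues can be sketched as follows (the original script's output format is described only as "a structured subtitle file"; SRT is assumed here as a representative choice):

```python
def ms_to_srt(ms):
    """Format milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def fill_draft(gaps_ms, descriptions):
    """Pair each non-speech gap (start_ms, end_ms) with its generated
    description and emit SRT-formatted cues: the finished script."""
    blocks = []
    for i, ((start, end), text) in enumerate(zip(gaps_ms, descriptions), 1):
        blocks.append(f"{i}\n{ms_to_srt(start)} --> {ms_to_srt(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"
```

Each cue's text is then synthesised to speech and placed at the cue's start time, which is what allows the rendered lines to be stitched into a single WAV or mixed back into the original media.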

Conclusion

While far from being a perfect solution, I believe that the Automated Audio Description Script Generator developed in this research could be considered a step forward in the journey to make visual media more accessible. The system has many limitations, primarily in its ability to consistently generate accurate and contextually rich descriptions. The AI-generated captions often lack the depth and nuance that a human describer can provide, highlighting the gap between the capabilities of the technologies used and the complexities of real-world media content.

I believe that one of the most valuable outcomes of this research is not the creation of a perfect solution but the groundwork it lays for future development. It highlights the feasibility of combining audio and video analysis with AI-driven image captioning to create a more scalable and efficient method of producing audio descriptions. While the descriptions themselves may still require human refinement, the automated process offers a glimpse into how technology could eventually transform this traditionally manual and resource-intensive task.

This project has shown that, even with its imperfections, the approach can begin to fill the gap in audio description availability, providing a base that can be built upon and improved. It is a small but important step toward more inclusive media, allowing audiences to enjoy digital content in an accessible way. Because the system uses widely available, open-source technologies and was built with modularity at its core, I believe its architecture could serve as a base upon which others can build and further improve.
Submitted by miguelaoz on Tue, 07/01/2025 - 16:50