How BBC Sounds is using GenAI to boost accessibility in audio

A new BBC Sounds pilot, running on OpenAI's Whisper speech recognition model, explores how generative AI might automate subtitles for the 27,000 hours of audio content the broadcaster produces each month


As a public body, the BBC’s Royal Charter states its commitment to serving everyone in the UK – including people with accessibility needs. It was these public obligations that led the BBC to become the first broadcaster to introduce closed captioning in its programming via the Ceefax Teletext service in 1979.

Accessibility in tech has come a long way since then, when remote controls had Teletext buttons and punching ‘888’ into the device would summon subtitles on your TV. Every streaming platform now offers the service.

It’s important we get it right, rather than do it as quickly as we can

But, despite podcasts attracting an audience of half a billion people annually, closed captioning for audio content is far less common. The BBC produces 27,000 hours of audio content a month, but much of this remains inaccessible to those with hearing impairments.

The product team at BBC Sounds – the streaming service for BBC radio and podcasts – has been exploring ways to address this issue and integrate subtitles into its platform for some time, according to senior product manager Sam Barns.

Discussions with colleagues revealed broad interest in the feature, not only as an accessibility tool but also as a resource for those interested in learning languages.

Getting the pilot off the ground

Barns and her team began by reviewing the tools already at the BBC’s disposal, so as to avoid “reinventing the wheel”. The team didn’t want to build a full-scale solution until they were confident the pilot would be used by enough people.

“We wanted to understand how we might test this at a small scale and that’s when we started speaking to R&D and accessibility and other tooling partners throughout the business,” says Barns.

It wasn’t until advances in generative AI – namely OpenAI’s automatic speech recognition software, Whisper, released in late 2022 – that the R&D team found a solution.

R&D had been monitoring the quality of generative AI systems for some time, says Andrew McParland, principal research and development engineer in the BBC’s R&D department. But they noticed a “step-change” in performance when using Whisper for transcriptions. “The output was so much better and it could be run at scale,” McParland says.

The resulting application combined pre-existing BBC speech-to-text technology with third-party tools. Because Whisper models are freely available for anyone to use, the BBC was able to run them alongside its own technology and add features such as speaker diarisation, which identifies the different speakers in a single piece of content.
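For readers curious about the mechanics, the sketch below shows how the openly available Whisper package can turn an audio file into timestamped text. It is purely illustrative: the model size, file name and surrounding pipeline are assumptions, and the BBC's own diarisation and tooling are not public.

```python
# A minimal sketch using the open-source openai-whisper package.
# Model size, file name and output handling are illustrative; the
# BBC's own pipeline and speaker-diarisation tooling are not public.
import whisper

model = whisper.load_model("medium")           # smaller models trade accuracy for speed
result = model.transcribe("episode_audio.mp3")

# Whisper returns timestamped segments alongside the full text, which
# is what makes subtitles (rather than a plain transcript) possible.
for segment in result["segments"]:
    print(f"{segment['start']:7.2f}s -> {segment['end']:7.2f}s  {segment['text'].strip()}")
```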

The BBC has recently started trialling transcription and subtitle services on BBC Sounds; they are being used internally at the BBC and by a small section of the public for certain programmes. The long-running BBC Radio 4 show The Archers is next in line for captioning.

How BBC Sounds transcription works

This is how it works: anyone who wants to generate a transcript enters a unique identifier for the content they’re looking for. The corresponding audio is downloaded and processed on a machine in the cloud. A transcript is produced, which the user can then feed into software that allows them to review or edit the document.

Finally, the transcript is published in a standardised format for use in subtitle players. This also enables it to be easily adapted for other applications.
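The article doesn't name the exact format the BBC publishes, but that final step can be sketched with WebVTT, one widely used standard for subtitle players. The segment structure below is an assumption based on Whisper-style timestamped output.

```python
# Sketch: converting timestamped segments into WebVTT, one widely used
# standardised subtitle format. The format the BBC actually publishes
# is not named in the article, so WebVTT here is an assumption.
def to_timestamp(seconds: float) -> str:
    """Format seconds as an HH:MM:SS.mmm WebVTT timestamp."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

def segments_to_vtt(segments) -> str:
    """Build a WebVTT document from segments with start, end and text."""
    lines = ["WEBVTT", ""]
    for seg in segments:
        lines.append(f"{to_timestamp(seg['start'])} --> {to_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")                       # blank line closes each cue
    return "\n".join(lines)
```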

Users experience the software differently depending on the platform. The Sounds web application shows text as it’s being spoken. In the mobile app, meanwhile, the team has built ‘Time Transcript’, which functions similarly to Spotify’s lyrics feature: captions appear as they’re spoken, but the user can also scroll through them at their own pace.
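Under the hood, both experiences need the same basic lookup: given a playback position, find the caption that should be highlighted. The snippet below is an illustration only, not the BBC's implementation, and assumes cues are sorted by start time.

```python
# Illustration only: neither player's implementation is public.
# Given subtitle cues sorted by start time, find the one that covers
# the current playback position - the lookup both a live caption view
# and a scrollable, Spotify-style transcript rely on.
from bisect import bisect_right

def active_cue_index(cues, playback_seconds):
    """Return the index of the cue covering playback_seconds, or None."""
    starts = [cue["start"] for cue in cues]
    i = bisect_right(starts, playback_seconds) - 1
    if i >= 0 and playback_seconds < cues[i]["end"]:
        return i
    return None                                # between cues, or before the first one
```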

Proceed with caution

While the BBC is experimenting with AI, its public obligations mean it has to prioritise accuracy, trust and transparency over speed. As part of that mission, the organisation has instructed that all use of generative AI must keep a human “in the loop”. “We’re taking a very careful, cautious approach,” says McParland. “It’s important we make sure we get it right, rather than do it as quickly and easily as we can.”

AI is prone to making mistakes, some of which could be reputationally damaging. This means the transcripts still need to be checked manually, for the time being.

“Having that human in the loop and editing the generated transcript allowed us to meet the accuracy rate of the original transcript, so we now understand the percentage of errors that we can expect to see on our content if the transcript goes straight out to the audience,” says Barns. This has allowed the product team to determine which content is low-risk and could be transcribed by AI without the need for manual checks.

Building reference data

Refining the tool to make the process more efficient remains a challenge. Currently, there’s a delay in publishing the transcripts because the team has to wait for the audio files before subtitles can be generated. “We want to get to a stage where the transcript is ready at the same time as the audio,” Barns says.

Although time-consuming, manual testing has proved useful for creating reference data for future generative AI tools. The current workflow is not ideal, but having people review the transcripts has shown where there’s room for improvement.

“We’ve got reference transcripts of what they should look like and what we’ve produced,” says McParland. “So if we do any testing in future, we know what good looks like.”
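One common way to express “what good looks like” is word error rate: the number of word-level edits needed to turn the generated transcript into the reference, divided by the reference length. The sketch below is a generic illustration; the BBC's actual evaluation tooling and thresholds are not public.

```python
# Sketch: scoring a generated transcript against a human-checked
# reference using word error rate. Pure Python; the BBC's actual
# evaluation tooling and acceptable-error thresholds are not public.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    dp = list(range(len(hyp) + 1))             # distance of empty reference vs hyp[:j]
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution / match
    return dp[len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("the archers omnibus edition",
#                      "the archer omnibus edition") == 0.25
```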

This approach will position the BBC’s product team well to evaluate its audio captioning system, as well as any other generative AI tools it chooses to trial.