Blog post Part of special issue: Beyond ‘navel gazing’: Autoethnography as a catalyst for change

Challenging the narrative: Auto-captioning and diversity in higher education speech recognition

Claire Tupling, Research & Development Manager at AQA Education

In this blog post, I outline key findings from my autoethnographic study examining the ways auto-generated captioning technologies (such as Panopto) in higher education (HE) transcribe my stammered speech. Stammering or stuttering is a speech difference characterised by disruption in the ongoing flow of speech, including repetitions, blocks or prolongations. The medical model regards stammering as a defect that requires fixing, while the social model is increasingly used to argue that stammering is a difference, requires no fixing and that people who stammer (pws) are disabled by an ableist environment. One such environment is HE.

By using my own experiences of stammering and auto-captioning as data, I aim to challenge the dominant narrative that auto-captioning of lectures in higher education institutions (HEIs) is inherently inclusive. I call for diversity in speech to be recognised by the AI datasets that drive auto-captioning.

The agency of speech

Speech ‘has occupied a dignified position within the humanist lineage’ (St. Pierre, 2015, p. 330) and might be assumed to be essential for an HE lecturer: for teaching and communication, as well as for identity formation. However, this assumption risks marginalising HE lecturers without speech and those with non-normative speech patterns. Rather than seeing speech as representative of the human, a post-humanist approach regards speech as agentic, produced as a result of the human interacting with other human and non-human actors (Mazzei, 2013). Stammering offers an insight into how speech is agentic. Sheehan (1970, p. 4) described how ‘it takes two to stutter’, meaning that stammering is produced through interaction with human and non-human interlocutors, such as voice-recognition technologies. Together, these interlocutors are agentic in producing stammering, which can be regarded as ‘a social phenomenon’ (Acton & Hird, 2004).

Interrogating auto-captions

In this project I adopted an autoethnographic methodology, drawing on post-humanist approaches to understand the relationship between myself and auto-captions. Following Birinsi and Simmons (2021), I saw myself as relational to the auto-captioning process. I reviewed presentations I had created with lecture-capturing tools in my previous role as a senior lecturer. I compared my dysfluencies (including facial movements and audible stammering) with the auto-generated captions, focusing on moments where the captions misrepresented my words.

The examples selected here illustrate key moments of identity formation where captions misrepresented my speech, reducing it to mimetic features. The first example is the captioning of me saying ‘hello’, a word I can rarely say fluently:

‘Ho, ho, ho. Hello, every one.’

While the captioning does correctly identify ‘hello’, it has preceded this with ‘ho ho ho’ which replicates the struggle I experience with saying ‘hello’.

My name is captioned in multiple ways – the first in the following list arguably falls within a tolerable level of miscaptioning; however, the later examples do little to convey my actual name:

  • Claire Tuppling
  • Klatt Toppling
  • Klatt to Topline
  • Clad Topline Ling
  • apply to top billing

To see how this looks in action you can watch recordings of my speech and captioning on YouTube.

These examples ‘are reminders that I am not normal’ (Morella-Pozzi, 2014, p. 180). Further, additional labour is required by the lecturer to correct captions to make them accessible to students. This study highlights the need for educational uses of AI-generated captions to aim towards ‘representational practice’ (Bruti & Zanotti, 2018). Autoethnography was critical to interrogating auto-captions, revealing them as inherently biased towards fluent speech. More than just errors, they fail to represent the diversity of human speech.


References

Acton, C., & Hird, M. (2004). Toward a sociology of stammering. Sociology, 38(3), 495–513. https://doi.org/10.1177/0038038504043215

Birinsi, T., & Simmons, J. (2021). Posthumanist autoethnography. In E. Adams, S. Holman Jones, & C. Ellis (Eds.), Handbook of autoethnography. Routledge.

Bruti, S., & Zanotti, S. (2018). Representations of stuttering in subtitling: A view from a corpus of English language films. In I. Ranzato & S. Zanotti (Eds.), Linguistic and cultural representation in audiovisual translation (pp. 228–262). Routledge.

Mazzei, L. (2013). A voice without organs: Interviewing in post-humanist research. International Journal of Qualitative Studies in Education, 26(6), 732–740. https://doi.org/10.1080/09518398.2013.788761

Morella-Pozzi, D. (2014). The (dis)ability double life: Exploring legitimacy, illegitimacy, and the terrible dichotomy of (dis)ability in higher education. In R. Boylorn & M. Orbe (Eds.), Critical autoethnography: Intersecting cultural identities in everyday life. Routledge.

Sheehan, J. G. (1970). Stuttering: Research and therapy. Harper & Row.

St. Pierre, J. (2015). Cripping communication: Speech, disability, and exclusion in liberal humanist and posthumanist discourse. Communication Theory, 25(3), 330–348. https://doi.org/10.1111/comt.12054