Why AI Still Can’t Beat Human Transcribers
A trained transcriptionist rants about transcription software
First, some context:
Before I was a podcast producer, I was a closed captioner for live television (and before you ask, I’m not a fast typist – I was a voice writer, which means repeating the words you hear into a voice model rather than typing them out like a stenographer).
In a recent Vocal Friday newsletter, Jay Cockburn said he didn’t want to be “the rant guy” when it came to slagging off buggy transcription software.
But I am a rant guy, especially about transcription stuff.
I left the captioning industry with a healthy skepticism towards automated AI transcript generators. I’ve used a fair number of transcription tools over the years, and across the board, most of the options out there are mediocre at best. Unfortunately, we live in a capitalistic society that continually pushes everyone to work faster and cheaper, and AI transcript generators undoubtedly have an edge here. Yet companies seem more invested in adding unnecessary features that many users don’t want than in making their core software engines functional.
Now, that’s not to say that I don’t respect what transcription software is capable of: accessibility on the Internet is only alive and well thanks to tools like YouTube’s auto captioning paving the way (and users speaking up when it wasn’t meeting their standards). As a podcast producer, I absolutely use auto transcription tools regularly to help speed up workflows, including generating final transcripts. But I still insist that such transcripts be reviewed by a real human before they go live.
Enter Descript: a text-based magic bullet for audio editing?
One of the more talked-about transcription tools lately is Descript, which translates text edits into non-destructive audio or video edits (honestly, pretty cool and quite useful both theoretically and practically, when it’s done well).
Earlier this year, I was working with another producer who loved Descript and preferred to use it for first pass edits. And frankly, I do think that being able to edit the transcript and then bounce the file in Audition directly is pretty cool. I was paying for a license, but I canceled it a couple of months ago. I stopped using it for the same reason I’ve binned a bunch of other services – I find the transcription engine is really garbage unless your audio is pretty much perfect. And by “perfect,” I mean that it’s been recorded with very little background noise, on high-end equipment – and that it’s from a speaker who speaks clearly, slowly, and usually with a generic North American accent. For any file I’ve worked on where the speaker is ESL – or speaks English as their primary language, but with a non-typical North American accent – Descript generates what I can only describe as gobbledegook.
If it ain’t broke…
For the past few weeks, anyone who uses Descript has been getting a lot of updates – 15 different patch and fix releases since August 8, to be precise. But as Descript continues to roll out updates, it seems to me like they're pivoting away from audio and towards video, which means that podcasters will need to sort through more features to get the functionality they actually need.
Their latest marketing video highlights two new crown jewels: Descript Storyboard, an “intuitive redesign” that introduces new features geared towards content creators (greenscreen, templates, stock footage), and a partnership with OpenAI to source Series C funding.
Does OpenAI do cool stuff? Sure. Do any of the things they do look particularly useful for my job as a podcast producer? Not that I can see, since most of these tools have more to do with coding than audio production. But maybe the $50 million is to make some new tools that better align with Descript’s main functions.
Speak to any editor of any medium and they’ll tell you that timeline-style editing is a huge time suck. There can be an incredibly daunting learning curve for anyone who’s not a professional with hours of experience already banked. But even if you’ve never touched editing software before, you’ve probably worked in some kind of word processor, so the familiarity is already there with the new storyboard function in Descript. As someone who has already had a fair amount of practice in timeline editors, though, I don’t know how much this new visual format really appeals to me…especially because I don’t trust that Descript’s desire to make things easier translates to keeping the same level of quality that I know I can get through a proper digital audio workstation.
Descript Storyboard has actually been around for a few months now, and they’ve been trying to get podcasters on board with blog posts like this one from July titled “Descript Storyboard: What’s new for podcasters.” I took one look at Descript Storyboard prior to this latest update and decided that it was not the tool for me, though I must admit that I was kind of wowed by the new bells and whistles. But it seemed geared more towards video production than audio. And it doesn’t solve any of the complaints I already had about the core of the software.
Does Silicon Valley understand what audio producers actually need?
I stand by my knee-jerk reaction: this is not a tool for podcasters, and they are not growing in a direction that is helpful for our industry. I think this is the most telling quote from the video, narrated by CEO and founder of Descript, Andrew Mason: “[We’re] really trying to just reimagine the way that video editors are designed so they’re no longer a tool for creative professionals, exclusively, but they’re in everybody’s communication toolkit right there next to [Google] Docs and Slides.”
It seems like they are definitely trying to pivot to be more of a video editor than an audio editor. But at the same time, they want to be ubiquitous. These, to me, are two of the worst things that you can offer podcasters. Our whole industry is built on being audio-first and unique. I don’t want a ubiquitous tool; I already pay Microsoft an absurd amount of money every year (don’t even get me started on the capitalistic evil that is subscription versus standing licenses for software) for, pardon my French, shit that barely works. While I’m prepared to pay for tools that help me do my highly specific job better and easier, seeing companies trying to make my job faster and cheaper just kind of sucks. It definitely makes me not want to give them my money.
When I started as a closed captioner, I was hired because I had an English degree (which was a requirement no one else was really looking for at that point in my city, and trust me, I checked). English majors usually make for good captioners because they have a supposedly better-than-average understanding of one of the most complex and frustrating components of language: grammar and punctuation. When you speak, you’re not thinking about where you should be putting a period or a comma, but when you’re converting spoken language into written language, you must supply all those missing formatting details in order to make it readable. In captioning, knowing where to put punctuation is vital because that’s how you break up lines of text, and especially to a caption reader who is deaf or hearing impaired, having those details is incredibly important for comprehension.
Putting AI transcription software to the producer Turing test
So, why does all that matter for podcasting?
Well, it’s frustrating to see Descript’s marketing materials because it almost comes off as false advertising. In the promo video for Storyboard, Mason’s words scroll out in real time, with perfect punctuation, as if that’s how it was generated in the software. But I took the audio from the video and actually ran it through Descript myself, and here’s the worst segment of what I got:
The image is in night mode because it was night when I screen grabbed it. While I appreciate the new night mode update, it would have been even cooler if I could figure out how to toggle it off for this exact scenario.
I still do quality control for captions, and if I apply the same rules here, scoring the real-time transcript straight from Descript against the verbatim transcript that I edited, it comes out at 93.47%.
This is an example of what the side-by-side comparison for the first page of this file looks like in this scoring system - view the whole file here.
Technically, on any other standard scoring system this real-time transcript would be an A, but by the CRTC’s own standards for captions and transcripts, under this scoring system anything under 95% is considered substandard and therefore non-passing. Maybe that’s not a big deal to you if you’re just using this to edit the audio file and won’t be publishing the transcript, but think about these implications:
1) Andrew Mason is a typical North American male speaker. If HIS voice with super clean audio is 93.47% accurate, what level of quality are we expecting for anyone whose voice is not male and North American?
2) Descript’s whole thing is that they want things to be easy, and they’re implying in their videos that the transcripts are pretty well perfect when they’re just irrefutably not. I’ve seen far too many podcasts that think a rough AI-generated transcript is better than nothing, when it really is not: the two biggest things AI struggles with in translating spoken language to written language are proper nouns and grammar. If you’re using transcripts for SEO, then the proper nouns are pretty important, and if you’re using transcripts for accessibility, then grammar is pretty important. And really, you should be using transcripts for both to get the most bang for your buck and to be respectful to your audience of all manners of ability.
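For anyone curious how a percentage like the 93.47% above can come about, here is a rough Python sketch of a generic word-level accuracy score. To be clear, this is not the scoring system I use for caption QC and it is not the CRTC’s method: it is a plain edit-distance comparison between a reference transcript and an automated one, and the function names (tokenize, edit_distance, word_accuracy) and sample sentences are made up for illustration. The only figure borrowed from this post is the 95% pass mark. Punctuation marks are counted as tokens on purpose, since missing punctuation is exactly the kind of error that hurts readability.

```python
import re

# Pass mark quoted in this post; everything else here is an illustrative assumption.
PASS_THRESHOLD = 95.0


def tokenize(text: str) -> list[str]:
    """Split text into lowercase words plus standalone punctuation marks."""
    return re.findall(r"[a-z0-9'’]+|[.,;:!?]", text.lower())


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Minimum number of token insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # token missing from hypothesis
                            curr[j - 1] + 1,      # extra token in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as a percentage of reference tokens, floored at zero."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    errors = edit_distance(ref, hyp)
    return max(0.0, 1.0 - errors / len(ref)) * 100


if __name__ == "__main__":
    # Invented sample sentences, not taken from any real transcript.
    verbatim = "So, we rebuilt the editor. It works like a doc."
    automated = "so we rebuilt the editor it works like a dock"
    score = word_accuracy(verbatim, automated)
    verdict = "pass" if score >= PASS_THRESHOLD else "substandard"
    print(f"{score:.2f}% ({verdict})")
```

A real caption rubric weighs different error types differently, so treat this as a way to build intuition rather than a way to reproduce an official score.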
This is not a problem I’m trying to pin on Descript specifically – there are lots of problems with AI voice recognition and transcript generation on the whole, from gender and race bias to security issues (looking at you, Otter.ai). But it’s frustrating to see companies who already aren’t doing the core thing they’re touting continue to add on other services that aren’t really helpful to what I thought was a fairly large segment of their user base.
Clearly, Descript thinks it is in fact better to be a jack of all trades even if they’re a master of none. Personally, I would be thrilled to pay for a robust, smart, secure master-of-one transcription software if I knew I could get better quality transcripts even when my speaker falls outside the conventional standards for “Good Audio.”
Screenshot from the Temi website - they’re another example of a transcription software, and the fact that they characterize accents as “Difficult Audio” is problematic, to say the least.
There’s still hope, though – I’ve been a long-time Audition fan as my go-to DAW, and Adobe has just begun rolling out the beta of Adobe Podcast (previously Project Shasta). From the first looks on their website they’re certainly taking a leaf out of Descript’s book in terms of using transcripts as the base of the software with the timeline function working beneath the document surface, but I’m hopeful that it might actually be designed for podcasters.
Only time will tell if their AI transcription engine actually works (Adobe, if you’re reading this, I’ll happily test drive the beta and be ruthless about my feedback). But if they are anything like Descript, they’ll probably make money hand over fist, regardless of their actual quality.
Unfortunately, my advice is to complain to Descript about this, but since that will likely not go anywhere, I’d also consider spending your software money elsewhere.
I know that transcription software will likely never be good enough to fully replace humans. But I’m sick of paying for Swiss-Army knife software trying to be the everything tool for the “everyday professional”, because for technical jobs like editing this is the definition of oxymoronic. I don’t actually want the extra features companies like Descript keep developing: I just want a half-decent transcription engine that I can rely on, and that is good enough (it doesn’t even need to be perfect) to warrant what I’m paying for it.