OpenAI Charges by the Minute, So Make the Minutes Shorter
• ~2,000 words • 9 minute read
Want to make OpenAI transcriptions faster and cheaper? Just speed up your audio.
I mean that very literally. Run your audio through ffmpeg at 2x or 3x before transcribing it. You’ll spend fewer tokens and less time waiting with almost no drop in transcription quality.
That’s it!
Here’s a script combining all my favorite little toys and tricks to get the job done. You’ll need yt-dlp, ffmpeg and llm installed.
# Extract the audio from the video
yt-dlp -f 'bestaudio[ext=m4a]' --extract-audio --audio-format m4a -o 'video-audio.m4a' "https://www.youtube.com/watch?v=LCEmiRjPEtQ" -k;
# Create a low-bitrate MP3 version at 3x speed
ffmpeg -i "video-audio.m4a" -filter:a "atempo=3.0" -ac 1 -b:a 64k video-audio-3x.mp3;
# Send it along to OpenAI for a transcription
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header "Authorization: Bearer $OPENAI_API_KEY" \
--header 'Content-Type: multipart/form-data' \
--form file=@video-audio-3x.mp3 \
--form model=gpt-4o-transcribe > video-transcript.txt;
# Get a nice little summary
cat video-transcript.txt | llm --system "Summarize the main points of this talk."
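One small aside on that curl call: the transcription endpoint returns JSON by default. If you'd rather get a plain-text file back, you can add --form response_format=text to the request; as far as I can tell both whisper-1 and gpt-4o-transcribe support it.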
I just saved you time by jumping straight to the point, but read on if you want more of a story about how I accidentally discovered this while trying to summarize a 40-minute talk from Andrej Karpathy.
Also read on if you’re wondering why I didn’t just use the built-in auto-transcription that YouTube provides, though the short answer there is easy: I’m sort of a doofus and thought—incorrectly—it wasn’t available. So I did things the hard way.
I Just Wanted the TL;DW(atch)
A former colleague of mine sent me this talk from Andrej Karpathy about how AI is changing software. I wasn’t familiar with Andrej, but saw he’d worked at Tesla. That, coupled with the talk being part of a Y Combinator series and running 40 minutes, made me think “Ugh. Do I… really want to watch this? Another 'AI is changing everything' talk from the usual suspects, to the usual crowds?”
If ever there were a use case for dumping something into an LLM to get the gist of it and walk away, this felt like it. I respected the person who sent it to me, though, and wanted to do the noble thing: use AI to summarize the talk for me, blindly trust it, and engage with the person pretending I had watched it.
My first instinct was to pipe the transcript into an LLM and get the gist of it. This script is the one I would previously reach for to pull the auto-generated transcripts from YouTube:
yt-dlp --all-subs --skip-download \
--sub-format ttml/vtt/best \
[url]
For some reason though, no subtitles were downloaded. I kept running into an error!
Later, after some head-scratching and rereading the documentation, I realized my version (2025.04.03) was outdated.
Long story short: Updating to the latest version (2025.06.09) fixed it, but for some reason I did not try this before going down a totally different rabbit hole. I guess I got this little write-up and exploration out of it though.
If you care more about summarizing transcripts and less about the vagaries of audio transcription and tokens, this is the correct answer and your off-ramp.
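For completeness, that off-ramp looks roughly like this once yt-dlp is current (a sketch on my part; the output name and language flags are assumptions, not pulled from my shell history):
# Update yt-dlp (pip shown here; use brew or whatever installed it)
pip install -U yt-dlp
# Grab only the auto-generated English captions, skip the video itself
yt-dlp --write-auto-subs --sub-langs en --sub-format vtt --skip-download \
-o 'talk' "https://www.youtube.com/watch?v=LCEmiRjPEtQ"
# Summarize the captions directly
cat talk.en.vtt | llm --system "Summarize the main points of this talk."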
My Transcription Workflow
I already had an old, home-brewed script that would extract the audio from any video URL, pipe it through whisper locally and dump the transcription in a text file.
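It looked something like this (a rough reconstruction rather than the script verbatim, and it assumes the openai-whisper CLI is on your PATH):
# Pull the audio out of whatever video URL gets passed in
yt-dlp -f 'bestaudio[ext=m4a]' --extract-audio --audio-format m4a -o 'talk.m4a' "$1"
# Transcribe it locally and dump a .txt alongside
whisper talk.m4a --model base --output_format txt --output_dir .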
That worked, but I was on dwindling battery power in a coffee shop. Not ideal for longer, local inference, mighty as my M3 MacBook Air still feels to me. I figured I would try offloading it to OpenAI’s API instead. Surely that would be faster?
Testing OpenAI’s Transcription Tools
Okay, the whisper-1 model is still pretty slow, but it gets the job done. Had I opted for the model I knew and moved on, the story might end here.
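For reference, the whisper-1 call is the same transcriptions endpoint as the script up top, just with a different model. One wrinkle: the endpoint also has a 25 MB upload cap, so for a 40-minute talk a low-bitrate mono MP3 is a safer bet than the raw m4a (the filenames here are mine, not from my original script):
# Re-encode at a low bitrate to stay comfortably under the 25 MB upload cap
ffmpeg -i video-audio.m4a -ac 1 -b:a 64k video-audio-1x.mp3;
curl --request POST \
--url https://api.openai.com/v1/audio/transcriptions \
--header "Authorization: Bearer $OPENAI_API_KEY" \
--header 'Content-Type: multipart/form-data' \
--form file=@video-audio-1x.mp3 \
--form model=whisper-1 > video-transcript.txt;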
However, out of curiosity, I went straight for the newer gpt-4o-transcribe model first. It’s built to handle multimodal inputs and promises faster responses.
I quickly hit another roadblock: there’s a 25-minute audio limit and my audio was nearly 40 minutes long.
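(If you’re wondering where your own file lands, ffprobe will print the duration in seconds:)
ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 video-audio.m4a
# → roughly 2372, i.e. just shy of 40 minutes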
Let's Try Something Obvious
At first I thought about trimming the audio to fit somehow, but there wasn’t an obvious 14 minutes to cut. Trimming the beginning and end would give me a minute or so at most.
An interesting, weird idea I thought about for a second but never tried was cutting a chunk or two out of the middle. Maybe I would somehow still have enough info for a relevant summary?
Then it crossed my mind—what if I just sped up the audio before sending it over? People listen to podcasts at accelerated 1-2x speeds all the time.
So I wrote a quick script:
ffmpeg -i video-audio.m4a -filter:a "atempo=2.0" -ac 1 -b:a 64k video-audio-2x.mp3
Ta-da! Now I had something closer to a 20-minute file to send to OpenAI.
I uploaded it and… it worked like a charm! Behold the summary bestowed upon me that gave me enough confidence to reply to my colleague as though I had watched it.
But there was something... interesting here. Did I just stumble across a sort of obvious, straightforward hack? Is everyone in the audio-transcription business already doing this and am I just haphazardly bumbling into their secrets?
I had to dig deeper.
Why This Works: Our Brains Forgive, and So Does AI
There’s an interesting parallel here in my mind with optimizing images. Traditionally you have lossy and lossless file formats. A lossy file format kind of gives away the game in its description—the further you crunch and compact the bytes, the more fidelity you’re going to lose. It works because the human brain just isn’t likely to pick up on the artifacts and imperfections.
But even with a “lossless” file format there are tricks you can lean into that rely on the limits of human perception. One of the primary ways you can do that with a PNG or GIF is reducing the number of unique colors in the palette. You’d be surprised by how often a palette of 64 colors or fewer is actually enough while still being perceived as significantly more.
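(If you’ve never played with this, it’s a one-liner with pngquant or ImageMagick; the filenames here are just placeholders:)
# Quantize a PNG down to a 64-color palette
pngquant 64 --output screenshot-64.png screenshot.png
# Or with ImageMagick
magick screenshot.png -colors 64 screenshot-64.png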
There’s also a parallel in my head between this and the brain’s ability to still comprehend text with spelling mistakes, dropped words and other errors, i.e. transposed letter effects. Our brains have a knack for filling in the gaps, and when you go looking through the world with a magnifying glass you’ll start to notice lots of them.
Speeding up the audio starts to drop the more subtle sounds and occasionally shorter words from the audio, but it doesn’t seem to hurt my ability to comprehend what I’m hearing—even if I do have to focus. These audio transcription models seem to be pretty good at this as well.
Wait—How Far Can I Push This? Does It Actually Save Money?
Turns out yes. OpenAI charges for transcription based on audio tokens, which scale with the duration of the input. Faster audio = fewer seconds = fewer tokens.
Here are some rounded numbers based on the 40-minute audio file breaking down the audio input and text output token costs:
Speed | Duration (seconds) | Audio Input Tokens | Input Token Cost | Output Token Cost |
---|---|---|---|---|
1x (original) | 2,372 | NA (too long) | NA | NA |
2x | 1,186 | 11,856 | $0.07 | $0.02 |
3x | 791 | 7,904 | $0.04 | $0.02 |
That’s a solid 33% price reduction on input tokens at 3x! However, the bulk of your costs for these transcription models is still going to be the output tokens. Those are priced at $10 per 1M tokens whereas audio input tokens are priced at $6 per 1M tokens as of the time of this writing.
Also interesting to note—my output tokens for the 2x and 3x versions were exactly the same: 2,048. This kind of makes sense, I think? To the extent the output tokens are a reflection of the model’s ability to understand and summarize the input, my takeaway is that a “summarized” (i.e. reduced-token) version of the same audio yields the same amount of comprehensibility.
This is also probably a reflection of the 4,096 token ceiling on transcriptions generally when using the gpt-4o-transcribe model. I suspect half the context window is reserved for the output tokens and this is basically reflecting our request using it up in its entirety. My guess is we’d get diminishing results with longer transcriptions.
But back to money.
So the back-of-the-envelope calculation for a single transcription looks something like this:
6 * (audio_input_tokens / 1_000_000) + 10 * (text_output_tokens / 1_000_000);
That does not quite seem to jibe with the estimated cost of $0.006 per minute stated on the pricing page, at least for the 2x speed. That version (19-20 minutes) seemed to cost about $0.09 whereas the 3x version (13 minutes) cost about $0.07 (pretty accurate actually), if I’m adding up the tokens correctly.
# Pricing for 2x speed
6 * (11_856 / 1_000_000) + 10 * (2_048 / 1_000_000) = 0.09
# Pricing for 3x speed
6 * (7_904 / 1_000_000) + 10 * (2_048 / 1_000_000) = 0.07
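Or, as a tiny shell helper if you want to sanity-check your own runs (the rates are hard-coded to today's prices, so adjust as needed):
# Estimated dollar cost given audio-input tokens and text-output tokens
transcription_cost () { echo "scale=4; (6 * $1 + 10 * $2) / 1000000" | bc; }
transcription_cost 11856 2048   # 2x file → .0916
transcription_cost 7904 2048    # 3x file → .0679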
It would seem that estimate isn’t just based on the length of the audio but also some assumptions around how many tokens per minute are going to be generated from a normal speaking cadence.
That’s… kind of fascinating! I wonder how John Moschitta feels about this.
Comparing these costs to whisper-1 is easy because the pricing table more confidently advertises the cost—not “estimated” cost—as a flat $0.006 per minute. I’m assuming that’s minutes of audio processed, not minutes of inference.
The gpt-4o-transcribe model actually compares pretty favorably.
Speed | Duration (seconds) | whisper-1 Cost |
---|---|---|
1x | 2,372 | $0.24 |
2x | 1,186 | $0.12 |
3x | 791 | $0.08 |
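(That column is plain duration math, nothing token-shaped about it:)
# whisper-1 cost = minutes of audio * $0.006
echo "scale=4; (791 / 60) * 0.006" | bc   # 3x file → roughly $0.08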
Does This Save Money?
In short, yes! It’s not particularly rigorous, but it seems like we reduced the cost of transcribing our 40-minute audio file by roughly a quarter, from about $0.09 to $0.07, simply by speeding up the audio.
If we could compare to a 1x version of the audio file trimmed to the 25-minute limit, I bet we could paint an even more impressive picture of cost reduction. We kind of can with the whisper-1 chart. You could make the case this technique reduced costs by 67%!
Is It Accurate?
I don’t know—I didn’t watch it, lol. That was the whole point. And if that answer makes you uncomfortable, buckle-up for this future we're hurtling toward. Boy, howdy.
More helpfully, I didn’t compare word-for-word, but spot checks on the 2x and 3x versions looked solid. 4x speed was too fast—the transcription started getting hilariously weird. So, 2x and 3x seem to be the sweet spot between efficiency and fidelity, though it will obviously depend on how fast the people are speaking in the first place.
Why Not 4x?
When I pushed it to 4x the results became comically unusable.
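(For the record, the 4x version is just the same ffmpeg incantation with the tempo turned up; if your ffmpeg build is old enough that atempo tops out at 2.0 per filter, chain two of them instead:)
ffmpeg -i video-audio.m4a -filter:a "atempo=4.0" -ac 1 -b:a 64k video-audio-4x.mp3
# On older builds: -filter:a "atempo=2.0,atempo=2.0"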
That sure didn't stop my call to summarize from trying though.
Hey, not the worst talk I've been to!
In Summary
In short: to save time and money, consider doubling or tripling the speed of the audio you want to transcribe. The trade-off is, as always, fidelity, but the savings aren’t insignificant.
Simple, fast, and surprisingly effective.
TL;DR
- OpenAI charges for transcriptions based on audio duration (whisper-1) or tokens (gpt-4o-transcribe).
- You can speed up audio with ffmpeg before uploading to save time and money.
- This reduces audio tokens (or duration), lowering your bill.
- 2x or 3x speed works well.
- 4x speed? Probably too much—but fun to try.
If you find problems with my math, have questions, or have found a more rigorous study qualitatively comparing transcriptions at different speeds, please get in touch! Or if you thought this was so cool you want to hire me for something fun...
--Published on Tuesday, June 24th 2025.