Paste any video URL or upload a file. Dokitscript transcribes it into text — then click Listen to hear the transcript as a natural AI voice MP3, powered by ElevenLabs.
TikTok · Instagram · YouTube · Facebook · X · LinkedIn · Last updated June 2026
Try text to speech free →
How do I turn a video transcript into speech?
Paste the video URL (or upload a file) into Dokitscript and wait for the transcript. The tool uses OpenAI Whisper to transcribe the spoken audio in 90+ languages and produce a full written transcript. Once the transcript is ready, click the Listen button: Dokitscript uses ElevenLabs' eleven_multilingual_v2 model to read the text aloud with a natural AI voice and generates a downloadable 128 kbps MP3. You can also translate the transcript into another language first, then generate audio in that language. Audio generation is available in approximately 29 languages and requires the Starter plan or higher ($4.99/month).
How it works
No software to install. Works entirely in your browser.
Paste a TikTok, Instagram, YouTube, Facebook, X, or LinkedIn video URL — or upload an audio or video file up to 50 MB (MP3, WAV, M4A, MP4, WebM).
Dokitscript transcribes the video using OpenAI Whisper. The spoken language is auto-detected — or you can select it manually. You get a timestamped, readable transcript.
Use the AI Translation feature to translate the transcript into another language before generating audio. Or skip this step and generate audio in the original language.
ElevenLabs reads the transcript text aloud with a natural AI voice. Download the result as a 128 kbps MP3 and listen anywhere.
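For readers who think in code, the steps above can be sketched as a small pipeline. This is only an illustration of the order of operations, not Dokitscript's actual implementation: `video_to_speech`, `Result`, and the three callables passed in are hypothetical names, and the real Whisper, translation, and ElevenLabs calls are left to whatever functions you plug in.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    transcript: str  # the text that was voiced (translated if a target language was given)
    mp3: bytes       # audio bytes returned by the text-to-speech step


def video_to_speech(
    source: str,
    transcribe: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    translate: Optional[Callable[[str, str], str]] = None,
    target_language: Optional[str] = None,
) -> Result:
    """Run transcribe -> (optional) translate -> synthesize, in that order."""
    text = transcribe(source)
    # Translation is optional: skip it to generate audio in the original language.
    if translate is not None and target_language is not None:
        text = translate(text, target_language)
    return Result(transcript=text, mp3=synthesize(text))
```

Passing the three steps in as functions keeps the sketch independent of any particular API client, so each step can be swapped or stubbed out on its own.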
Features
The source is always the transcript — not arbitrary text you type in.
Unlike generic TTS tools where you paste any text, the source here is the actual transcript Dokitscript extracts from your video. The speech stays faithful to the original content.
Audio is generated with ElevenLabs' eleven_multilingual_v2 model — one of the most natural-sounding multilingual AI voices available, across approximately 29 languages.
OpenAI Whisper handles the speech-to-text step. It auto-detects the spoken language and supports over 90 languages, so you can work with content from anywhere.
The built-in AI Translation feature (powered by Claude AI) lets you translate the transcript into a target language first — then generate the audio in that language, not the original.
The audio output is a standard MP3 file you can save, share, add to a podcast, import into a video editor, or use in any accessibility workflow.
Paste a URL from TikTok, Instagram Reels, YouTube (including Shorts), Facebook, X (Twitter) or LinkedIn. File upload also supported for local recordings.
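If you prepare local recordings in bulk, a quick pre-check against the upload limits stated under How it works (up to 50 MB; MP3, WAV, M4A, MP4, WebM) can save a failed upload. The sketch below is a hypothetical helper, not part of Dokitscript: `check_upload` is an illustrative name, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4", ".webm"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" interpreted as 50 MiB


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file is larger than 50 MB")
    return problems
```

For example, `check_upload("talk.MP4", 10_000_000)` returns an empty list, while a `.txt` file or a 60 MB recording each yields one problem message.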
Languages
Transcription and audio generation cover different language sets — here is the honest breakdown.
Dokitscript can transcribe speech in over 90 languages, including English, French, Spanish, Arabic, Chinese, Hindi, Japanese, Korean, Portuguese, German, Italian, and many more. The spoken language is detected automatically.
The MP3 voice output is powered by ElevenLabs and currently supports approximately 29 languages.
Note: transcription supports 90+ languages; audio generation supports ~29. If your target language is not in the audio list, you will still get the full translated text transcript.
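The rule in the note can be expressed as a simple lookup. This is a rough sketch under stated assumptions: `plan_outputs` is a hypothetical function, and `AUDIO_LANGUAGES` is a small placeholder subset used for illustration, not the real list of roughly 29 audio languages.

```python
# Placeholder subset for illustration only; the full audio language list is not reproduced here.
AUDIO_LANGUAGES = {"en", "fr", "es", "de"}


def plan_outputs(target_language: str, audio_languages: set[str] = AUDIO_LANGUAGES) -> dict:
    """Decide which outputs are produced for a given target language code."""
    return {
        "transcript": True,  # the text transcript is always delivered
        "audio": target_language.lower() in audio_languages,  # MP3 only if the language has a voice
    }
```

The transcript flag is always true, which mirrors the note: a language without audio support still gets the full translated text.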
Use cases
Anywhere a written transcript needs to be heard, not just read.
You transcribed a long interview or lecture but don't have time to read it. Generate an MP3 from the transcript and listen during your commute: the same content in audio format.
Turn a video's text subtitles into a spoken audio file. Useful for creating audio descriptions, narration tracks, or accessible versions of visual content.
Transcribe a foreign-language video, read the transcript, then listen to the AI-voiced MP3 to train your pronunciation and comprehension with authentic material.
Give visually impaired users or anyone who prefers audio an MP3 version of your video's transcript — without needing to record a separate narration.
Transcribe a video in French, translate the transcript into English, then generate an English MP3. The spoken audio reflects the translated text of the original video.
Transcribe an existing video to extract its script, refine the text in the transcript editor, then generate an AI-voiced MP3 as a draft before a professional recording.
Plans
Transcription and translation are available on every plan. Audio generation from transcript requires Starter or higher.
| Plan | Price | Transcriptions | Max video length | Audio from transcript (MP3) |
|---|---|---|---|---|
| Free | $0 | 5 / month | 3 minutes | Not available |
| Starter | $4.99 / mo | 200 / month | 8 minutes | 6 min / month |
| Pro | $14.99 / mo | Unlimited | 45 minutes | 60 min / month |
| Business | $79.99 / mo | Unlimited | 5 hours | 240 min / month |
Audio minutes are counted per generated MP3. Unused minutes do not roll over. See full pricing →
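To see how the monthly audio allowance plays out, the table's last column can be turned into a quick check. This is a minimal sketch, assuming each MP3 is charged at its exact duration with no rounding; how Dokitscript actually meters minutes is not stated here, and `can_generate` is a hypothetical name.

```python
# Monthly audio allowance in minutes, taken from the plans table above.
AUDIO_MINUTES_PER_MONTH = {"free": 0, "starter": 6, "pro": 60, "business": 240}


def can_generate(plan: str, used_minutes: float, mp3_minutes: float) -> bool:
    """True if an MP3 of mp3_minutes still fits in this month's allowance."""
    allowance = AUDIO_MINUTES_PER_MONTH[plan.lower()]
    # Assumption: usage is the plain sum of MP3 durations generated this month.
    return used_minutes + mp3_minutes <= allowance
```

On Starter, for instance, a 2-minute MP3 fits after 4 minutes of use, but a 2.5-minute one does not. Because unused minutes do not roll over, `used_minutes` would start again from zero each month.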
Free to start. Audio generation from $4.99/month. No software needed.
Get started free →