Video to Speech: Turn Any Video Into Audio in Another Language

How do I turn a video into audio in another language? Paste the video URL (or upload a file) into Dokitscript, wait for the transcript, use the AI Translation feature to translate the text into your target language, then click Listen. Dokitscript uses ElevenLabs' eleven_multilingual_v2 model to generate a natural AI voice and produce a downloadable 128 kbps MP3 file. Transcription runs on OpenAI Whisper and supports 90+ languages; audio generation is available in approximately 29 languages and requires the Starter plan or higher.

How it works

How do I convert a video to speech in 4 steps?

No software to install. Works entirely in your browser.

1

Paste a URL or upload a file

Paste a TikTok, Instagram, YouTube, Facebook, X, or LinkedIn video URL — or upload an audio or video file up to 50 MB.

2

Transcription with OpenAI Whisper

Dokitscript transcribes the video in 90+ languages. The spoken language is auto-detected, or you can select it manually.

3

Translation into your target language

Use the AI Translation feature to translate the transcript into French, Spanish, Japanese, German, or any other supported language.

4

Click Listen — download your MP3

ElevenLabs generates a natural AI voice reading the translated text. Download the result as a 128 kbps MP3 file.
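The four steps above can be sketched as a simple pipeline. This is a minimal illustration, not Dokitscript's actual code: the function and class names are hypothetical, and each stage is passed in as a plain function so the example runs without calling Whisper, the translation model, or ElevenLabs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechResult:
    transcript: str    # text produced by the transcription step
    translation: str   # transcript translated into the target language
    mp3_bytes: bytes   # audio produced by the text-to-speech step


def video_to_speech(
    source: str,
    target_language: str,
    transcribe: Callable[[str], str],
    translate: Callable[[str, str], str],
    synthesize: Callable[[str], bytes],
) -> SpeechResult:
    """Run the steps in order; every stage is a caller-supplied function."""
    transcript = transcribe(source)                       # step 2
    translation = translate(transcript, target_language)  # step 3
    mp3_bytes = synthesize(translation)                   # step 4
    return SpeechResult(transcript, translation, mp3_bytes)


# Stand-in stages (hypothetical) so the sketch runs with no network access.
def fake_transcribe(source: str) -> str:
    return "hello world"


def fake_translate(text: str, language: str) -> str:
    return f"[{language}] {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


result = video_to_speech(
    "https://example.com/video", "fr",
    fake_transcribe, fake_translate, fake_synthesize,
)
print(result.translation)
```

Keeping the transcript and the translation in the result mirrors the fact that the text is available alongside the MP3.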

Features

What does video to speech include?

Everything from URL to MP3, in one tool.

🎙️

Natural AI voice via ElevenLabs

Audio is generated with ElevenLabs' eleven_multilingual_v2 model, a multilingual text-to-speech model built for natural-sounding speech.

🌍

Transcription in 90+ languages

OpenAI Whisper handles the speech-to-text step. It auto-detects the spoken language and supports over 90 languages for transcription.

🔤

AI translation built in

The translation step runs on Claude AI and produces natural-sounding translated text before it is converted to speech.

⬇️

Downloadable MP3 at 128 kbps

The audio output is a standard MP3 file you can download and use in podcasts, video editors, language learning apps, or accessibility tools.

🔗

All major platforms

Paste a URL from TikTok, Instagram Reels, YouTube Shorts, YouTube, Facebook, X (Twitter) or LinkedIn. File upload also works for local recordings.

📝

Text transcript included

You always get the full written transcript and the translated text alongside the MP3. Export as TXT or SRT at any time.
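For readers unfamiliar with the SRT export mentioned above, here is a small sketch of how timed transcript segments map to SRT text. The segment shape (start seconds, end seconds, text) is an assumption made for illustration; how Dokitscript builds its own export is not described here.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS,mmm (SRT uses a comma before milliseconds)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Turn (start, end, text) segments into numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)


print(to_srt([(0.0, 1.5, "Hello"), (1.5, 3.0, "World")]))
```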

Languages

Which languages are supported for audio generation?

Transcription and audio generation cover different language sets — here is the honest breakdown.

Transcription — 90+ languages (OpenAI Whisper)

Dokitscript can transcribe speech in over 90 languages, including English, French, Spanish, Arabic, Chinese, Hindi, Japanese, Korean, Portuguese, German, Italian, and many more. The spoken language is detected automatically.

Audio generation — ~29 languages (ElevenLabs)

The MP3 voice output is powered by ElevenLabs and currently supports approximately 29 languages:

English, French, Spanish, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Korean, Hindi, Indonesian, Filipino, Swedish, Bulgarian, Romanian, Greek, Finnish, Croatian, Slovak, Danish, Tamil, Ukrainian

Note: transcription supports 90+ languages; audio generation supports ~29. If your target language is not in the audio list, you will still get the translated text transcript.
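The rule in the note above (text always, MP3 only for languages in the audio list) can be expressed as a small lookup. This is an illustrative sketch: the function name and output labels are hypothetical, and the set contains only the languages listed on this page.

```python
# Languages listed above for audio generation (as named on this page).
AUDIO_LANGUAGES = frozenset({
    "English", "French", "Spanish", "German", "Italian", "Portuguese",
    "Polish", "Turkish", "Russian", "Dutch", "Czech", "Arabic", "Chinese",
    "Japanese", "Korean", "Hindi", "Indonesian", "Filipino", "Swedish",
    "Bulgarian", "Romanian", "Greek", "Finnish", "Croatian", "Slovak",
    "Danish", "Tamil", "Ukrainian",
})


def available_outputs(target_language: str) -> list[str]:
    """Return which outputs to expect for a target language."""
    outputs = ["translated_text"]  # the translated transcript is always returned
    if target_language.strip().title() in AUDIO_LANGUAGES:
        outputs.append("mp3")      # audio only for languages in the audio list
    return outputs


print(available_outputs("french"))
print(available_outputs("Swahili"))
```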

Use cases

Who uses video to speech?

Anywhere spoken content needs to reach a different language audience.

Content repurposing

Turn a TikTok or Instagram Reel into a voiceover in another language. Great for creators who want to reach international audiences without re-recording.

Language learning

Transcribe a video in a foreign language, translate it, and listen to the MP3 to train your ear. Useful for students and self-learners working with real content.

Accessibility

Convert a written article or transcript into an audio file for users with visual impairments, or for commuters who prefer to listen rather than read.

Podcast production

Translate an episode into a second language and generate a voiceover track. Add it as a bonus episode for your international audience.

Training & education

Convert recorded lessons or company training videos into audio files in multiple languages for teams across different countries.

Voiceover drafts

Get an AI-voiced MP3 as a scratch track for video projects before hiring a voice actor, saving time in early production stages.

What video to speech does not do: It does not replace or dub audio within the original video file, synchronize the generated voice to on-screen lips (lip-sync), clone the original speaker's voice, or offer a choice between multiple AI voices. The output is a standalone MP3 audio file — a voiceover, not a dubbed video.

Plans

How many audio minutes do I get?

Transcription and translation are available on every plan. Audio generation requires Starter or higher.

Plan	Price	Transcriptions	Max video length	Audio generation (MP3)
Free	$0	5 / month	3 minutes	Not available
Starter	$4.99 / mo	200 / month	8 minutes	6 min / month
Pro	$14.99 / mo	Unlimited	45 minutes	60 min / month
Business	$79.99 / mo	Unlimited	5 hours	240 min / month

Audio minutes are counted per generated MP3. Unused minutes do not roll over.
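To make the allowance arithmetic concrete, the sketch below encodes the plan table and checks whether another MP3 fits in the current month. The names are hypothetical, and it assumes minutes are counted by the length of the generated MP3 with no rounding; the page does not say how partial minutes are billed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    transcriptions_per_month: Optional[int]  # None means unlimited
    max_video_minutes: int
    audio_minutes_per_month: int             # 0 means audio is not available


# Values copied from the plan table above (5 hours = 300 minutes).
PLANS = {
    "Free": Plan(5, 3, 0),
    "Starter": Plan(200, 8, 6),
    "Pro": Plan(None, 45, 60),
    "Business": Plan(None, 300, 240),
}


def remaining_audio_minutes(plan_name: str, minutes_used: float) -> float:
    """Audio minutes left this month; never negative."""
    allowance = PLANS[plan_name].audio_minutes_per_month
    return max(allowance - minutes_used, 0)


def can_generate_audio(plan_name: str, mp3_minutes: float, minutes_used: float) -> bool:
    """True if an MP3 of the given length fits in what is left this month."""
    return mp3_minutes <= remaining_audio_minutes(plan_name, minutes_used)


print(remaining_audio_minutes("Starter", 4.5))
print(can_generate_audio("Pro", 10, 55))
```

Because unused minutes do not roll over, `minutes_used` would reset to zero at the start of each billing month.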

FAQ

Video to Speech — Common Questions

Paste the video URL (or upload a file) into Dokitscript, wait for the transcript, use the AI Translation feature to translate the text into your target language, then click Listen. Dokitscript generates a natural AI voice MP3 using ElevenLabs and lets you download it. The process takes a couple of minutes end to end.

You can paste URLs from TikTok, Instagram Reels, YouTube (including Shorts), Facebook, X (Twitter), and LinkedIn. You can also upload local audio or video files (MP3, WAV, M4A, MP4, WebM — up to 50 MB).

Transcription supports 90+ languages via OpenAI Whisper. Audio generation (the MP3 voice output) supports approximately 29 languages via ElevenLabs eleven_multilingual_v2, including English, French, Spanish, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese, Japanese, Korean, Hindi, Indonesian, Filipino, Swedish, Bulgarian, Romanian, Greek, Finnish, Croatian, Slovak, Danish, Tamil, and Ukrainian.

The downloaded MP3 file is encoded at 128 kbps, which is suitable for voiceovers, podcasts, language learning and accessibility use cases.

Video to speech does not dub the original video. The current feature generates a standalone MP3 audio file with a natural AI voice reading the translated text. It does not replace or synchronize audio within the original video, does not clone the original speaker's voice, and does not offer multiple voice choices. It is a voiceover audio file, not a dubbed video.

Audio generation requires the Starter plan or higher. The Free plan gives you transcription and AI text translation, but not the MP3 audio output. Starter includes 6 minutes of audio per month, Pro includes 60 minutes, and Business includes 240 minutes.

Video to speech is different from AI transcription. AI transcription converts speech to text. Video to speech goes further: it transcribes the video, translates the text into another language, and then converts that translated text back into spoken audio as an MP3 file. It is speech-to-text-to-speech, with a translation step in the middle.

You can transcribe and translate for free (Free plan: 5 transcriptions/month, AI translation included up to 3 uses/month). The MP3 audio generation step requires a Starter plan or higher, starting at $4.99/month.
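The upload limits listed in this FAQ (MP3, WAV, M4A, MP4, WebM, up to 50 MB) can be checked locally before uploading. This is a hedged sketch with hypothetical names; it assumes "50 MB" means 50 × 1024 × 1024 bytes and that the file type is judged by extension, neither of which is confirmed on this page.

```python
from pathlib import Path

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".mp4", ".webm"}
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" read as 50 MiB


def upload_problems(filename: str, size_bytes: int) -> list[str]:
    """Return a list of reasons the file would be rejected (empty if none)."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file is larger than 50 MB")
    return problems


print(upload_problems("talk.MP4", 10_000_000))
print(upload_problems("notes.pdf", 60 * 1024 * 1024))
```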
