Learn how to turn your audio into text.
The /transcriptions endpoint supports the following input formats: mp3, wav, m4a, flac, ogg, opus, mp4, mov, avi, and mkv. Furthermore, the API supports sending direct links to the files and neural speaker diarization.
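As an illustration, a small helper can decide up front whether to send a direct link or a multipart upload. This is a sketch only: the endpoint path comes from this page, but the form field name ("file") and the shape of the request are assumptions to verify against the API reference.

```python
# Sketch: preparing a /transcriptions request. The API accepts either an
# uploaded file or a direct link, so this helper distinguishes the two.
# The field name "file" is an assumption -- check the API reference.
import os

SUPPORTED_FORMATS = {"mp3", "wav", "m4a", "flac", "ogg", "opus",
                     "mp4", "mov", "avi", "mkv"}

def transcription_fields(file_or_url: str) -> dict:
    """Return keyword arguments for requests.post(): a form field
    carrying a direct link, or a multipart file upload."""
    if file_or_url.startswith(("http://", "https://")):
        return {"data": {"file": file_or_url}}  # direct link to the audio
    ext = os.path.splitext(file_or_url)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported file format: .{ext}")
    return {"files": {"file": open(file_or_url, "rb")}}  # local upload
```

With the requests library this could then be called as, for example, `requests.post(API_URL, headers=auth_headers, **transcription_fields("interview.mp3"))`.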
The API supports multiple output formats (json, text, srt, verbose_json, vtt). Read more about the output formats on the Response Formats page.
The following languages are supported by the /transcriptions endpoint:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the model was trained on 98 languages, the list above only contains languages with a word error rate (WER) below 50%. View the entire language list and their codes in ISO-639-1 format on the Supported Languages page.
You can retrieve timestamps for the transcribed audio by using the timestamp_granularities[] parameter. This feature requires setting the response_format to verbose_json. For more details, visit the API Documentation.
Currently, two levels of granularity are available:
segment Granularity:
- Set timestamp_granularities[]=segment in your API request (along with response_format=verbose_json).
- The verbose_json response will contain a segments array. Each object in this array represents a larger chunk or segment of the transcribed audio (30 seconds in length). Each segment object includes its start time, end time, and the transcribed text for that specific chunk.

word Granularity:
- Set timestamp_granularities[]=word in your API request (along with response_format=verbose_json).
- The verbose_json response will contain:
- words array: Each object in this array details an individual recognized word, including the word itself, its precise start time, and its end time in seconds.
- segments array: Even when requesting word granularity, the response still includes the segments array described above. This gives you both fine-grained word timing and the broader segment structure in a single response.

Use segment if you only need timing for larger chunks of speech. Use word if you need precise start/end times for individual words. Requesting word granularity conveniently provides both the words and segments arrays. Both granularities require response_format=verbose_json.

By setting the response_format parameter in your request, you can tailor the output to best suit your needs, whether you require simple text, structured data for programmatic use, or ready-to-use subtitle files. Refer to the API Documentation for examples.
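For instance, word-level timestamps can be read straight off the parsed verbose_json response. The sample below is a hand-written stand-in that mirrors only the fields described above; real responses carry additional metadata such as language and duration.

```python
# Sketch: reading word-level timestamps from a verbose_json response.
# `sample_response` is illustrative, not actual API output.
sample_response = {
    "text": "hello world",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.42},
        {"word": "world", "start": 0.55, "end": 0.98},
    ],
    "segments": [
        {"start": 0.0, "end": 0.98, "text": "hello world"},
    ],
}

def word_timings(response: dict) -> list[tuple[str, float, float]]:
    """Flatten the words array into (word, start_seconds, end_seconds)."""
    return [(w["word"], w["start"], w["end"]) for w in response["words"]]

print(word_timings(sample_response))
# [('hello', 0.0, 0.42), ('world', 0.55, 0.98)]
```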
The API supports the following output formats:
- json: Returns a standard JSON object containing the transcribed text.
- text: Returns the transcription as a single plain text string.
- verbose_json: Returns a detailed JSON object containing the text, language, duration, and potentially segment- and word-level timestamps (if requested via timestamp_granularities[]).
- srt: Returns the transcription formatted as an SRT subtitle file.
- vtt: Returns the transcription formatted as a WebVTT subtitle file.

Note: when requesting a subtitle format (srt or vtt), do not use response.text if you are using Python's requests library. Use response.json() instead.

To enable speaker diarization, set the task parameter to "diarize" in your API request.
Furthermore, you can pass the num_speakers parameter to specify the number of speakers you expect in the audio file. Note that this parameter does not guarantee the correct number of speakers but can guide the model.
The model also supports three diarization settings:
- general: Default setting for general use.
- meeting: Optimized for meetings.
- telephonic: Optimized for telephonic conversations.

The diarization results are returned in the segments array. Each object within this array will have a speaker field (e.g., "speaker_0", "speaker_1") identifying who spoke during that specific audio segment, along with their start time, end time, and the text they spoke.
The output will always be in the following format, no matter the response_format setting:
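As a sketch of consuming that output, the sample below hand-writes a response with the fields described above (a segments array whose objects carry speaker, start, end, and text) and groups the utterances by speaker; it is illustrative, not actual API output.

```python
# Sketch: grouping diarized segments by speaker. `diarized` mirrors the
# fields described above; real output may include additional keys.
from collections import defaultdict

diarized = {
    "segments": [
        {"speaker": "speaker_0", "start": 0.0, "end": 2.1, "text": "Hi there."},
        {"speaker": "speaker_1", "start": 2.2, "end": 4.0, "text": "Hello!"},
        {"speaker": "speaker_0", "start": 4.1, "end": 6.3, "text": "How are you?"},
    ]
}

def lines_by_speaker(response: dict) -> dict[str, list[str]]:
    """Collect each speaker's utterances in order of appearance."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for seg in response["segments"]:
        grouped[seg["speaker"]].append(seg["text"])
    return dict(grouped)
```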