Supported Formats and Limitations
- Supported formats: mp3, wav, m4a, flac, ogg, opus, mp4, mov, avi, mkv. Sending files by URL is also supported by the API.
- Maximum file size: 1 GB. If you need more, write to Support.
- Minimum audio length: 0.3 seconds.
- Maximum audio length: 10 hours.
- Rate limit: 10 requests per second.
To convert a video file to a supported audio format, you can use ffmpeg:

ffmpeg -i input.mp4 -vn -c:a aac -b:a 192k output.m4a

In this example, the video file input.mp4 is converted to output.m4a with an audio bitrate of 192 kbps.

Examples
Click the picker on the right (verbose_json_example, json_example, or diarization_example) to view the example responses.
Authorizations
Use your API key as a Bearer token in the Authorization header. Example: Authorization: Bearer nx-yourkey
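A minimal sketch in Python (using the requests library) of attaching the key as a Bearer token; nx-yourkey is the placeholder from the example above:

```python
import requests

# Reuse one session so every request carries the Authorization header.
# Replace the placeholder key with your real API key.
session = requests.Session()
session.headers.update({"Authorization": "Bearer nx-yourkey"})
```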
Body
The audio file object (not filename) to transcribe, in one of the supported formats. Either file or url must be sent.
The URL of the audio file to transcribe, in one of the supported formats. This option is unsupported by the OpenAI SDK. Either file or url must be sent.
"https://upload.wikimedia.org/wikipedia/commons/a/a1/Gettysburg_by_Britton.ogg"
The task to perform. Currently only 'transcribe' and 'diarize' are supported. transcribe just transcribes the audio, while diarize also identifies different speakers in the audio and attributes transcribed segments to each one.
Available options: transcribe, diarize. Example: "transcribe"
ID of the model to use. Only whisper-1 is currently available.
"whisper-1"
The language of the input audio (ISO-639-1 format). Auto-detected if omitted.
"ru"
The format of the transcript output. srt and vtt formats will return ready-to-use formatted subtitles. If the diarize task is used, the response will always be a json object.
Available options: json, text, srt, verbose_json, vtt. Example: "verbose_json"
The number of speakers to detect. If not provided, the model will detect the number of speakers automatically. Note that this parameter does not guarantee the exact number of speakers, but it can guide the model. This parameter is ignored if the task is not diarize.
Example: 2
The config for the diarization model. general is the default setting, meeting is optimized for meetings, and telephonic is optimized for telephone conversations. This parameter is ignored if the task is not diarize.
Available options: general, meeting, telephonic. Example: "telephonic"
Timestamp granularities to include. word requires response_format to be verbose_json.
Available options: segment, word. Example: "segment"
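A sketch of a full transcription request with these body fields, using Python's requests library. The endpoint URL below is a placeholder assumption, not taken from this page; the field names mirror the parameters described above (file, url, model, task, language, response_format):

```python
import requests

# Placeholder endpoint: substitute the actual URL from this API reference.
API_URL = "https://api.example.com/v1/audio/transcriptions"

with open("meeting.mp3", "rb") as audio:
    response = requests.post(
        API_URL,
        headers={"Authorization": "Bearer nx-yourkey"},
        files={"file": audio},                  # the audio file object, not a filename
        data={
            "model": "whisper-1",               # only whisper-1 is currently available
            "task": "transcribe",               # or "diarize" to attribute segments to speakers
            "language": "ru",                   # ISO-639-1; auto-detected if omitted
            "response_format": "verbose_json",  # json, text, srt, verbose_json, or vtt
            # To transcribe a remote file instead, omit files= and send "url" here.
        },
    )

response.raise_for_status()
result = response.json()
print(result["text"])
```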
Response
Successful transcription or diarization response. The format depends on the 'response_format' parameter.
The full transcribed text.
The task that was performed. Currently, always returns transcribe.
The language of the input audio.
The duration of the input audio.
Segments of the transcribed text and their details.
Extracted words and their corresponding timestamps.
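As a sketch, assuming the standard verbose_json field names (text, task, language, duration, segments, words) and continuing from the result object in the request example above, the response fields can be read like this:

```python
print(result["text"])        # the full transcribed text
print(result["task"])        # the task that was performed
print(result["language"])    # the language of the input audio
print(result["duration"])    # the duration of the input audio

# Segments of the transcribed text and their details.
for segment in result.get("segments", []):
    print(segment)

# Extracted words and their timestamps (word-level granularity, verbose_json only).
for word in result.get("words", []):
    print(word)
```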