Transcribes audio from a given audio file. Also supports neural speaker diarization.
Supported formats: mp3, wav, m4a, flac, ogg, opus, mp4, mov, avi, mkv. Files can also be sent by URL.
Maximum file size: 1GB. If you need more, write to Support.
A video file can be converted to audio with ffmpeg, for example:
ffmpeg -i input.mp4 -vn -c:a aac -b:a 192k output.m4a
In this example, the video file input.mp4 is converted to output.m4a with an audio bitrate of 192 kbps.
Select verbose_json_example, json_example or diarization_example to view the example responses.
Use your API key as a Bearer token in the Authorization header. Example: Authorization: Bearer nx-yourkey
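As a minimal sketch of the Bearer scheme described above, the Authorization header could be built as follows. The helper name build_auth_headers is hypothetical, and nx-yourkey is the placeholder key from the example, not a real credential.

```python
def build_auth_headers(api_key: str) -> dict[str, str]:
    """Return HTTP headers that carry the API key as a Bearer token."""
    key = api_key.strip()
    if not key:
        raise ValueError("API key must not be empty")
    return {"Authorization": f"Bearer {key}"}


# Placeholder key taken from the example above.
headers = build_auth_headers("nx-yourkey")
print(headers["Authorization"])  # prints: Bearer nx-yourkey
```

The returned dictionary can be passed as the headers of whichever HTTP client is used to call the endpoint.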
The audio file object (not filename) to transcribe, in one of the supported formats. Either file or url must be sent.
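The "either file or url" rule can be checked on the client before a request is sent. The sketch below is one possible way to do that; the helper name select_audio_source is hypothetical, and rejecting a request that supplies both fields is an assumption, since this reference does not say which one would take precedence.

```python
from typing import BinaryIO, Optional


def select_audio_source(file: Optional[BinaryIO] = None,
                        url: Optional[str] = None) -> dict:
    """Return the one request field (file or url) that carries the audio."""
    if file is None and url is None:
        raise ValueError("either file or url must be sent")
    if file is not None and url is not None:
        # Assumption: precedence is not documented, so this sketch refuses both.
        raise ValueError("send either file or url, not both")
    if file is not None:
        # The API expects the file object itself, not its filename.
        if isinstance(file, str):
            raise TypeError("pass an open binary file object, not a filename")
        return {"file": file}
    return {"url": url}


source = select_audio_source(url="https://example.com/audio.ogg")
print(sorted(source))  # prints: ['url']
```

The URL in the usage line is a placeholder. A file would be passed as an open binary object, for example the result of open(path, "rb").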
The URL of the audio file to transcribe, in one of the supported formats. This option is not supported by the OpenAI SDK. Either file or url must be sent.
Example: "https://upload.wikimedia.org/wikipedia/commons/a/a1/Gettysburg_by_Britton.ogg"
The task to perform. Currently only transcribe and diarize are supported. transcribe just transcribes the audio, while diarize also identifies the different speakers in the audio and attributes transcribed segments to each one.
Available options: transcribe, diarize
Example: "transcribe"
ID of the model to use. Only whisper-1 is currently available.
Example: "whisper-1"
The language of the input audio (ISO 639-1 format). Auto-detected if omitted.
Example: "ru"
The format of the transcript output. The srt and vtt formats return ready-to-use formatted subtitles. If the diarize task is used, the response will always be a json object.
Available options: json, text, srt, verbose_json, vtt
Example: "verbose_json"
The number of speakers to detect. If omitted, the model detects the number of speakers automatically. This value guides the model but does not guarantee that the correct number of speakers is found. This parameter is ignored if the task is not diarize.
Example: 2
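Because the speaker count only matters for diarization, a client can leave it out of the request in every other case. The following is a minimal sketch of that logic. The field name num_speakers and the helper name speaker_count_field are assumptions made for illustration; this reference does not state the parameter's actual name.

```python
from typing import Optional


def speaker_count_field(task: str, num_speakers: Optional[int] = None) -> dict:
    """Return the speaker-count field, or {} when it would have no effect."""
    if task != "diarize" or num_speakers is None:
        # Ignored for non-diarize tasks; omitting it means automatic detection.
        return {}
    if num_speakers < 1:
        raise ValueError("num_speakers must be a positive integer")
    # "num_speakers" is an assumed field name, not confirmed by this reference.
    return {"num_speakers": num_speakers}


print(speaker_count_field("diarize", 2))     # prints: {'num_speakers': 2}
print(speaker_count_field("transcribe", 2))  # prints: {}
```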
The config for the diarization model. general is the default setting. meeting is optimized for meetings, while telephonic is optimized for telephone conversations. This parameter is ignored if the task is not diarize.
Available options: general, meeting, telephonic
Example: "telephonic"
Timestamp granularities to include. word requires response_format to be verbose_json.
Available options: segment, word
Example: "segment"
Successful transcription or diarization response. The format depends on the 'response_format' parameter.
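Since the shape of the response varies with the request, a caller may want to decide up front how to handle the body. The sketch below classifies the expected response from the task and response_format values documented above. The helper name expected_response_kind and the returned labels are hypothetical. Treating json and verbose_json as JSON and text as plain text is an assumption based on the format names.

```python
SUBTITLE_FORMATS = ("srt", "vtt")
RESPONSE_FORMATS = ("json", "text", "srt", "verbose_json", "vtt")


def expected_response_kind(task: str, response_format: str) -> str:
    """Classify the response body the caller should be prepared to handle."""
    if task not in ("transcribe", "diarize"):
        raise ValueError(f"unsupported task: {task!r}")
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    if task == "diarize":
        # Diarization always yields a JSON object, whatever response_format says.
        return "json_object"
    if response_format in SUBTITLE_FORMATS:
        return "subtitles"
    # Assumption: json and verbose_json return JSON, text returns plain text.
    return "plain_text" if response_format == "text" else "json_object"


print(expected_response_kind("diarize", "srt"))     # prints: json_object
print(expected_response_kind("transcribe", "vtt"))  # prints: subtitles
```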
The full transcribed text.
The task that was performed. Currently, always returns transcribe.
The language of the input audio.
The duration of the input audio.
Segments of the transcribed text and their details.
Extracted words and their corresponding timestamps.
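The response fields listed above could be read as in the following sketch. The key names text, task, language, duration, segments and words are assumptions modeled on the field descriptions, as are the keys inside each segment and word; the sample payload is made up for illustration and is not real API output.

```python
import json


def summarize_transcription(raw: str) -> dict:
    """Pull the documented fields out of a JSON transcription response."""
    data = json.loads(raw)
    return {
        "text": data["text"],
        "task": data.get("task"),
        "language": data.get("language"),
        "duration": data.get("duration"),
        # Assumption: segments and words may be missing in less detailed formats.
        "segment_count": len(data.get("segments") or []),
        "word_count": len(data.get("words") or []),
    }


# Made-up payload for illustration only; key names are assumed.
sample = json.dumps({
    "text": "hello world",
    "task": "transcribe",
    "language": "en",
    "duration": 1.5,
    "segments": [{"text": "hello world"}],
    "words": [{"word": "hello"}, {"word": "world"}],
})
summary = summarize_transcription(sample)
print(summary["segment_count"], summary["word_count"])  # prints: 1 2
```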