transcriptions of audio and video files in the following formats: mp3, wav, m4a, flac, ogg, opus, mp4, mov, avi, and mkv. Furthermore, the API supports sending direct links to the files and neural speaker diarization.
Transcriptions
The transcriptions API takes as input the audio (or video) file you want to transcribe (or a link to that file) and the desired output format for the transcription. The model supports the following output formats (json, text, srt, verbose_json, vtt). Read more about the output formats on the Response Formats page.
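For illustration, a basic request to the /transcriptions endpoint using Python's requests library might look like the sketch below. The endpoint URL, API key, authentication header, and file field name are placeholders rather than values from this page; only response_format is documented here.

```python
# Minimal sketch of a transcription request with Python's requests library.
# The endpoint URL, authentication scheme, and "file" field name are
# assumptions -- check the API Documentation for the exact values.
import requests

API_URL = "https://api.example.com/v1/transcriptions"  # placeholder endpoint
API_KEY = "your-api-key"                               # placeholder credential

with open("interview.mp3", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth header
        files={"file": audio_file},                      # the audio or video file
        data={"response_format": "json"},                # one of the supported formats
    )

response.raise_for_status()
# Assumes the JSON object exposes the transcription under a "text" field.
print(response.json()["text"])
```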
Supported Languages
We currently support the following languages through the /transcriptions endpoint:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the model was trained on 98 languages, the list above only contains languages with a word error rate (WER) below 50%. View the entire language list and their codes in the ISO-639-1 format on the Supported Languages page.
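If you want to hint the input language, an ISO-639-1 code could be passed alongside the other form fields. The language parameter name in the sketch below is an assumption modeled on similar transcription APIs and is not confirmed by this page; check the API Documentation for the actual field.

```python
# Hypothetical sketch: adding an ISO-639-1 language code to the request data.
# The "language" field name is an assumption, not confirmed by this page.
data = {
    "response_format": "json",
    "language": "de",  # ISO-639-1 code for German
}
# Pass this dict as `data=data` to requests.post(), as in the sketch above.
```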
Timestamp Granularities
To get detailed timing information alongside your transcription, you can use the timestamp_granularities[] parameter. This feature requires setting the response_format to verbose_json. For more details, visit the API Documentation.
Currently, two levels of granularity are available:
- segment granularity:
  - How to request: Include timestamp_granularities[]=segment in your API request (along with response_format=verbose_json).
  - What it returns: The verbose_json response will contain a segments array. Each object in this array represents a larger chunk or segment of the transcribed audio (30 seconds in length). Each segment object includes its start time, end time, and the transcribed text for that specific chunk.
- word granularity:
  - How to request: Include timestamp_granularities[]=word in your API request (along with response_format=verbose_json).
  - What it returns: This provides the most detailed timing information. The verbose_json response will contain:
    - A words array: each object in this array details an individual recognized word, including the word itself, its precise start time, and its end time in seconds.
    - A segments array: even when requesting word granularity, the response still includes the segments array described above. This gives you both fine-grained word timing and the broader segment structure in a single response.
- Use segment if you only need timing for larger chunks of speech.
- Use word if you need precise start/end times for individual words. Requesting word granularity conveniently provides both the words and segments arrays (see the sketch after this list).
- Remember, both options require response_format=verbose_json.
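Putting this together, a request for word-level timestamps might look like the following sketch. As before, the endpoint URL, API key, and file field name are placeholders; response_format and timestamp_granularities[] are the parameters described above.

```python
# Sketch: requesting word-level timestamps (the segments array is returned too).
# Endpoint, auth header, and "file" field name are placeholders.
import requests

API_URL = "https://api.example.com/v1/transcriptions"  # placeholder endpoint
API_KEY = "your-api-key"                               # placeholder credential

with open("interview.mp3", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio_file},
        data={
            "response_format": "verbose_json",
            "timestamp_granularities[]": "word",
        },
    )

result = response.json()
for word in result["words"]:        # fine-grained word timing
    print(f'{word["start"]:.2f}-{word["end"]:.2f}s  {word["word"]}')
for segment in result["segments"]:  # broader segment structure
    print(f'{segment["start"]:.2f}-{segment["end"]:.2f}s  {segment["text"]}')
```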
Response Formats
The Nexara Transcription API allows you to specify the format in which you want to receive the transcription results. By using the response_format parameter in your request, you can tailor the output to best suit your needs, whether you require simple text, structured data for programmatic use, or ready-to-use subtitle files. Refer to the API Documentation for examples.
The API supports the following output formats:
- json: Returns a standard JSON object containing the transcribed text.
- text: Returns the transcription as a single plain text string.
- verbose_json: Returns a detailed JSON object containing the text, language, duration, and potentially segment and word-level timestamps (if requested via timestamp_granularities[]).
- srt: Returns the transcription formatted as an SRT subtitle file.
- vtt: Returns the transcription formatted as a WebVTT subtitle file.
When receiving a response in a subtitle format (srt or vtt), do not use response.text if you are using Python's requests library. Use response.json() instead.
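Following that note, retrieving SRT subtitles might look like the sketch below. The endpoint URL, API key, and file field name are placeholders, as in the earlier sketches.

```python
# Sketch: requesting SRT subtitles and saving them to a file. Per the note
# above, the subtitle content is read with response.json(), not response.text.
# This assumes the subtitle body is returned as a JSON-encoded string.
import requests

API_URL = "https://api.example.com/v1/transcriptions"  # placeholder endpoint
API_KEY = "your-api-key"                               # placeholder credential

with open("interview.mp3", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio_file},
        data={"response_format": "srt"},
    )

subtitles = response.json()  # per the note above
with open("interview.srt", "w", encoding="utf-8") as srt_file:
    srt_file.write(subtitles)
```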
Speaker Diarization
Nexara API supports neural speaker diarization. This feature identifies different speakers in an audio file and attributes transcribed segments to each one. To get started, simply set the task parameter to "diarize" in your API request.
Furthermore, you can pass the num_speakers parameter to specify the number of speakers you expect in the audio file. Note that this parameter does not guarantee the correct number of speakers but can guide the model.
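For example, a diarization request might look like the sketch below. The task and num_speakers parameters are the ones described above; the endpoint URL, API key, and file field name remain placeholders.

```python
# Sketch: enabling neural speaker diarization with an expected speaker count.
# Endpoint, auth header, and "file" field name are placeholders.
import requests

API_URL = "https://api.example.com/v1/transcriptions"  # placeholder endpoint
API_KEY = "your-api-key"                               # placeholder credential

with open("call.wav", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": audio_file},
        data={
            "task": "diarize",   # enable speaker diarization
            "num_speakers": 2,   # a hint, not a guarantee
        },
    )

result = response.json()
for segment in result["segments"]:
    print(f'{segment["speaker"]}: {segment["text"]}')
```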
The model also supports three diarization settings:
- general: Default setting for general use.
- meeting: Optimized for meetings.
- telephonic: Optimized for telephonic conversations.
The speaker information is returned in the segments array. Each object within this array will have a speaker field (e.g., "speaker_0", "speaker_1") identifying who spoke during that specific audio segment, along with their start time, end time, and the text they spoke.
The output will always be returned in the same format, no matter the response_format setting.
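For illustration only, a diarized response might have roughly the shape sketched below. The field names are taken from the description above; the exact schema may differ or include additional fields, so refer to the API Documentation.

```python
# Illustrative shape of a diarized response, based on the fields described
# above (speaker, start, end, text). Example values are invented.
example_response = {
    "segments": [
        {"speaker": "speaker_0", "start": 0.0, "end": 4.2,
         "text": "Hello, thanks for calling."},
        {"speaker": "speaker_1", "start": 4.2, "end": 7.9,
         "text": "Hi, I have a question about my order."},
    ]
}
```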
Note that Nexara doesn’t support channel diarization. This means that the API
will ignore channel information and identify speakers based on the audio
content alone.