- transcriptions: accepts audio and video files in the following formats: mp3, wav, m4a, flac, ogg, opus, mp4, mov, avi, and mkv. Furthermore, the API supports sending direct links to the files and neural speaker diarization.
Transcriptions
The transcriptions API takes as input the audio (or video) file you want to transcribe (or a link to that file) and the desired output format for the transcription. The model supports the following output formats: json, text, srt, verbose_json, and vtt. Read more about the output formats on the Response Formats page.
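For orientation, a minimal upload with Python's requests library might look like the sketch below. The endpoint URL, the Authorization header, and the file / response_format field names are assumptions here, not confirmed values; check the API Documentation for the exact request contract.

```python
import requests

# Assumed endpoint and auth scheme -- confirm both in the API Documentation.
API_URL = "https://api.nexara.ru/api/v1/audio/transcriptions"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Upload a local audio file and request plain-text output.
with open("interview.mp3", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers=HEADERS,
        files={"file": audio_file},
        data={"response_format": "text"},
    )

response.raise_for_status()
print(response.text)  # the transcription as a plain string
```

To transcribe by link instead of uploading, the API accepts a direct URL to the file; the exact parameter name for that is not shown in this section, so consult the API Documentation.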
Supported Languages
We currently support the following languages through the /transcriptions endpoint:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the model was trained on 98 languages, the list above only contains languages with a word error rate (WER) below 50%. View the entire language list and their codes in ISO-639-1 format on the Supported Languages page.
Timestamp Granularities
To get detailed timing information alongside your transcription, you can use the timestamp_granularities[] parameter. This feature requires setting the response_format to verbose_json. For more details, visit the API Documentation.
Currently, two levels of granularity are available:
- segment granularity:
  - How to request: Include timestamp_granularities[]=segment in your API request (along with response_format=verbose_json).
  - What it returns: The verbose_json response will contain a segments array. Each object in this array represents a larger chunk or segment of the transcribed audio (30 seconds in length). Each segment object includes its start time, end time, and the transcribed text for that specific chunk.
- word granularity:
  - How to request: Include timestamp_granularities[]=word in your API request (along with response_format=verbose_json).
  - What it returns: This provides the most detailed timing information. The verbose_json response will contain:
    - A words array: Each object in this array details an individual recognized word, including the word itself, its precise start time, and its end time in seconds.
    - A segments array: Even when requesting word granularity, the response still includes the segments array described above. This gives you both fine-grained word timing and the broader segment structure in a single response.
- Use segment if you only need timing for larger chunks of speech.
- Use word if you need precise start/end times for individual words. Requesting word granularity conveniently provides both the words and segments arrays.
- Remember, both options require response_format=verbose_json.
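As a sketch, requesting word-level timestamps and reading back both arrays might look like this. The endpoint URL and auth header are the same placeholders assumed earlier, and the words / segments field names follow the description above:

```python
import requests

API_URL = "https://api.nexara.ru/api/v1/audio/transcriptions"  # assumed endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}              # assumed auth scheme

with open("lecture.mp3", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers=HEADERS,
        files={"file": audio_file},
        data={
            "response_format": "verbose_json",    # required for timestamps
            "timestamp_granularities[]": "word",  # also returns segments
        },
    )

result = response.json()

# Fine-grained timing for each recognized word.
for word in result["words"]:
    print(f'{word["start"]:7.2f}s  {word["end"]:7.2f}s  {word["word"]}')

# Broader segment structure, returned alongside the words array.
for segment in result["segments"]:
    print(f'[{segment["start"]:.2f}-{segment["end"]:.2f}] {segment["text"]}')
```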
Response Formats
The Nexara Transcription API allows you to specify the format in which you want to receive the transcription results. By using the response_format parameter in your request, you can tailor the output to best suit your needs, whether you require simple text, structured data for programmatic use, or ready-to-use subtitle files. Refer to the API Documentation for examples.
The API supports the following output formats:
- json: Returns a standard JSON object containing the transcribed text.
- text: Returns the transcription as a single plain text string.
- verbose_json: Returns a detailed JSON object containing the text, language, duration, and potentially segment and word-level timestamps (if requested via timestamp_granularities[]).
- srt: Returns the transcription formatted as an SRT subtitle file.
- vtt: Returns the transcription formatted as a WebVTT subtitle file.
When getting an answer in a subtitle format (srt or vtt), do not use response.text if using Python's requests library. Use response.json() instead.
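A hedged sketch of fetching SRT subtitles, assuming the same placeholder endpoint and auth header as above and that the subtitle body comes back as a JSON-encoded string (which is what the note about response.json() suggests):

```python
import requests

API_URL = "https://api.nexara.ru/api/v1/audio/transcriptions"  # assumed endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}              # assumed auth scheme

with open("episode.mp4", "rb") as video_file:
    response = requests.post(
        API_URL,
        headers=HEADERS,
        files={"file": video_file},
        data={"response_format": "srt"},
    )

# Per the note above, read subtitle output via .json(), not .text
# (assuming the SRT content is returned as a JSON-encoded string).
subtitles = response.json()

with open("episode.srt", "w", encoding="utf-8") as srt_file:
    srt_file.write(subtitles)
```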
Speaker Diarization
Nexara API supports neural speaker diarization. This feature identifies different speakers in an audio file and attributes transcribed segments to each one. To get started, simply set the task parameter to "diarize" in your API request.
Furthermore, you can pass the num_speakers parameter to specify the number of speakers you expect in the audio file. Note that this parameter does not guarantee the correct number of speakers but can guide the model.
The model also supports three diarization settings:
- general: Default setting for general use.
- meeting: Optimized for meetings.
- telephonic: Optimized for telephonic conversations.
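A minimal diarization request sketch, using the documented task and num_speakers fields and the same assumed endpoint and auth header as above; the parameter that selects one of the three settings is not named in this section, so it appears only as a hypothetical, commented-out field:

```python
import requests

API_URL = "https://api.nexara.ru/api/v1/audio/transcriptions"  # assumed endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}              # assumed auth scheme

with open("support_call.wav", "rb") as audio_file:
    response = requests.post(
        API_URL,
        headers=HEADERS,
        files={"file": audio_file},
        data={
            "task": "diarize",   # enable neural speaker diarization
            "num_speakers": 2,   # a hint for the model, not a guarantee
            # Hypothetical field name -- the parameter for choosing the
            # "general" / "meeting" / "telephonic" setting is not given here:
            # "diarization_setting": "telephonic",
        },
    )

result = response.json()
```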
Regardless of the response_format setting, diarization results always come back in the same structure: the response contains a segments array, and each object within this array has a speaker field (e.g., “speaker_0”, “speaker_1”) identifying who spoke during that specific audio segment, along with its start time, end time, and the text that was spoken.
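As an illustration of that shape (the field names follow the description above, but the exact response envelope is an assumption, so check the API Documentation), iterating over the diarized segments might look like this:

```python
# Illustrative response shape only -- field names as described above.
result = {
    "segments": [
        {"speaker": "speaker_0", "start": 0.00, "end": 3.84,
         "text": "Hi, thanks for calling. How can I help you today?"},
        {"speaker": "speaker_1", "start": 3.84, "end": 6.12,
         "text": "Hello, I'd like to ask about my invoice."},
    ],
}

for segment in result["segments"]:
    print(f'{segment["speaker"]} '
          f'[{segment["start"]:.2f}-{segment["end"]:.2f}s]: {segment["text"]}')
```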
Note that Nexara doesn’t support channel diarization. This means that the API will ignore channel information and identify speakers based on the audio content alone.