Quickstart
Learn how to turn your audio into text.
The Audio API provides a single speech-to-text endpoint: `transcriptions`.
File uploads are limited to 1 GB (if you need support for larger files, you can write to Support), and the following input file types are supported: `mp3`, `wav`, `m4a`, `flac`, `ogg`, `opus`, `mp4`, `mov`, `avi`, and `mkv`. Furthermore, the API supports sending direct links to the files and neural speaker diarization.
Transcriptions
The transcriptions API takes as input the audio (or video) file you want to transcribe (or a link to that file) and the desired output format for the transcription. The model supports the following output formats (`json`, `text`, `srt`, `verbose_json`, `vtt`). Read more about the output formats on the Response Formats page.
By default, the response type will be `json`:
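Here is a minimal Python sketch of such a request. The base URL, authentication header, and form-field names (`file`, `response_format`) are assumptions made for illustration; check the API Documentation for the exact values.

```python
import requests

# Minimal transcription request. The URL, auth header, and field names below
# are assumptions for illustration; see the API Documentation for the real ones.
API_URL = "https://api.example.com/v1/audio/transcriptions"  # placeholder

with open("meeting.mp3", "rb") as audio:
    resp = requests.post(
        API_URL,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": audio},             # the API also accepts direct links to files
        data={"response_format": "json"},  # json is the default
    )

resp.raise_for_status()
print(resp.json()["text"])  # assumes the JSON object exposes the transcript under "text"
```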
Supported Languages
We currently support the following languages through the `/transcriptions` endpoint:
Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
While the model was trained on 98 languages, the list above only contains languages with a word error rate (WER) below 50%. View the entire language list and their codes in the ISO-639-1 format on the Supported Languages page.
Timestamp Granularities
To get detailed timing information alongside your transcription, you can use the `timestamp_granularities[]` parameter. This feature requires setting the `response_format` to `verbose_json`. For more details, visit the API Documentation.
Currently, two levels of granularity are available:
- `segment` Granularity:
  - How to request: Include `timestamp_granularities[]=segment` in your API request (along with `response_format=verbose_json`).
  - What it returns: The `verbose_json` response will contain a `segments` array. Each object in this array represents a larger chunk or segment of the transcribed audio (30 seconds in length). Each segment object includes its `start` time, `end` time, and the transcribed `text` for that specific chunk.
- `word` Granularity:
  - How to request: Include `timestamp_granularities[]=word` in your API request (along with `response_format=verbose_json`).
  - What it returns: This provides the most detailed timing information. The `verbose_json` response will contain:
    - A `words` array: Each object in this array details an individual recognized word, including the `word` itself, its precise `start` time, and its `end` time in seconds.
    - A `segments` array: Even when requesting `word` granularity, the response still includes the `segments` array described above. This gives you both fine-grained word timing and the broader segment structure in a single response.
In Summary:
- Use `segment` if you only need timing for larger chunks of speech.
- Use `word` if you need precise start/end times for individual words. Requesting `word` granularity conveniently provides both the `words` and `segments` arrays.
- Remember, both options require `response_format=verbose_json`.
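For example, a request for word-level timestamps might look like the following sketch (as before, the URL, auth header, and field names are assumptions to verify against the API Documentation):

```python
import requests

# Request word-level timestamps. URL, auth header, and field names are
# assumptions; timestamp_granularities[] requires response_format=verbose_json.
API_URL = "https://api.example.com/v1/audio/transcriptions"  # placeholder

with open("meeting.mp3", "rb") as audio:
    resp = requests.post(
        API_URL,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": audio},
        data={
            "response_format": "verbose_json",
            "timestamp_granularities[]": "word",  # segments are still included
        },
    )
```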
Here’s an example of an output:
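The exact fields are documented on the Response Formats page; the sketch below (a Python literal with made-up values) only illustrates the shape described above.

```python
# Illustrative shape of a verbose_json response with word granularity requested.
# Values are made up; additional top-level fields (e.g. language, duration) may appear.
example_output = {
    "text": "Hello and welcome.",
    "words": [
        {"word": "Hello", "start": 0.00, "end": 0.42},
        {"word": "and", "start": 0.42, "end": 0.58},
        {"word": "welcome", "start": 0.58, "end": 1.10},
    ],
    "segments": [
        {"start": 0.00, "end": 1.10, "text": "Hello and welcome."},
    ],
}
```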
Response Formats
The Nexara Transcription API allows you to specify the format in which you want to receive the transcription results. By using the `response_format` parameter in your request, you can tailor the output to best suit your needs, whether you require simple text, structured data for programmatic use, or ready-to-use subtitle files. Refer to the API Documentation for examples.
The API supports the following output formats:
- `json`: Returns a standard JSON object containing the transcribed text.
- `text`: Returns the transcription as a single plain text string.
- `verbose_json`: Returns a detailed JSON object containing the text, language, duration, and potentially segment- and word-level timestamps (if requested via `timestamp_granularities[]`).
- `srt`: Returns the transcription formatted as an SRT subtitle file.
- `vtt`: Returns the transcription formatted as a WebVTT subtitle file.
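For instance, requesting subtitles might look like the sketch below (URL, auth header, and field names are assumptions, as in the earlier sketches).

```python
import requests

# Request SRT subtitles. URL, auth header, and field names are assumptions.
API_URL = "https://api.example.com/v1/audio/transcriptions"  # placeholder

with open("episode.mp4", "rb") as video:
    resp = requests.post(
        API_URL,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": video},
        data={"response_format": "srt"},
    )

# srt/vtt responses are plain text rather than JSON, so write them out directly.
with open("episode.srt", "w", encoding="utf-8") as out:
    out.write(resp.text)
```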
Speaker Diarization
Nexara API supports neural speaker diarization. This feature identifies different speakers in an audio file and attributes transcribed segments to each one. To get started, simply set the `task` parameter to `"diarize"` in your API request.
Furthermore, you can pass the `num_speakers` parameter to specify the number of speakers you expect in the audio file. Note that this parameter does not guarantee the correct number of speakers but can guide the model.
The model also supports three diarization settings:
- `general`: Default setting for general use.
- `meeting`: Optimized for meetings.
- `telephonic`: Optimized for telephonic conversations.
Note that performing speaker diarization is more computationally intensive than just transcribing an audio file, so the delay for the response will be longer. Also, note that diarization costs twice as much as a regular transcription.
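A diarization request might look like the sketch below. `task` and `num_speakers` are named above; the URL, auth header, and the field used to pick one of the three diarization settings are assumptions (this page does not name that parameter).

```python
import requests

# Sketch of a diarization request. The base URL, auth header, and the
# hypothetical "diarization_setting" field name are assumptions.
API_URL = "https://api.example.com/v1/audio/transcriptions"  # placeholder

with open("call.wav", "rb") as audio:
    resp = requests.post(
        API_URL,
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": audio},
        data={
            "task": "diarize",
            "num_speakers": 2,                    # a hint, not a guarantee
            "diarization_setting": "telephonic",  # hypothetical parameter name
        },
    )
```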
When you request diarization, the API will always return a JSON response. This response includes a `segments` array. Each object within this array will have a `speaker` field (e.g., `"speaker_0"`, `"speaker_1"`) identifying who spoke during that specific audio segment, along with their `start` time, `end` time, and the `text` they spoke.
The output will always be in the following format, no matter the `response_format` setting:
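As above, the sketch below is only an illustration of the fields just described (made-up values, written as a Python literal).

```python
# Illustrative shape of a diarization response (values are made up).
example_output = {
    "segments": [
        {"speaker": "speaker_0", "start": 0.0, "end": 3.4, "text": "Hi, thanks for calling."},
        {"speaker": "speaker_1", "start": 3.4, "end": 5.1, "text": "Hello, I have a question."},
    ],
}
```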
Note that Nexara doesn’t support channel diarization. This means that the API will ignore channel information and identify speakers based on the audio content alone.