# Import this cURL into the HTTP Request node
# and add an n8n Binary File
curl --request POST \
  --url https://api.nexara.ru/api/v1/audio/transcriptions \
  --header 'Authorization: Bearer NEXARA_API_KEY' \
  --header 'Content-Type: multipart/form-data'

A verbose_json example response:

{
  "task": "transcribe",
  "language": "en",
  "duration": 9.12,
  "text": "The beach was a popular spot on a hot summer day. People were swimming in the ocean, building sandcastles, and playing beach volleyball.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0,
      "end": 3.319999933242798,
      "text": "The beach was a popular spot on a hot summer day.",
      "tokens": [50364, 440, 7534, 390, 257, 3743, 4008, 322, 257, 2368, 4266, 786, 13, 50530],
      "temperature": 0,
      "avg_logprob": 0,
      "compression_ratio": 0,
      "no_speech_prob": 0
    }
  ]
}
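
Outside n8n, the same request can be made from any HTTP client. Below is a minimal Python sketch using the requests library (not part of the official examples); the file name and API key are placeholders, and response_format is set to verbose_json to match the response above. Note that requests generates the multipart Content-Type header and boundary itself, so that header is not set by hand.

import requests

API_KEY = "NEXARA_API_KEY"  # placeholder; real keys look like "nx-..."

with open("audio.mp3", "rb") as f:  # any supported format
    resp = requests.post(
        "https://api.nexara.ru/api/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": f},                         # the audio file object
        data={"response_format": "verbose_json"},  # matches the response above
    )

resp.raise_for_status()
print(resp.json()["text"])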

Supported Formats and Limitations

  • Supported formats: mp3, wav, m4a, flac, ogg, opus, mp4, mov, avi, mkv. The API also supports sending files by URL.
  • Maximum file size: 1GB. If you need more, write to Support.
  • Minimum audio length: 0.3 seconds.
  • Maximum audio length: 10 hours.
  • Rate limit: 10 requests per second.
To save bandwidth, it is recommended to convert video files to audio formats, for example using ffmpeg:

ffmpeg -i input.mp4 -vn -c:a aac -b:a 192k output.m4a

In this example, the video file input.mp4 is converted to output.m4a with an audio bitrate of 192 kbps.
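
If the conversion runs as part of an upload pipeline, the extraction and the 1 GB limit can be checked in code. A hedged Python sketch (file names are placeholders; it assumes ffmpeg is on PATH):

import os
import subprocess

MAX_BYTES = 1024**3  # documented 1 GB upload limit

def extract_audio(video_path: str, audio_path: str) -> str:
    # Mirrors the ffmpeg command above: drop the video stream (-vn),
    # encode AAC audio at 192 kbps.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vn", "-c:a", "aac", "-b:a", "192k", audio_path],
        check=True,
    )
    if os.path.getsize(audio_path) > MAX_BYTES:
        raise ValueError("file exceeds the 1 GB limit; contact Support")
    return audio_path

extract_audio("input.mp4", "output.m4a")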

Examples

Example responses are provided in three variants: verbose_json, json, and diarization. The verbose_json variant is shown at the top of this page.

Authorizations

Authorization
string
header
required

Use your API key as a Bearer token in the Authorization header. Example: Authorization: Bearer nx-yourkey

Body

multipart/form-data
file
file | null

The audio file object (not filename) to transcribe, in one of the supported formats. Either file or url must be sent.

url
string | null

The URL of the audio file to transcribe, in one of the supported formats. This option is unsupported by the OpenAI SDK. Either file or url must be sent.

Example:

"https://upload.wikimedia.org/wikipedia/commons/a/a1/Gettysburg_by_Britton.ogg"

task
enum<string>
default:transcribe

The task to perform. Currently only transcribe and diarize are supported: transcribe simply transcribes the audio, while diarize also identifies the different speakers in the audio and attributes the transcribed segments to each one.

Available options:
transcribe,
diarize
Example:

"transcribe"

model
string
default:whisper-1

ID of the model to use. Only whisper-1 is currently available.

Example:

"whisper-1"

language
string | null

The language of the input audio (ISO-639-1 format). Auto-detected if omitted.

Example:

"ru"

response_format
enum<string>
default:json

The format of the transcript output. srt and vtt formats will return ready-to-use formatted subtitles. If the diarize task is used, the response will always be a json object.

Available options:
json,
text,
srt,
verbose_json,
vtt
Example:

"verbose_json"

num_speakers
number | null

The number of speakers to detect. If not provided, the model will detect the number of speakers automatically. Note that this parameter does not guarantee the correct number of speakers; it only guides the model. This parameter is ignored if the task is not diarize.

Example:

2

diarization_setting
enum<string>
default:general

The config for the diarization model. general is the default setting. meeting is optimized for meetings, while telephonic is optimized for telephonic conversations. This parameter is ignored if the task is not diarize.

Available options:
general,
meeting,
telephonic
Example:

"telephonic"

timestamp_granularities[]
enum<string>
default:segment

Timestamp granularities to include. word requires response_format to be verbose_json.

Available options:
segment,
word
Example:

"segment"

Response

200
application/json

Successful transcription or diarization response. The format depends on the 'response_format' parameter.

text
string
required

The full transcribed text.

task
string | null

The task that was performed. Currently, always returns transcribe.

language
string | null

The language of the input audio.

duration
number | null

The duration of the input audio.

segments
object[] | null

Segments of the transcribed text and their details.

words
object[] | null

Extracted words and their corresponding timestamps.
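
As a closing sketch, the helper below walks the documented fields of a JSON response. Only text is assumed to be present; every other field may be null, and the segment keys follow the verbose_json example at the top of this page:

def summarize(result: dict) -> None:
    # "text" is the only required field.
    print("text:", result["text"])
    for key in ("task", "language", "duration"):
        print(f"{key}:", result.get(key))  # each may be null or absent
    for seg in result.get("segments") or []:
        # start/end/text appear in the verbose_json example above.
        print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")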