Speech Recognition API Reference
SpeechText.AI provides a simple REST API for fast, accurate, multilingual speech-to-text conversion for most common media formats. Our speech recognition API can be used to transcribe audio/video files stored on your hard drive or files accessible over public URLs (HTTP, FTP, Google Drive, Dropbox, etc.).
For optimal results, capture audio with a sampling rate of 16kHz or higher and use a lossless format to transmit audio. Do not re-sample audio. Submit the audio in its original format.
Each file has a size limit of 1 GB. If you want to process files larger than 1 GB, we recommend compressing the file before uploading it.
Supported languages: English (en-US, en-GL, en-IN), German (de-DE, de-AT), French (fr-FR, fr-CA), Spanish (es-ES, es-MX), Dutch (nl-NL), Italian (it-IT), Portuguese (pt-PT, pt-BR), Russian (ru-RU), Chinese (zh-CN), Japanese (ja-JP), Korean (ko-KR), Arabic (ar-AE), Hindi (hi-IN), Polish (pl-PL), Swedish (sv-SE), Norwegian (no-NO), Danish (da-DK), Finnish (fi-FI), Turkish (tr-TR), Romanian (ro-RO), Czech (cs-CZ), Ukrainian (uk-UA), Greek (el-GR), Thai (th-TH), Indonesian (id-ID), Vietnamese (vi-VN), Filipino (fil-PH).
The base URL for all API requests:
https://api.speechtext.ai/
The following examples show how to use the speech recognition API to transcribe and summarize audio data:
import requests
import time
import json

secret_key = "SECRET_KEY"

# retrieve transcription results for the task
def get_results(config):
    # endpoint to check the status of the transcription task
    endpoint = "https://api.speechtext.ai/results?"
    # use a loop to check if the task is finished
    while True:
        results = requests.get(endpoint, params=config).json()
        if "status" not in results:
            break
        print("Task status: {}".format(results["status"]))
        if results["status"] == 'failed':
            print("The task failed: {}".format(results))
            break
        if results["status"] == 'finished':
            break
        # sleep for 15 seconds if the task still has the status 'processing'
        time.sleep(15)
    return results

# load the audio into memory
with open("/path/to/your/file.m4a", mode="rb") as file:
    post_body = file.read()

# endpoint to start a transcription task
endpoint = "https://api.speechtext.ai/recognize?"
header = {'Content-Type': "application/octet-stream"}
# transcription task options
config = {
    "key": secret_key,
    "language": "en-US",
    "punctuation": True,
    "format": "m4a"
}
# send an audio transcription request
r = requests.post(endpoint, headers=header, params=config, data=post_body).json()
# get the id of the speech recognition task
task = r["id"]
print("Task ID: {}".format(task))

# get transcription results, summary, and highlights
config = {
    "key": secret_key,
    "task": task,
    "summary": True,
    "summary_size": 15,
    "highlights": True,
    "max_keywords": 10
}
transcription = get_results(config)
print("Transcription: {}".format(transcription))

# export your transcription in SRT or VTT format
config = {
    "key": secret_key,
    "task": task,
    "output": "srt",
    "max_caption_words": 15
}
subtitles = get_results(config)
print("Subtitles: {}".format(subtitles))
# create transcription task
curl -H "Content-Type:application/octet-stream" --data-binary @/path/to/your/file.m4a "https://api.speechtext.ai/recognize?key=SECRET_KEY&language=en-US&punctuation=true&format=m4a"
# retrieve transcription results
curl -X GET "https://api.speechtext.ai/results?key=SECRET_KEY&task=TASK_ID&summary=true&summary_size=15&highlights=true&max_keywords=10"
# get captions
curl -X GET "https://api.speechtext.ai/results?key=SECRET_KEY&task=TASK_ID&output=srt&max_caption_words=10"
# process public URL
curl -X GET "https://api.speechtext.ai/recognize?key=SECRET_KEY&url=PUBLIC_URL&language=en-US&punctuation=true&format=mp3"
<?php
$secret_key = "SECRET_KEY";
# load the audio
$filesize = filesize('/path/to/your/file.m4a');
$fp = fopen('/path/to/your/file.m4a', 'rb');
// read the entire file into a binary string
$binary = fread($fp, $filesize);
fclose($fp);
# endpoint and options to start a transcription task
$endpoint = "https://api.speechtext.ai/recognize?key=".$secret_key."&language=en-US&punctuation=true&format=m4a";
$header = array('Content-type: application/octet-stream');
# curl connection initialization
$ch = curl_init();
# curl options
curl_setopt_array($ch, array(
    CURLOPT_URL => $endpoint,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HEADER => false,
    CURLOPT_HTTPHEADER => $header,
    CURLOPT_POSTFIELDS => $binary,
    CURLOPT_FOLLOWLOCATION => true
));
# send an audio transcription request
$body = curl_exec($ch);
if (curl_errno($ch))
{
    echo "CURL error: ".curl_error($ch);
}
else
{
    # parse JSON results
    $r = json_decode($body, true);
    # get the id of the speech recognition task
    $task = $r['id'];
    echo "Task ID: ".$task."\r\n";
    # endpoint to check the status of the transcription task and retrieve results
    $endpoint = "https://api.speechtext.ai/results?key=".$secret_key."&task=".$task."&summary=true&summary_size=15&highlights=true&max_keywords=15";
    curl_setopt_array($ch, array(
        CURLOPT_URL => $endpoint,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => false,
        CURLOPT_HEADER => false,
        CURLOPT_FOLLOWLOCATION => true
    ));
    echo "Get transcription results, summary, and highlights\r\n";
    # use a loop to check if the task is finished
    while (true)
    {
        $body = curl_exec($ch);
        $results = json_decode($body, true);
        if (!array_key_exists('status', $results))
        {
            break;
        }
        echo "Task status: ".$results['status']."\r\n";
        if ($results['status'] == 'failed')
        {
            echo "The task failed!\r\n";
            break;
        }
        if ($results['status'] == 'finished')
        {
            break;
        }
        # sleep for 15 seconds if the task still has the status 'processing'
        sleep(15);
    }
    print_r($results);
}
curl_close($ch);
import java.net.*;
import java.io.*;
import java.util.concurrent.TimeUnit;
import org.json.*;

public class Transcriber {
    public static void main(String[] args) throws Exception {
        String secret_key = "SECRET_KEY";
        HttpURLConnection conn;
        // endpoint and options to start a transcription task
        URL endpoint = new URL("https://api.speechtext.ai/recognize?key=" + secret_key + "&language=en-US&punctuation=true&format=m4a");
        // load the audio into memory
        File file = new File("/path/to/your/file.m4a");
        RandomAccessFile f = new RandomAccessFile(file, "r");
        long sz = f.length();
        byte[] post_body = new byte[(int) sz];
        f.readFully(post_body);
        f.close();
        // send an audio transcription request
        conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        conn.setDoOutput(true);
        conn.connect();
        OutputStream os = conn.getOutputStream();
        os.write(post_body);
        os.flush();
        os.close();
        int responseCode = conn.getResponseCode();
        if (responseCode == 200) {
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            StringBuffer response = new StringBuffer();
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
            in.close();
            String result = response.toString();
            JSONObject json = new JSONObject(result);
            // get the id of the speech recognition task
            String task = json.getString("id");
            System.out.println("Task ID: " + task);
            // endpoint to check the status of the transcription task
            URL res_endpoint = new URL("https://api.speechtext.ai/results?key=" + secret_key + "&task=" + task + "&summary=true&summary_size=15&highlights=true&max_keywords=15");
            System.out.println("Get transcription results, summary, and highlights");
            // use a loop to check if the task is finished
            JSONObject results;
            while (true) {
                conn = (HttpURLConnection) res_endpoint.openConnection();
                conn.setRequestMethod("GET");
                in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                response = new StringBuffer();
                String res;
                while ((res = in.readLine()) != null) {
                    response.append(res);
                }
                in.close();
                results = new JSONObject(response.toString());
                System.out.println("Task status: " + results.getString("status"));
                if (results.getString("status").equals("failed")) {
                    System.out.println("Failed to transcribe!");
                    break;
                }
                if (results.getString("status").equals("finished")) {
                    System.out.println(results);
                    break;
                }
                // sleep for 15 seconds if the task still has the status 'processing'
                TimeUnit.SECONDS.sleep(15);
            }
        } else {
            System.out.println("Failed to transcribe!");
        }
    }
}
Obtain an API Key
Every request to the SpeechText.AI API must include a secret key. If you do not have an API key, please subscribe to one of our pricing plans or sign up to obtain a free API key for non-commercial use.
Start a transcription task
To transcribe audio or video files, send a request to the recognize endpoint. The endpoint supports both POST and GET requests. A POST request body should contain the binary file content and be sent with the Content-Type: application/octet-stream header. A GET request supports public URLs (e.g. shared Google Drive or Dropbox files); links to videos hosted on platforms like YouTube or Vimeo are not valid because they are not direct download links.
When making a POST request to the recognize endpoint, you can include the following parameters:
Parameter | Description | Example | Required
key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes
language | The language of the supplied file as a BCP-47 language tag. The default value is en-US. | fr-FR | No
format | The format of the file to process. If it is not specified, the file format will be detected automatically. | mp3 | No
punctuation | If true, adds punctuation to speech recognition results. The default value is false (punctuation is not added). | true | No
When making a GET request to the recognize endpoint, you can include the following parameters:
Parameter | Description | Example | Required
key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes
url | A URL that points to your audio file (e.g. a public web link, shared Google Drive or Dropbox file). | https://drive.google.com/file/d/18KHbC4_t3SKNbziEvQxOsOSCVOBJQ2W7/view?usp=sharing | Yes
language | The language of the supplied file as a BCP-47 language tag. The default value is en-US. | en-US | No
format | The format of the file to process. If it is not specified, the file format will be detected automatically. | mp3 | No
punctuation | If true, adds punctuation to speech recognition results. The default value is false (punctuation is not added). | true | No
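For example, a GET request that submits a publicly accessible file can be sent from Python as in the sketch below; the file URL is a placeholder, and the parameters mirror the curl example above:
import requests

secret_key = "SECRET_KEY"
# start a transcription task from a public URL instead of uploading a file
config = {
    "key": secret_key,
    "url": "https://example.com/path/to/audio.mp3",  # placeholder public link
    "language": "en-US",
    "punctuation": True,
    "format": "mp3"
}
r = requests.get("https://api.speechtext.ai/recognize?", params=config).json()
print(r)  # contains the status of the new task and its task id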
After a successful POST/GET request, the speech recognition API will respond with the following JSON response:
{
  "status": "processing",
  "created_at": "2020-10-20 13:15:34",
  "id": "151d8043-cd20-442b-923b-64d6e633abfd"
}
The response contains the status of the new transcription task (processing) and the task id (151d8043-cd20-442b-923b-64d6e633abfd). You will need the id (for POST) or the task_id (for GET) value to make GET requests against the API and retrieve the result of your transcription as it completes.
Get the transcription result
To get the transcription result, make repeated GET requests to the results endpoint until the task status is finished or failed (a minimal polling sketch follows the parameter table below).
The API will respond with the following JSON response once the task status is set to finished:
{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    "transcript": "Social networks are huge nowadays. I live in France, so I specifically love Facebook because I can keep in contact with my family back home, but also when I am at home, I can keep in contact with friends. I haven't seen a long time. I have cousins who live far away and it's nice to see pictures of what they're up to so. My cousins have children now and they're always posting pictures of babies on Facebook. I get to see how the babies are growing...",
    "word_time_offsets": [
      {
        "word": "Social",
        "end_time": 1.89,
        "confidence": 1,
        "start_time": 1.53
      },
      {
        "word": "networks",
        "end_time": 2.61,
        "confidence": 0.9908,
        "start_time": 1.89
      },
      {
        "word": "are",
        "end_time": 3.599847,
        "confidence": 0.998155,
        "start_time": 2.67
      },
      {
        "word": "huge",
        "end_time": 4.41,
        "confidence": 1,
        "start_time": 3.63
      },
      {
        "word": "nowadays.",
        "end_time": 5.31,
        "confidence": 0.993838,
        "start_time": 4.56042
      },
      ...
    ]
  }
}
Depending on request parameters, the response will include the following fields:
word_time_offsets - word-specific information for recognized words: each detected word with its corresponding start_time/end_time offsets and a confidence score;
transcript - the transcription text for your audio;
remaining seconds - the remaining seconds on your account balance after this transcription task.
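As an illustration, the word_time_offsets array can be walked with a few lines of Python; this is a sketch that assumes transcription holds the finished JSON response returned by the get_results helper in the example above:
# print per-word timings and confidence scores from a finished transcription
for w in transcription["results"]["word_time_offsets"]:
    print("{} {:.2f}s - {:.2f}s (confidence {:.3f})".format(
        w["word"], w["start_time"], w["end_time"], w["confidence"]))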
When making a GET request to the results endpoint, you can include the following parameters:
Parameter | Description | Example | Required
key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes
task | The unique id of your transcription task. | 151d8043-cd20-442b-923b-64d6e633abfd | Yes
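A minimal polling sketch with only the two required parameters, equivalent to the get_results helper in the Python example above (the task id below is the placeholder from the sample response):
import requests
import time

config = {"key": "SECRET_KEY", "task": "151d8043-cd20-442b-923b-64d6e633abfd"}
while True:
    results = requests.get("https://api.speechtext.ai/results?", params=config).json()
    status = results.get("status")
    if status in (None, "finished", "failed"):
        break
    # status is still 'processing' - wait before polling again
    time.sleep(15)
print(results)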
Auto-detecting keywords
The API automatically extracts the most frequent and most important keywords from your recording. This feature can be used to summarize the transcription text and understand the main topics discussed.
To enable automatic transcript highlights, set the highlights option to true when you send a GET request to the results endpoint. You can also use the max_keywords option to specify the maximum number of extracted keywords. For more information about the results endpoint, see the Get the transcription result section.
https://api.speechtext.ai/results?highlights=true&max_keywords=30
The automatic transcript highlights feature will tag every keyword in the transcription text with the special tag kw.
{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    "transcript": "<kw>Social networks</kw> are <kw>huge nowadays</kw>. I live in France, so I specifically love <kw>Facebook</kw> because I can keep in <kw>contact</kw> with my family back home, but also when I am at home, I can keep in <kw>contact</kw> with friends. I haven't seen a <kw>long time</kw>. I have cousins who live far away and it's nice to see pictures of what they're up to so. My cousins have children now and they're always posting pictures of babies on <kw>Facebook</kw>. I get to see how the babies are growing...",
    ...
}
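For example, the tagged keywords can be collected from the transcript with a short regular expression; this is a sketch that assumes transcription holds the parsed JSON response of a request made with highlights enabled:
import re

# collect the unique keywords wrapped in <kw>...</kw> tags
transcript = transcription["results"]["transcript"]
keywords = sorted(set(re.findall(r"<kw>(.*?)</kw>", transcript)))
print(keywords)  # e.g. ['Facebook', 'Social networks', 'contact', ...]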
Audio summarization
The speech summarization feature automatically detects the most important sentences from audio or video content and generates an accurate extractive summary of the transcription text.
To generate an automatic summary of the transcription text, you can include the following parameters in a request to the results endpoint:
Parameter | Description | Example | Required
summary | If set to true, a summary of the transcription text will be generated. The default value is false. | true | No
summary_size | An integer that determines the percentage of sentences from the original transcription text to be used for the summary. The default value is 15. | 10 | No
summary_words | Determines how many words the output summary will contain. If the summary_words parameter is provided, the summary_size value is ignored. | 100 | No
The minimum number of sentences in the transcription text for summary generation is 5. If the punctuation parameter is set to false or omitted, the summary won't be created.
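For instance, a request for a summary capped at roughly 100 words (rather than a percentage of sentences) might look like the sketch below; it reuses the secret key, task id, and get_results helper from the Python example above:
# summary_words overrides summary_size; punctuation must have been enabled in the recognize request
config = {
    "key": secret_key,
    "task": task,
    "summary": True,
    "summary_words": 100
}
summary_response = get_results(config)
print(summary_response["results"]["summary"])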
You will get a response like the JSON response below:
{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    ...
    "summary": "Social networks are huge nowadays.
My cousins have children now and they're always posting pictures of babies on Facebook.
Twitter is another huge social network.
Who don't know how to use Twitter that they like Facebook better, but social media is huge right now and it's really cool."
  }
}
Export in SRT or VTT format
You can export your transcription results in SRT or VTT format. Include the following parameters in your request to the results endpoint:
Parameter | Description | Example | Required
output | If set to srt or vtt, the API will return the caption output for a transcription task. | srt | No
max_caption_words | The maximum number of transcribed words per caption. The default value is 15. | 20 | No
If the output parameter is set to srt or vtt, all other options at the results endpoint will be ignored.
The API will output a plain-text response in the following format:
1
00:00:01,510 --> 00:00:05,300
Social networks are huge nowadays.
2
00:00:05,310 --> 00:00:11,000
I live in France, so I specifically love Facebook because I can keep in
3
00:00:11,010 --> 00:00:16,400
contact with my family back home, but also when I am at home, I can
4
00:00:16,410 --> 00:00:18,400
keep in contact with friends
5
00:00:18,710 --> 00:00:20,900
I haven't seen a long time.
6
00:00:21,610 --> 00:00:28,800
I have cousins who live far away and it's nice to see pictures of
...
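Because the finished caption output is plain text rather than JSON, a request that saves the captions to a file can be sketched as below; it reuses the secret key and task id from the Python example above, captions.srt is just an example output path, and polling is skipped for brevity:
import requests

config = {
    "key": secret_key,
    "task": task,
    "output": "srt",
    "max_caption_words": 15
}
# the caption response body is plain text, so read response.text instead of calling .json()
response = requests.get("https://api.speechtext.ai/results?", params=config)
with open("captions.srt", "w", encoding="utf-8") as srt_file:
    srt_file.write(response.text)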