Speech Recognition API Reference

SpeechText.AI provides a simple REST API for fast, accurate, multilingual speech-to-text conversion for most common media formats. Our speech recognition API can be used to transcribe audio/video files stored on your hard drive or files accessible over public URLs (HTTP, FTP, Google Drive, Dropbox, etc.).

For optimal results, capture audio at a sampling rate of 16 kHz or higher and transmit it in a lossless format. Do not re-sample the audio; submit it in its original format.

Each file has a size limit of 1 GB. If you need to process a file larger than 1 GB, we recommend compressing it before uploading.

Supported languages: English (en-US), Chinese (zh-CN), German (de-DE), Spanish (es-ES), French (fr-FR), Dutch (nl-NL), Italian (it-IT), Russian (ru-RU), Portuguese (pt-PT), Turkish (tr-TR), Greek (el-GR), Vietnamese (vi-VN).

The base URL for all API requests:

https://api.speechtext.ai/

The following example shows how to use the speech recognition API to transcribe and summarize audio data:


import requests
import time
import json

secret_key = "SECRET_KEY"

# retrieve transcription results for the task
def get_results(config):
  # endpoint to check status of the transcription task
  endpoint = "https://api.speechtext.ai/results?"
  # use a loop to check if the task is finished
  while True:
    results = requests.get(endpoint, params=config).json()
    if "status" not in results:
      break
    print("Task status: {}".format(results["status"]))
    if results["status"] == 'failed':
      print("The task failed: {}".format(results))
      break
    if results["status"] == 'finished':
      break
    # sleep for 15 seconds if the task has the status - 'processing'
    time.sleep(15)
  return results

# load the audio into memory
with open("/path/to/your/file.m4a", mode="rb") as file:
  post_body = file.read()

# endpoint to start a transcription task
endpoint = "https://api.speechtext.ai/recognize?"
header = {'Content-Type': "application/octet-stream"}

# transcription task options
config = {
  "key" : secret_key,
  "language" : "en-US",
  "punctuation" : True,
  "speaker_detection": True,
  "format" : "m4a"
}

# send an audio transcription request
r = requests.post(endpoint, headers = header, params = config, data = post_body).json()

# get the id of the speech recognition task
task = r["id"]
print("Task ID: {}".format(task))

# get transcription results, summary, and highlights
config = {
  "key" : secret_key,
  "task" : task,
  "summary" : True,
  "summary_size" : 15,
  "highlights" : True,
  "max_keywords" : 10
}

transcription = get_results(config)
print("Transcription: {}".format(transcription))

# export your transcription in SRT or VTT format
config = {
  "key" : secret_key,
  "task" : task,
  "output" : "srt",
  "max_caption_words" : 15
}

subtitles = get_results(config)
print("Subtitles: {}".format(subtitles))
                        

# create transcription task
curl -H "Content-Type:application/octet-stream" --data-binary @/path/to/your/file.m4a "https://api.speechtext.ai/recognize?key=SECRET_KEY&language=en-US&punctuation=true&speaker_detection=true&format=m4a"

# retrieve transcription results
curl -X GET "https://api.speechtext.ai/results?key=SECRET_KEY&task=TASK_ID&summary=true&summary_size=15&highlights=true&max_keywords=10"

# get captions
curl -X GET "https://api.speechtext.ai/results?key=SECRET_KEY&task=TASK_ID&output=srt&max_caption_words=10"

# process public URL
curl -X GET "https://api.speechtext.ai/recognize?key=SECRET_KEY&url=PUBLIC_URL&language=en-US&punctuation=true&speaker_detection=true&format=mp3"
                        

<?php

$secret_key = "SECRET_KEY";

# load the audio into memory
$filesize = filesize('/path/to/your/file.m4a');
$fp = fopen('/path/to/your/file.m4a', 'rb');
// read the entire file into a binary string
$binary = fread($fp, $filesize);
fclose($fp);

# endpoint and options to start a transcription task
$endpoint = "https://api.speechtext.ai/recognize?key=".$secret_key."&language=en-US&punctuation=true&speaker_detection=true&format=m4a";
$header = array('Content-type: application/octet-stream');

# curl connection initialization
$ch = curl_init();

# curl options
curl_setopt_array($ch, array(
    CURLOPT_URL => $endpoint,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HEADER => false,
    CURLOPT_HTTPHEADER => $header,
    CURLOPT_POSTFIELDS => $binary,
    CURLOPT_FOLLOWLOCATION => true
));

# send an audio transcription request
$body = curl_exec($ch);

if (curl_errno($ch))
{
    echo "CURL error: ".curl_error($ch);
}
else
{
    # parse JSON results
    $r = json_decode($body, true);
    # get the id of the speech recognition task
    $task = $r['id'];
    echo "Task ID: ".$task."\r\n";
    
    # endpoint to check status of the transcription task and retrieve results
    $endpoint = "https://api.speechtext.ai/results?key=".$secret_key."&task=".$task."&summary=true&summary_size=15&highlights=true&max_keywords=15";
    curl_setopt_array($ch, array(
        CURLOPT_URL => $endpoint,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => false,
        CURLOPT_HEADER => false,
        CURLOPT_FOLLOWLOCATION => true
    ));
    echo "Get transcription results, summary, and highlights\r\n";
    # use a loop to check if the task is finished
    while (true)
    {
        $body = curl_exec($ch);
        $results = json_decode($body, true);
        if (!array_key_exists('status', $results))
        {
            break;
        }
        echo "Task status: ".$results['status']."\r\n";
        if ($results['status'] == 'failed')
        {
            echo "The task failed!\r\n";
            break;
        }
        if ($results['status'] == 'finished')
        {
            break;
        }
        # sleep for 15 seconds while the task status is 'processing'
        sleep(15);
    }
    print_r($results);
}

curl_close($ch);
                        

import java.net.*;
import java.io.*;
import java.util.concurrent.TimeUnit;
import org.json.*;


public class Transcriber {

    public static void main(String[] args) throws Exception {
        String secret_key = "SECRET_KEY";
        HttpURLConnection conn;
        
        // endpoint and options to start a transcription task
        URL endpoint = new URL("https://api.speechtext.ai/recognize?key=" + secret_key +"&language=en-US&punctuation=true&speaker_detection=true&format=m4a");
        
        // loads the audio into memory
        File file = new File("/path/to/your/file.m4a");
        RandomAccessFile f = new RandomAccessFile(file, "r");
        long sz = f.length();
        byte[] post_body = new byte[(int) sz];
        f.readFully(post_body);
        f.close();
        
        // send an audio transcription request
        conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        
        conn.setDoOutput(true);
        conn.connect();
        OutputStream os = conn.getOutputStream();
        os.write(post_body);
        os.flush();
        os.close();
        
        int responseCode = conn.getResponseCode();
        
        if (responseCode == 200) {
            
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            StringBuffer response = new StringBuffer();
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
            in.close();
            String result = response.toString();
            JSONObject json = new JSONObject(result);
            // get the id of the speech recognition task
            String task = json.getString("id");
            System.out.println("Task ID: " + task);
            // endpoint to check status of the transcription task
            URL res_endpoint = new URL("https://api.speechtext.ai/results?key=" +secret_key + "&task=" + task + "&summary=true&summary_size=15&highlights=true&max_keywords=15");
            System.out.println("Get transcription results, summary, and highlights");
            // use a loop to check if the task is finished
            JSONObject results;
            while (true) {
                conn = (HttpURLConnection) res_endpoint.openConnection();
                conn.setRequestMethod("GET");
                in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                response = new StringBuffer();
                String res;
                while ((res = in.readLine()) != null) {
                    response.append(res);
                }
                in.close();
                results = new JSONObject(response.toString());
                System.out.println("Task status: " + results.getString("status"));
                if (results.getString("status").equals("failed")) {
                    System.out.println("Failed to transcribe!");
                    break;
                }
                if (results.getString("status").equals("finished")) {
                    System.out.println(results);
                    break;
                }
                // sleep for 15 seconds if the task has the status - 'processing'
                TimeUnit.SECONDS.sleep(15);
            }
            
        } else {
            
            System.out.println("Failed to transcribe!");
        }
    }
}
                        

Obtain an API Key

Every request to the SpeechText.AI API must include a secret key. If you do not have an API key, please subscribe to one of our pricing plans or sign up to obtain a free API key for non-commercial use.

Start a transcription task

To transcribe audio or video files, send a request to the recognize endpoint. The endpoint supports both POST and GET requests. A POST request body should contain the binary file content, sent with the Content-Type: application/octet-stream header. A GET request supports public URLs (e.g. shared Google Drive or Dropbox files). Links to videos hosted on platforms like YouTube or Vimeo are not valid because they are not direct download links.

When making a POST request to the recognize endpoint, you can include the following parameters:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes |
| language | The language of the supplied file as a BCP-47 language tag. The default value is en-US. | fr-FR | No |
| format | The format of the file to process. If not specified, the file format is detected automatically. | mp3 | No |
| punctuation | If true, adds punctuation to speech recognition results. The default value of false does not add punctuation. | true | No |
| speaker_detection | If true, enables speaker detection; each recognized word in the result is tagged with a speaker number. The default value is false. | true | No |

When making a GET request to the recognize endpoint, you can include the following parameters:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes |
| url | A URL that points to your audio file (e.g. public weblink, shared Google Drive or Dropbox file). | https://drive.google.com/file/d/18KHbC4_t3SKNbziEvQxOsOSCVOBJQ2W7/view?usp=sharing | Yes |
| language | The language of the supplied file as a BCP-47 language tag. The default value is en-US. | en-US | No |
| format | The format of the file to process. If not specified, the file format is detected automatically. | mp3 | No |
| punctuation | If true, adds punctuation to speech recognition results. The default value of false does not add punctuation. | true | No |
| speaker_detection | If true, enables speaker detection; each recognized word in the result is tagged with a speaker number. The default value is false. | true | No |
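The large code samples above upload a local file via POST; under the parameters listed here, a GET request for a publicly accessible file can be sketched in Python as follows. The helper names build_recognize_params and start_url_transcription are ours, not part of the API, and the example URL is a placeholder:

```python
import requests

API_BASE = "https://api.speechtext.ai"  # base URL from this reference

def build_recognize_params(secret_key, url, language="en-US",
                           punctuation=True, speaker_detection=False,
                           file_format=None):
    """Assemble query parameters for a GET request to the recognize endpoint."""
    params = {
        "key": secret_key,
        "url": url,
        "language": language,
        "punctuation": punctuation,
        "speaker_detection": speaker_detection,
    }
    if file_format is not None:
        params["format"] = file_format  # omitted -> format auto-detected by the API
    return params

def start_url_transcription(secret_key, public_url, **options):
    """Start a transcription task for a file behind a public URL; return the task id."""
    params = build_recognize_params(secret_key, public_url, **options)
    response = requests.get(API_BASE + "/recognize", params=params).json()
    return response["id"]
```

For example, `start_url_transcription("SECRET_KEY", "https://example.com/audio.mp3", file_format="mp3")` would start a task and return the id to poll against the results endpoint.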

After a successful POST/GET request, the speech recognition API will respond with the following JSON response:


{ 
    "status": "processing",
    "created_at": "2020-10-20 13:15:34",
    "id": "151d8043-cd20-442b-923b-64d6e633abfd"
}
                

The response contains the status of the new transcription task (processing) and the task id (151d8043-cd20-442b-923b-64d6e633abfd). You will need the id value to make GET requests against the API to get the result of your transcription as it completes.

Get the transcription result

To get the transcription result, you will have to make repeated GET requests to the results endpoint until the task status is finished or failed.
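The polling loops in the code samples above run until the task resolves; in practice it helps to cap the number of attempts so a stuck task does not poll forever. A sketch of such a bounded loop (the poll_results helper and its limits are our additions):

```python
import time

def poll_results(fetch, max_attempts=40, delay_seconds=15):
    """Call fetch() until the task status leaves 'processing' or attempts run out.

    fetch is any zero-argument callable returning the parsed JSON from the
    results endpoint (e.g. lambda: requests.get(endpoint, params=config).json()).
    """
    for _ in range(max_attempts):
        results = fetch()
        status = results.get("status")
        # a missing status, or 'finished'/'failed', means the response is final
        if status in (None, "finished", "failed"):
            return results
        time.sleep(delay_seconds)
    raise TimeoutError("transcription task did not complete in time")
```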

The API will respond with the following JSON response once the task status is set to finished:


{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    "transcript": "Social networks are huge nowadays. I live in France, so I specifically love Facebook because I can keep in contact with my family back home, but also when I am at home, I can keep in contact with friends. I haven't seen a long time. I have cousins who live far away and it's nice to see pictures of what they're up to so. My cousins have children now and they're always posting pictures of babies on Facebook. I get to see how the babies are growing...",
    "word_time_offsets": [
      {
        "word": "Social",
        "speaker": 0,
        "end_time": 1.89,
        "confidence": 1,
        "start_time": 1.53
      },
      {
        "word": "networks",
        "speaker": 0,
        "end_time": 2.61,
        "confidence": 0.9908,
        "start_time": 1.89
      },
      {
        "word": "are",
        "speaker": 0,
        "end_time": 3.599847,
        "confidence": 0.998155,
        "start_time": 2.67
      },
      {
        "word": "huge",
        "speaker": 0,
        "end_time": 4.41,
        "confidence": 1,
        "start_time": 3.63
      },
      {
        "word": "nowadays.",
        "speaker": 0,
        "end_time": 5.31,
        "confidence": 0.993838,
        "start_time": 4.56042
      },
      ...
    ]
  }
}
                

Depending on request parameters, the response will include the following fields:

word_time_offsets - word-level information for recognized words: each detected word with its start_time/end_time offsets and a confidence score. If the speaker_detection option was set to true, a distinct integer speaker tag is assigned to every speaker within the audio;

transcript - the transcription text for your audio;

remaining seconds - remaining seconds on your account balance after this transcription task.
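When speaker_detection is enabled, the speaker tags in word_time_offsets can be used to regroup the transcript per speaker. A small sketch (the words_by_speaker helper is our name, not part of the API):

```python
from collections import defaultdict

def words_by_speaker(word_time_offsets):
    """Group recognized words by their speaker tag, preserving word order."""
    grouped = defaultdict(list)
    for entry in word_time_offsets:
        grouped[entry["speaker"]].append(entry["word"])
    return {speaker: " ".join(words) for speaker, words in grouped.items()}

# entries in the shape shown in the response above
sample = [
    {"word": "Social", "speaker": 0, "start_time": 1.53, "end_time": 1.89, "confidence": 1},
    {"word": "networks", "speaker": 0, "start_time": 1.89, "end_time": 2.61, "confidence": 0.9908},
]
print(words_by_speaker(sample))  # {0: 'Social networks'}
```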

When making a GET request to the results endpoint, you can include the following parameters:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes |
| task | The unique id of your transcription task. | 151d8043-cd20-442b-923b-64d6e633abfd | Yes |

Auto-detecting keywords

The API automatically extracts the most frequent and most important keywords from your recording. This feature can be used to summarize the transcription text and understand the main topics discussed.

To enable automatic transcript highlights, set the highlights option to true when you send a GET request to the results endpoint. You can also use the max_keywords option to specify the maximum number of extracted keywords. For more information, see the Get the transcription result section above.

https://api.speechtext.ai/results?highlights=true&max_keywords=30

The automatic transcript highlights feature wraps every keyword in the transcription text in a special <kw> tag.


{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    "transcript": "<kw>Social networks</kw> are <kw>huge nowadays</kw>. I live in France, so I specifically love <kw>Facebook</kw> because I can keep in <kw>contact</kw> with my family back home, but also when I am at home, I can keep in <kw>contact</kw> with friends. I haven't seen a <kw>long time</kw>. I have cousins who live far away and it's nice to see pictures of what they're up to so. My cousins have children now and they're always posting pictures of babies on <kw>Facebook</kw>. I get to see how the babies are growing...",
  ...
}
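The <kw> tags in a highlighted transcript are plain markup, so the keyword list can be recovered with a regular expression; a sketch (extract_keywords is our name, not part of the API):

```python
import re

def extract_keywords(tagged_transcript):
    """Return unique <kw>-tagged keywords in order of first appearance."""
    keywords = []
    for match in re.findall(r"<kw>(.*?)</kw>", tagged_transcript):
        if match not in keywords:
            keywords.append(match)
    return keywords

tagged = ("<kw>Social networks</kw> are <kw>huge nowadays</kw>. "
          "I specifically love <kw>Facebook</kw>.")
print(extract_keywords(tagged))  # ['Social networks', 'huge nowadays', 'Facebook']
```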
                 

Audio summarization

The speech summarization feature automatically detects the most important sentences from audio or video content and generates an accurate extractive summary of the transcription text.

To generate an automatic summary of the transcription text, you can include the following parameters in requests to the results endpoint:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| summary | If set to true, a summary of the transcription text will be generated. The default value is false. | true | No |
| summary_size | An integer that determines the percentage of sentences from the original transcription text to include in the summary. The default value is 15. | 10 | No |
| summary_words | Determines how many words the output summary will contain. If summary_words is provided, the summary_size value is ignored. | 100 | No |

The transcription text must contain at least 5 sentences for summary generation. If the punctuation parameter is set to false or omitted, the summary won't be created.

You will get a response like the JSON response below:


{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    ...
      "summary": "Social networks are huge nowadays.
                 My cousins have children now and they're always posting pictures of babies on Facebook.
                 Twitter is another huge social network.
                 Who don't know how to use Twitter that they like Facebook better, but social media is huge right now and it's really cool."
  }
}
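Because summary_words overrides summary_size, it helps to build the query parameters in one place; a sketch of that logic (build_summary_params is our helper, not part of the API):

```python
def build_summary_params(secret_key, task_id, summary_words=None, summary_size=15):
    """Build results-endpoint parameters for summary generation.

    If summary_words is given, summary_size is omitted entirely, since the
    API ignores summary_size whenever summary_words is present.
    """
    params = {"key": secret_key, "task": task_id, "summary": True}
    if summary_words is not None:
        params["summary_words"] = summary_words
    else:
        params["summary_size"] = summary_size
    return params
```

The returned dict can be passed as params to requests.get against the results endpoint, as in the Python example at the top of this reference.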
                 

Export in SRT or VTT format

You can export your transcription results in SRT or VTT format by including the following parameters in requests to the results endpoint:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| output | If set to srt or vtt, the API returns the caption output for a transcription task. | srt | No |
| max_caption_words | The maximum number of transcribed words per caption. The default value is 15. | 20 | No |

If the output parameter is set to srt or vtt, all other options at the results endpoint are ignored.

The API will output a plain-text response in the following format:


1
00:00:01,510 --> 00:00:05,300
Social networks are huge nowadays.

2
00:00:05,310 --> 00:00:11,000
I live in France, so I specifically love Facebook because I can keep in

3
00:00:11,010 --> 00:00:16,400
contact with my family back home, but also when I am at home, I can

4
00:00:16,410 --> 00:00:18,400
keep in contact with friends

5
00:00:18,710 --> 00:00:20,900
I haven't seen a long time.

6
00:00:21,610 --> 00:00:28,800
I have cousins who live far away and it's nice to see pictures of
...
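The plain-text SRT output above follows the standard cue layout (index, time range, caption text, blank-line separator), so it can be parsed with a few lines of Python; a sketch (parse_srt is our helper, not part of the API):

```python
def parse_srt(srt_text):
    """Parse SRT caption output into (index, start, end, caption) tuples."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed or truncated blocks
        index = int(lines[0])
        start, _, end = lines[1].partition(" --> ")
        cues.append((index, start.strip(), end.strip(), " ".join(lines[2:])))
    return cues

sample = ("1\n00:00:01,510 --> 00:00:05,300\nSocial networks are huge nowadays.\n\n"
          "2\n00:00:05,310 --> 00:00:11,000\nI live in France")
print(parse_srt(sample)[0])  # (1, '00:00:01,510', '00:00:05,300', 'Social networks are huge nowadays.')
```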