Speech Recognition API Reference

SpeechText.AI provides a simple REST API for fast, accurate, multilingual speech-to-text conversion for most common media formats. Our speech recognition API can be used to transcribe audio/video files stored on your hard drive or files accessible over public URLs (HTTP, FTP, Google Drive, Dropbox, etc.).

For optimal results, capture audio at a sampling rate of 16 kHz or higher and transmit it in a lossless format. Do not re-sample the audio; submit it in its original format.

Each file has a size limit of 1 GB. If you need to process a file larger than 1 GB, we recommend compressing it before uploading.

Supported languages: English (en-US), Chinese (zh-CN), German (de-DE), Spanish (es-ES), French (fr-FR), Dutch (nl-NL), Italian (it-IT), Russian (ru-RU), Portuguese (pt-PT), Turkish (tr-TR), Greek (el-GR), Vietnamese (vi-VN).

The base URL for all API requests:

https://api.speechtext.ai/

The following example shows how to use the speech recognition API to transcribe and summarize audio data:


import requests
import time
import json

secret_key = "SECRET_KEY"

# retrieve transcription results for the task
def get_results(config):
  # endpoint to check status of the transcription task
  endpoint = "https://api.speechtext.ai/results?"
  # use a loop to check if the task is finished
  while True:
    results = requests.get(endpoint, params=config).json()
    if "status" not in results:
      break
    print("Task status: {}".format(results["status"]))
    if results["status"] == 'failed':
      print("The task failed: {}".format(results))
      break
    if results["status"] == 'finished':
      break
    # sleep for 15 seconds if the task has the status - 'processing'
    time.sleep(15)
  return results

# load the audio into memory
with open("/path/to/your/file.m4a", mode="rb") as file:
  post_body = file.read()

# endpoint to start a transcription task
endpoint = "https://api.speechtext.ai/recognize?"
header = {'Content-Type': "application/octet-stream"}

# transcription task options
config = {
  "key" : secret_key,
  "language" : "en-US",
  "punctuation" : True,
  "speaker_detection": True,
  "format" : "m4a"
}

# send an audio transcription request
r = requests.post(endpoint, headers = header, params = config, data = post_body).json()

# get the id of the speech recognition task
task = r["id"]
print("Task ID: {}".format(task))

# get transcription results, summary, and highlights
config = {
  "key" : secret_key,
  "task" : task,
  "summary" : True,
  "summary_size" : 15,
  "highlights" : True,
  "max_keywords" : 10
}

transcription = get_results(config)
print("Transcription: {}".format(transcription))

# export your transcription in SRT or VTT format
config = {
  "key" : secret_key,
  "task" : task,
  "output" : "srt",
  "max_caption_words" : 15
}

subtitles = get_results(config)
print("Subtitles: {}".format(subtitles))
                        

# create transcription task
curl -H "Content-Type:application/octet-stream" --data-binary @/path/to/your/file.m4a "https://api.speechtext.ai/recognize?key=SECRET_KEY&language=en-US&punctuation=true&speaker_detection=true&format=m4a"

# retrieve transcription results
curl -X GET "https://api.speechtext.ai/results?key=SECRET_KEY&task=TASK_ID&summary=true&summary_size=15&highlights=true&max_keywords=10"

# get captions
curl -X GET "https://api.speechtext.ai/results?key=SECRET_KEY&task=TASK_ID&output=srt&max_caption_words=10"

# process public URL
curl -X GET "https://api.speechtext.ai/recognize?key=SECRET_KEY&url=PUBLIC_URL&language=en-US&punctuation=true&speaker_detection=true&format=mp3"
                        

<?php

$secret_key = "SECRET_KEY";

# load the audio into memory
$filesize = filesize('/path/to/your/file.m4a');
$fp = fopen('/path/to/your/file.m4a', 'rb');
// read the entire file into a binary string
$binary = fread($fp, $filesize);
fclose($fp);

# endpoint and options to start a transcription task
$endpoint = "https://api.speechtext.ai/recognize?key=".$secret_key."&language=en-US&punctuation=true&speaker_detection=true&format=m4a";
$header = array('Content-type: application/octet-stream');

# curl connection initialization
$ch = curl_init();

# curl options
curl_setopt_array($ch, array(
    CURLOPT_URL => $endpoint,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST => true,
    CURLOPT_HEADER => false,
    CURLOPT_HTTPHEADER => $header,
    CURLOPT_POSTFIELDS => $binary,
    CURLOPT_FOLLOWLOCATION => true
));

# send an audio transcription request
$body = curl_exec($ch);

if (curl_errno($ch))
{
    echo "CURL error: ".curl_error($ch);
}
else
{
    # parse JSON results
    $r = json_decode($body, true);
    # get the id of the speech recognition task
    $task = $r['id'];
    echo "Task ID: ".$task."\r\n";
    
    # endpoint to check status of the transcription task and retrieve results
    $endpoint = "https://api.speechtext.ai/results?key=".$secret_key."&task=".$task."&summary=true&summary_size=15&highlights=true&max_keywords=15";
    curl_setopt_array($ch, array(
        CURLOPT_URL => $endpoint,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST => false,
        CURLOPT_HEADER => false,
        CURLOPT_FOLLOWLOCATION => true
    ));
    echo "Get transcription results, summary, and highlights\r\n";
    # use a loop to check if the task is finished
    while (true)
    {
        $body = curl_exec($ch);
        $results = json_decode($body, true);
        if (!array_key_exists('status', $results))
        {
            break;
        }
        echo "Task status: ".$results['status']."\r\n";
        if ($results['status'] == 'failed')
        {
            echo "The task failed!\r\n";
            break;
        }
        if ($results['status'] == 'finished')
        {
            break;
        }
        # sleep for 15 seconds while the task status is 'processing'
        sleep(15);
    }
    print_r($results);
}

curl_close($ch);
                        

import java.net.*;
import java.io.*;
import java.util.concurrent.TimeUnit;
import org.json.*;


public class Transcriber {

    public static void main(String[] args) throws Exception {
        String secret_key = "SECRET_KEY";
        HttpURLConnection conn;
        
        // endpoint and options to start a transcription task
        URL endpoint = new URL("https://api.speechtext.ai/recognize?key=" + secret_key +"&language=en-US&punctuation=true&speaker_detection=true&format=m4a");
        
        // loads the audio into memory
        File file = new File("/path/to/your/file.m4a");
        RandomAccessFile f = new RandomAccessFile(file, "r");
        long sz = f.length();
        byte[] post_body = new byte[(int) sz];
        f.readFully(post_body);
        f.close();
        
        // send an audio transcription request
        conn = (HttpURLConnection) endpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        
        conn.setDoOutput(true);
        conn.connect();
        OutputStream os = conn.getOutputStream();
        os.write(post_body);
        os.flush();
        os.close();
        
        int responseCode = conn.getResponseCode();
        
        if (responseCode == 200) {
            
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            StringBuffer response = new StringBuffer();
            while ((line = in.readLine()) != null) {
                response.append(line);
            }
            in.close();
            String result = response.toString();
            JSONObject json = new JSONObject(result);
            // get the id of the speech recognition task
            String task = json.getString("id");
            System.out.println("Task ID: " + task);
            // endpoint to check status of the transcription task
            URL res_endpoint = new URL("https://api.speechtext.ai/results?key=" +secret_key + "&task=" + task + "&summary=true&summary_size=15&highlights=true&max_keywords=15");
            System.out.println("Get transcription results, summary, and highlights");
            // use a loop to check if the task is finished
            JSONObject results;
            while (true) {
                conn = (HttpURLConnection) res_endpoint.openConnection();
                conn.setRequestMethod("GET");
                in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
                response = new StringBuffer();
                String res;
                while ((res = in.readLine()) != null) {
                    response.append(res);
                }
                in.close();
                results = new JSONObject(response.toString());
                System.out.println("Task status: " + results.getString("status"));
                if (results.getString("status").equals("failed")) {
                    System.out.println("Failed to transcribe!");
                    break;
                }
                if (results.getString("status").equals("finished")) {
                    System.out.println(results);
                    break;
                }
                // sleep for 15 seconds if the task has the status - 'processing'
                TimeUnit.SECONDS.sleep(15);
            }
            
        } else {
            
            System.out.println("Failed to transcribe!");
        }
    }
}
                        

Obtain an API Key

Every request to the SpeechText.AI API must include a secret key. If you do not have an API key, please subscribe to one of our pricing plans or sign up to obtain a free API key for non-commercial use.

Start a transcription task

To transcribe audio or video files, send a request to the recognize endpoint. The endpoint supports both POST and GET requests. A POST request body should contain the binary file content, sent with the Content-Type: application/octet-stream header. A GET request supports public URLs (e.g. shared Google Drive or Dropbox files). Links to videos hosted on platforms like YouTube or Vimeo are not valid because they are not direct download links.

When making a POST request to the recognize endpoint, you can include the following parameters:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes |
| language | The language of the supplied file as a BCP-47 language tag. The default value is en-US. | fr-FR | No |
| format | The format of the file to process. If not specified, the file format is detected automatically. | mp3 | No |
| punctuation | If true, adds punctuation to speech recognition results. The default value of false does not add punctuation. | true | No |
| speaker_detection | If true, enables speaker detection; each recognized word in the result is tagged with a speaker number. The default value is false. | true | No |

When making a GET request to the recognize endpoint, you can include the following parameters:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes |
| url | A URL that points to your audio file (e.g. public weblink, shared Google Drive or Dropbox file). | https://drive.google.com/file/d/18KHbC4_t3SKNbziEvQxOsOSCVOBJQ2W7/view?usp=sharing | Yes |
| language | The language of the supplied file as a BCP-47 language tag. The default value is en-US. | en-US | No |
| format | The format of the file to process. If not specified, the file format is detected automatically. | mp3 | No |
| punctuation | If true, adds punctuation to speech recognition results. The default value of false does not add punctuation. | true | No |
| speaker_detection | If true, enables speaker detection; each recognized word in the result is tagged with a speaker number. The default value is false. | true | No |
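The large code samples above upload a local file via POST; under the parameters listed here, a GET request for a publicly accessible file can be sketched in Python as follows. The helper names build_recognize_params and start_url_transcription are ours, not part of the API, and the example URL is a placeholder:

```python
import requests

API_BASE = "https://api.speechtext.ai"  # base URL from this reference

def build_recognize_params(secret_key, url, language="en-US",
                           punctuation=True, speaker_detection=False,
                           file_format=None):
    """Assemble query parameters for a GET request to the recognize endpoint."""
    params = {
        "key": secret_key,
        "url": url,
        "language": language,
        "punctuation": punctuation,
        "speaker_detection": speaker_detection,
    }
    if file_format is not None:
        params["format"] = file_format  # omitted -> format auto-detected by the API
    return params

def start_url_transcription(secret_key, public_url, **options):
    """Start a transcription task for a file behind a public URL; return the task id."""
    params = build_recognize_params(secret_key, public_url, **options)
    response = requests.get(API_BASE + "/recognize", params=params).json()
    return response["id"]
```

For example, `start_url_transcription("SECRET_KEY", "https://example.com/audio.mp3", file_format="mp3")` would start a task and return the id to poll against the results endpoint.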

After a successful POST/GET request, the speech recognition API will respond with the following JSON response:


{ 
    "status": "processing",
    "created_at": "2020-10-20 13:15:34",
    "id": "151d8043-cd20-442b-923b-64d6e633abfd"
}
                

The response contains the status of the new transcription task (processing) and the task id (151d8043-cd20-442b-923b-64d6e633abfd). You will need the id value to make GET requests against the API to get the result of your transcription as it completes.

Get the transcription result

To get the transcription result, you will have to make repeated GET requests to the results endpoint until the task status is finished or failed.
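The polling loops in the code samples above run until the task resolves; in practice it helps to cap the number of attempts so a stuck task does not poll forever. A sketch of such a bounded loop (the poll_results helper and its limits are our additions):

```python
import time

def poll_results(fetch, max_attempts=40, delay_seconds=15):
    """Call fetch() until the task status leaves 'processing' or attempts run out.

    fetch is any zero-argument callable returning the parsed JSON from the
    results endpoint (e.g. lambda: requests.get(endpoint, params=config).json()).
    """
    for _ in range(max_attempts):
        results = fetch()
        status = results.get("status")
        # a missing status, or 'finished'/'failed', means the response is final
        if status in (None, "finished", "failed"):
            return results
        time.sleep(delay_seconds)
    raise TimeoutError("transcription task did not complete in time")
```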

The API will respond with the following JSON response once the task status is set to finished:


{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    "transcript": "Social networks are huge nowadays. I live in France, so I specifically love Facebook because I can keep in contact with my family back home, but also when I am at home, I can keep in contact with friends. I haven't seen a long time. I have cousins who live far away and it's nice to see pictures of what they're up to so. My cousins have children now and they're always posting pictures of babies on Facebook. I get to see how the babies are growing...",
    "word_time_offsets": [
      {
        "word": "Social",
        "speaker": 0,
        "end_time": 1.89,
        "confidence": 1,
        "start_time": 1.53
      },
      {
        "word": "networks",
        "speaker": 0,
        "end_time": 2.61,
        "confidence": 0.9908,
        "start_time": 1.89
      },
      {
        "word": "are",
        "speaker": 0,
        "end_time": 3.599847,
        "confidence": 0.998155,
        "start_time": 2.67
      },
      {
        "word": "huge",
        "speaker": 0,
        "end_time": 4.41,
        "confidence": 1,
        "start_time": 3.63
      },
      {
        "word": "nowadays.",
        "speaker": 0,
        "end_time": 5.31,
        "confidence": 0.993838,
        "start_time": 4.56042
      },
      ...
    ]
  }
}
                

Depending on request parameters, the response will include the following fields:

word_time_offsets - word-level information for recognized words: each detected word with its start_time/end_time offsets and a confidence score. If the speaker_detection option was set to true, a distinct integer speaker tag is assigned to every speaker within the audio;

transcript - the transcription text for your audio;

remaining seconds - remaining seconds on your account balance after this transcription task.
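When speaker_detection is enabled, the speaker tags in word_time_offsets can be used to regroup the transcript per speaker. A small sketch (the words_by_speaker helper is our name, not part of the API):

```python
from collections import defaultdict

def words_by_speaker(word_time_offsets):
    """Group recognized words by their speaker tag, preserving word order."""
    grouped = defaultdict(list)
    for entry in word_time_offsets:
        grouped[entry["speaker"]].append(entry["word"])
    return {speaker: " ".join(words) for speaker, words in grouped.items()}

# entries in the shape shown in the response above
sample = [
    {"word": "Social", "speaker": 0, "start_time": 1.53, "end_time": 1.89, "confidence": 1},
    {"word": "networks", "speaker": 0, "start_time": 1.89, "end_time": 2.61, "confidence": 0.9908},
]
print(words_by_speaker(sample))  # {0: 'Social networks'}
```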

When making a GET request to the results endpoint, you can include the following parameters:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| key | Your secret API key. | 01201b3qdb30480cbc0d61608ef239d1 | Yes |
| task | The unique id of your transcription task. | 151d8043-cd20-442b-923b-64d6e633abfd | Yes |

Auto-detecting keywords

The API automatically extracts the most frequent and most important keywords from your recording. This feature can be used to summarize the transcription text and understand the main topics discussed.

To enable automatic transcript highlights, set the highlights option to true when you send a GET request to the results endpoint. You can also use the max_keywords option to specify the maximum number of extracted keywords. For more information, see the Get the transcription result section above.

https://api.speechtext.ai/results?highlights=true&max_keywords=30

The automatic transcript highlights feature wraps every keyword in the transcription text in a special <kw> tag.


{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    "transcript": "<kw>Social networks</kw> are <kw>huge nowadays</kw>. I live in France, so I specifically love <kw>Facebook</kw> because I can keep in <kw>contact</kw> with my family back home, but also when I am at home, I can keep in <kw>contact</kw> with friends. I haven't seen a <kw>long time</kw>. I have cousins who live far away and it's nice to see pictures of what they're up to so. My cousins have children now and they're always posting pictures of babies on <kw>Facebook</kw>. I get to see how the babies are growing...",
  ...
}
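The <kw> tags in a highlighted transcript are plain markup, so the keyword list can be recovered with a regular expression; a sketch (extract_keywords is our name, not part of the API):

```python
import re

def extract_keywords(tagged_transcript):
    """Return unique <kw>-tagged keywords in order of first appearance."""
    keywords = []
    for match in re.findall(r"<kw>(.*?)</kw>", tagged_transcript):
        if match not in keywords:
            keywords.append(match)
    return keywords

tagged = ("<kw>Social networks</kw> are <kw>huge nowadays</kw>. "
          "I specifically love <kw>Facebook</kw>.")
print(extract_keywords(tagged))  # ['Social networks', 'huge nowadays', 'Facebook']
```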
                 

Audio summarization

The speech summarization feature automatically detects the most important sentences from audio or video content and generates an accurate extractive summary of the transcription text.

To generate an automatic summary of the transcription text, you can include the following parameters in requests to the results endpoint:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| summary | If set to true, a summary of the transcription text will be generated. The default value is false. | true | No |
| summary_size | An integer that determines the percentage of sentences from the original transcription text to include in the summary. The default value is 15. | 10 | No |
| summary_words | Determines how many words the output summary will contain. If summary_words is provided, the summary_size value is ignored. | 100 | No |

The transcription text must contain at least 5 sentences for summary generation. If the punctuation parameter is set to false or omitted, the summary won't be created.

You will get a response like the JSON response below:


{
  "status": "finished",
  "remaining seconds": 3346,
  "results": {
    ...
      "summary": "Social networks are huge nowadays.
                 My cousins have children now and they're always posting pictures of babies on Facebook.
                 Twitter is another huge social network.
                 Who don't know how to use Twitter that they like Facebook better, but social media is huge right now and it's really cool."
  }
}
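Because summary_words overrides summary_size, it helps to build the query parameters in one place; a sketch of that logic (build_summary_params is our helper, not part of the API):

```python
def build_summary_params(secret_key, task_id, summary_words=None, summary_size=15):
    """Build results-endpoint parameters for summary generation.

    If summary_words is given, summary_size is omitted entirely, since the
    API ignores summary_size whenever summary_words is present.
    """
    params = {"key": secret_key, "task": task_id, "summary": True}
    if summary_words is not None:
        params["summary_words"] = summary_words
    else:
        params["summary_size"] = summary_size
    return params
```

The returned dict can be passed as params to requests.get against the results endpoint, as in the Python example at the top of this reference.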
                 

Export in SRT or VTT format

You can export your transcription results in SRT or VTT format by including the following parameters in requests to the results endpoint:

| Parameter | Description | Example | Required |
|-----------|-------------|---------|----------|
| output | If set to srt or vtt, the API returns the caption output for a transcription task. | srt | No |
| max_caption_words | The maximum number of transcribed words per caption. The default value is 15. | 20 | No |

If the output parameter is set to srt or vtt, all other options at the results endpoint are ignored.

The API will output a plain-text response in the following format:


1
00:00:01,510 --> 00:00:05,300
Social networks are huge nowadays.

2
00:00:05,310 --> 00:00:11,000
I live in France, so I specifically love Facebook because I can keep in

3
00:00:11,010 --> 00:00:16,400
contact with my family back home, but also when I am at home, I can

4
00:00:16,410 --> 00:00:18,400
keep in contact with friends

5
00:00:18,710 --> 00:00:20,900
I haven't seen a long time.

6
00:00:21,610 --> 00:00:28,800
I have cousins who live far away and it's nice to see pictures of
...
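The plain-text SRT output above follows the standard cue layout (index, time range, caption text, blank-line separator), so it can be parsed with a few lines of Python; a sketch (parse_srt is our helper, not part of the API):

```python
def parse_srt(srt_text):
    """Parse SRT caption output into (index, start, end, caption) tuples."""
    cues = []
    for block in srt_text.strip().split("\n\n"):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip malformed or truncated blocks
        index = int(lines[0])
        start, _, end = lines[1].partition(" --> ")
        cues.append((index, start.strip(), end.strip(), " ".join(lines[2:])))
    return cues

sample = ("1\n00:00:01,510 --> 00:00:05,300\nSocial networks are huge nowadays.\n\n"
          "2\n00:00:05,310 --> 00:00:11,000\nI live in France")
print(parse_srt(sample)[0])  # (1, '00:00:01,510', '00:00:05,300', 'Social networks are huge nowadays.')
```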