Speech & Audio API

Transcribe speech, synthesize voice, and generate audio through one unified API. Access leading speech-to-text (STT), text-to-speech (TTS), and audio generation models with one key, unified billing, and no per-provider wiring.

Whisper, GPT-4o Audio, ElevenLabs, Suno

Speech, voice, and generated audio on one billing layer.

Transcriptions, voice synthesis, sound effects, and music generation all plug into the same API key and observability surface.

  • 3 task families: STT, TTS, and generative audio in a single modality layer.
  • Sync + async request modes: short clips stay synchronous while long jobs use the same async flow as video.
  • 1 key for the auth layer: reuse the same key across image, video, routing, and audio workloads.

Overview

Why one API for speech and audio

Audio tasks span a wide surface: converting speech to text, generating natural-sounding voice from copy, producing sound effects or music beds, and processing audio for downstream analysis. Each category has its own leading models — and each model has its own SDK, key, and billing structure.

ImaRouter unifies all of them behind one base URL and request schema. You pick the task type and model; we handle provider auth, format normalization, and delivery. The same API key that routes your video and image generation also handles your audio workload.

  • One key for STT, TTS, and audio generation across all providers
  • Unified request schema — no per-provider adapter or SDK
  • Audio output delivered as URL or base64 depending on model
  • Consistent error handling and retry across all audio providers
  • All usage logged and billed on a single invoice

Supported models and capabilities

ImaRouter's audio catalog covers three core task types. Each model has a distinct strength — picking the right model for the job reduces cost and improves output quality.

  • Whisper large-v3 — Accurate multilingual speech-to-text, 99 languages, low word error rate
  • GPT-4o Audio — Real-time speech understanding with contextual response generation
  • ElevenLabs Turbo — Natural TTS with emotion control, low latency, voice cloning support
  • OpenAI TTS-1 HD — High-fidelity voice synthesis, 6 built-in voices, OpenAI-native
  • ElevenLabs Sound Effects — Generative audio from text descriptions, ideal for UX and games
  • Suno v4 — Music generation from text prompts, full instrumental tracks
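One way to keep model choice explicit in application code is a small lookup keyed by task family. The sketch below groups the display names listed above under the three task families this page describes. The grouping (in particular, placing GPT-4o Audio under speech-to-text), the `stt`/`tts`/`generate` keys, and the `models_for` helper are assumptions for illustration; they are not ImaRouter model IDs.

```python
# Display names are taken from the list above. The task-family keys and the
# grouping are illustrative assumptions, not official ImaRouter identifiers.
AUDIO_CATALOG: dict[str, list[str]] = {
    "stt": ["Whisper large-v3", "GPT-4o Audio"],
    "tts": ["ElevenLabs Turbo", "OpenAI TTS-1 HD"],
    "generate": ["ElevenLabs Sound Effects", "Suno v4"],
}


def models_for(task: str) -> list[str]:
    """Return a copy of the catalog entries for a task family."""
    try:
        return list(AUDIO_CATALOG[task])
    except KeyError:
        raise ValueError(f"unknown task family: {task!r}") from None


if __name__ == "__main__":
    for task in AUDIO_CATALOG:
        print(task, "->", ", ".join(models_for(task)))
```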

Capabilities

What you can build

At scale, the Audio API supports product patterns that need high-throughput processing or real-time interaction. These are the integration patterns teams are running in production.

  • Meeting and call transcription: batch-process audio files with Whisper for CRM notes and summaries
  • Voice interfaces: add speech input/output to web or mobile apps with sub-second latency TTS
  • Content localization: synthesize narration in multiple languages from the same script
  • Podcast and video production: generate voice-over tracks from script copy at scale
  • Customer support automation: transcribe support calls and feed structured text to LLM pipelines
  • Game and app sound design: generate custom sound effects from text descriptions on demand

How audio requests work

Speech-to-text and short TTS requests are synchronous — submit audio or text and receive the result directly in the response. For longer TTS or music generation tasks, ImaRouter uses the same async job pattern as video generation.

Submit a generation job and receive a job ID. Poll for completion or set a webhook callback. Audio output is delivered as an MP3 or WAV file via CDN URL, or as a base64 string for inline use.

  • POST /v1/audio/transcriptions — synchronous STT, returns transcript text
  • POST /v1/audio/speech — synchronous TTS for short clips, returns audio URL
  • POST /v1/audio/generate/async — async for music and long-form synthesis
  • Webhook: receive completed audio payload at your endpoint without polling
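The submit-then-poll flow described above can be sketched as a small loop. This is a minimal sketch under stated assumptions: the page does not document the job-status response, so the `status`, `completed`, `failed`, `error`, and `audio_url` names are hypothetical, and `fetch_status` stands in for whatever HTTP call retrieves the job by ID.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status(job_id) until the job completes, fails, or times out.

    The field names and status values are hypothetical placeholders.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Fake status source for a dry run: pending twice, then completed.
    responses = iter([
        {"status": "pending"},
        {"status": "pending"},
        {"status": "completed", "audio_url": "https://example.com/out.mp3"},
    ])
    result = wait_for_job(lambda _id: next(responses), "job_123",
                          sleep=lambda seconds: None)
    print(result["audio_url"])
```

Passing `fetch_status` and `sleep` in as arguments keeps the loop testable without network access. A webhook callback removes the need for this loop entirely.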

Getting started

Send a POST request to the relevant audio endpoint with your input (audio file URL or text), model ID, and output preferences. Use your ImaRouter API key — the same one used for image and video generation.

For TTS, the request schema is compatible with the OpenAI audio API. Swap the base URL to immediately access ElevenLabs, GPT-4o Audio, and other models without changing your request code.

  • STT: POST /v1/audio/transcriptions — params: file (URL), model, language
  • TTS: POST /v1/audio/speech — params: input (text), model, voice, response_format
  • Music: POST /v1/audio/generate — params: prompt, model, duration_seconds
  • Base URL: https://api.imarouter.com/v1 — compatible with OpenAI audio client shape
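As a rough illustration of the parameters listed above, the following sketch builds STT and TTS requests with Python's standard library and stops short of sending them. The base URL and parameter names come from this page. The Bearer authorization header, the JSON request body, the model ID strings, and the voice name are assumptions; check them against your dashboard before use.

```python
import json
import urllib.request

BASE_URL = "https://api.imarouter.com/v1"  # from the list above


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an audio endpoint.

    Bearer auth and a JSON body are assumptions based on the stated
    OpenAI-compatible client shape.
    """
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Speech-to-text: file is passed as a URL. The model ID is a hypothetical placeholder.
stt_request = build_request(
    "/audio/transcriptions",
    {"file": "https://example.com/call.mp3", "model": "whisper-large-v3", "language": "en"},
    api_key="YOUR_API_KEY",
)

# Text-to-speech: the model ID and voice name are hypothetical placeholders.
tts_request = build_request(
    "/audio/speech",
    {"input": "Welcome back.", "model": "tts-1-hd", "voice": "alloy", "response_format": "mp3"},
    api_key="YOUR_API_KEY",
)

if __name__ == "__main__":
    for req in (stt_request, tts_request):
        print(req.get_method(), req.full_url)
    # To send a request: urllib.request.urlopen(stt_request)
```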

FAQ

What audio file formats are supported for transcription?

ImaRouter accepts MP3, MP4, MPEG, MPGA, M4A, WAV, and WEBM for speech-to-text. Pass the file as a URL in the request body. The maximum file size is 25 MB per request, so split larger files before submission.
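A client-side pre-check can catch unsupported formats and oversized files before a request is sent. The sketch below uses the formats and the 25 MB limit stated in this answer. The `check_transcription_input` helper is hypothetical, and treating 25 MB as 25 × 1024 × 1024 bytes is an assumption, since the page does not say how the limit is measured.

```python
from pathlib import PurePosixPath

# Formats listed in the answer above, as lowercase file extensions.
ALLOWED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}

# Assumption: the 25 MB limit is interpreted as binary megabytes.
MAX_BYTES = 25 * 1024 * 1024


def check_transcription_input(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = PurePosixPath(filename).suffix.lower().lstrip(".")
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems


if __name__ == "__main__":
    print(check_transcription_input("standup.m4a", 4_000_000))
    print(check_transcription_input("webinar.flac", 40_000_000))
```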

Can I clone a voice for TTS?

Voice cloning is available through ElevenLabs models. Upload a reference audio sample through the dashboard to create a custom voice ID, then reference that ID in your TTS requests. Cloned voice IDs are scoped to your account.

How are STT and TTS priced differently?

Speech-to-text is priced per minute of audio input. Text-to-speech is priced per thousand characters of input text. Music generation is priced per generated second. All audio usage is consolidated on your single monthly invoice.
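To see how the three billing units combine, here is a minimal cost estimate. The units (per minute, per thousand characters, per generated second) come from the answer above. The `AudioRates` type, the `estimate_cost` helper, and every rate in the example are placeholders, not ImaRouter prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRates:
    """Per-unit rates in your billing currency. Values are supplied by the caller."""
    stt_per_minute: float
    tts_per_1k_chars: float
    music_per_second: float


def estimate_cost(
    rates: AudioRates,
    stt_minutes: float = 0.0,
    tts_chars: int = 0,
    music_seconds: float = 0.0,
) -> float:
    """Sum the three usage types, each in its own billing unit."""
    return (
        stt_minutes * rates.stt_per_minute
        + (tts_chars / 1000) * rates.tts_per_1k_chars
        + music_seconds * rates.music_per_second
    )


if __name__ == "__main__":
    # Placeholder rates chosen only to make the arithmetic easy to follow.
    example_rates = AudioRates(stt_per_minute=0.5, tts_per_1k_chars=2.0, music_per_second=0.25)
    print(estimate_cost(example_rates, stt_minutes=10, tts_chars=3000, music_seconds=8))
```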

Can I use audio alongside video generation in the same pipeline?

Yes. Since all modalities share the same API key and billing layer, you can call video generation and audio synthesis in the same application without managing separate accounts. A common pattern is generating a video clip and a voice-over track in parallel, then combining them in post.
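The parallel pattern mentioned above can be sketched with a thread pool. In this sketch `make_video` and `make_voiceover` are hypothetical callables that wrap your own video and TTS requests. The example passes stand-in functions so it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def generate_in_parallel(
    make_video: Callable[[str], str],
    make_voiceover: Callable[[str], str],
    script: str,
) -> dict[str, str]:
    """Run video and voice-over generation concurrently and return both results.

    Both callables are hypothetical wrappers around your own API calls.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        video_future = pool.submit(make_video, script)
        audio_future = pool.submit(make_voiceover, script)
        # result() blocks until each task finishes and re-raises its exception, if any.
        return {"video": video_future.result(), "audio": audio_future.result()}


if __name__ == "__main__":
    # Stand-ins that return fake URLs instead of calling an API.
    assets = generate_in_parallel(
        lambda script: "https://example.com/clip.mp4",
        lambda script: "https://example.com/voiceover.mp3",
        "Introducing the spring collection.",
    )
    print(assets)
```

Threads suit this case because both tasks spend most of their time waiting on network responses.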
