Speech & Audio API

Transcribe speech, synthesize voice with ElevenLabs, and generate music with Suno through one unified API. Access leading speech-to-text (STT), text-to-speech (TTS), and audio generation models with one key, unified billing, and no per-provider wiring.

Whisper, GPT-4o Audio, ElevenLabs, Suno

Audio preview

Speech, voice, and generated audio on one billing layer.

Transcriptions, voice synthesis, sound effects, and music generation all plug into the same API key and observability surface.

Task families

STT, TTS, and generative audio in a single modality layer.

Sync + Async

Request modes

Short clips stay synchronous while long jobs use the same async flow as video.

1 Key

Auth layer

Reuse the same key across image, video, routing, and audio workloads.

Overview

Why one API for speech and audio

Audio tasks span a wide surface: converting speech to text, generating natural-sounding voice from copy, producing sound effects or music beds, and processing audio for downstream analysis. Each category has its own leading models — and each model has its own SDK, key, and billing structure.

ImaRouter unifies all of them behind one base URL and one API key. You pick the task type and model; ImaRouter handles provider auth, format normalization, and delivery. The same API key that routes your video and image generation also handles your audio workloads.

One key for STT, TTS, and audio generation across all providers
Unified request schema — no per-provider adapter or SDK
Audio output delivered as URL or base64 depending on model
Consistent error handling and retry across all audio providers
All usage logged and billed on a single invoice
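
Because output arrives as a URL or as base64 depending on the model, client code usually needs to accept both forms. The sketch below is illustrative only: the field names `audio_url` and `audio_base64` are assumptions, since this page does not document the response schema.

```python
import base64
from typing import Tuple, Union


def extract_audio(result: dict) -> Tuple[str, Union[bytes, str]]:
    """Return ("bytes", data) for inline audio or ("url", link) for hosted audio.

    The keys "audio_base64" and "audio_url" are hypothetical placeholders;
    substitute whatever field names the real response uses.
    """
    inline = result.get("audio_base64")
    if inline:
        # Inline delivery: decode the base64 string into raw audio bytes.
        return "bytes", base64.b64decode(inline)
    link = result.get("audio_url")
    if link:
        # Hosted delivery: hand the URL back for the caller to download.
        return "url", link
    raise ValueError("response contains neither inline audio nor a URL")
```

Returning a tagged pair keeps the download step out of the parser, so the caller decides whether to fetch the URL immediately or pass it along.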

Supported models and capabilities

ImaRouter's audio catalog covers three core task types: speech-to-text, text-to-speech, and generative audio (sound effects and music). Each model has a distinct strength, so picking the right model for the job reduces cost and improves output quality.

Whisper large-v3 — Accurate multilingual speech-to-text, 99 languages, low word error rate
GPT-4o Audio — Real-time speech understanding with contextual response generation
ElevenLabs Turbo — Natural TTS with emotion control, low latency, voice cloning support
OpenAI TTS-1 HD — High-fidelity voice synthesis, 6 built-in voices, OpenAI-native
ElevenLabs Sound Effects — Generative audio from text descriptions, ideal for UX and games
Suno v4 — Music generation from text prompts, full instrumental tracks

Capabilities

What you can build

The Audio API supports product patterns that need high-throughput batch processing or real-time interaction. Teams use the following integration patterns in production.

Meeting and call transcription: batch-process audio files with Whisper for CRM notes and summaries
Voice interfaces: add speech input/output to web or mobile apps with sub-second latency TTS
Content localization: synthesize narration in multiple languages from the same script
Podcast and video production: generate voice-over tracks from script copy at scale
Customer support automation: transcribe support calls and feed structured text to LLM pipelines
Game and app sound design: generate custom sound effects from text descriptions on demand

How audio requests work

Speech-to-text and short TTS requests are synchronous — submit audio or text and receive the result directly in the response. For longer TTS or music generation tasks, ImaRouter uses the same async job pattern as video generation.

Submit a generation job and receive a job ID. Poll for completion or set a webhook callback. Audio output is delivered as an MP3 or WAV file via CDN URL, or as a base64 string for inline use.

POST /v1/audio/transcriptions — synchronous STT, returns transcript text
POST /v1/audio/speech — synchronous TTS for short clips, returns audio URL
POST /v1/audio/generate/async — async for music and long-form synthesis
Webhook: receive completed audio payload at your endpoint without polling
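
The submit-then-poll flow described above can be sketched as a small loop. This page does not document the job status endpoint or its response fields, so `fetch_status`, the `status` values ("completed", "failed"), and the `error` field are all assumptions; the caller supplies `fetch_status` as whatever function actually looks up a job.

```python
import time
from typing import Callable, Dict


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a job until it reaches a terminal state or the timeout elapses.

    fetch_status is a caller-supplied lookup (hypothetical); the status
    strings checked here are assumed, not taken from ImaRouter docs.
    """
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        state = job.get("status")
        if state == "completed":
            return job
        if state == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still {state!r} after {waited:.0f}s")
        # Wait before the next poll; only time spent sleeping is counted.
        sleep(interval)
        waited += interval
```

Injecting `sleep` makes the loop easy to test without real delays. If you register a webhook instead, this loop is unnecessary.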

Getting started

Send a POST request to the relevant audio endpoint with your input (audio file URL or text), model ID, and output preferences. Use your ImaRouter API key — the same one used for image and video generation.

For TTS, the request schema is compatible with the OpenAI audio API. Point an existing client at the ImaRouter base URL and change the model ID to reach ElevenLabs, GPT-4o Audio, and other models without rewriting your request code.

STT: POST /v1/audio/transcriptions — params: file (URL), model, language
TTS: POST /v1/audio/speech — params: input (text), model, voice, response_format
Music: POST /v1/audio/generate — params: prompt, model, duration_seconds
Base URL: https://api.imarouter.com/v1 — compatible with OpenAI audio client shape
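
The parameter lists above map onto request builders like the following. This is a minimal sketch using only the Python standard library. The base URL, paths, and parameter names come from this page; the `Authorization: Bearer` header, the JSON body encoding, and the default `response_format` value "mp3" are assumptions, and the model and voice IDs in the usage note are placeholders. The music builder uses the `/audio/generate` path listed here; the async variant listed earlier is `/audio/generate/async`.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "https://api.imarouter.com/v1"  # base URL given on this page


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST. Bearer auth is an assumption."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcription_request(api_key: str, file_url: str, model: str,
                          language: Optional[str] = None) -> urllib.request.Request:
    """STT: params file (URL), model, language."""
    payload = {"file": file_url, "model": model}
    if language:
        payload["language"] = language
    return build_request("/audio/transcriptions", payload, api_key)


def speech_request(api_key: str, text: str, model: str, voice: str,
                   response_format: str = "mp3") -> urllib.request.Request:
    """TTS: params input (text), model, voice, response_format."""
    payload = {"input": text, "model": model, "voice": voice,
               "response_format": response_format}
    return build_request("/audio/speech", payload, api_key)


def music_request(api_key: str, prompt: str, model: str,
                  duration_seconds: int) -> urllib.request.Request:
    """Music: params prompt, model, duration_seconds."""
    payload = {"prompt": prompt, "model": model,
               "duration_seconds": duration_seconds}
    return build_request("/audio/generate", payload, api_key)
```

For example, `speech_request(key, "Hello", "example-tts-model", "example-voice")` returns a request object that `urllib.request.urlopen` could send. Keeping the build and send steps separate makes the payloads easy to inspect before any call is billed.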

FAQ

What audio file formats are supported for transcription?

ImaRouter accepts MP3, MP4, MPEG, MPGA, M4A, WAV, and WEBM for speech-to-text. Pass the file as a URL in the request body. Maximum file size is 25 MB per request, so split larger files before submission.
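
One way to plan that split is to work out time ranges that should each land under the limit. The sketch below assumes a roughly constant bitrate, treats 25 MB as 25,000,000 bytes (the more conservative reading; the page does not say which definition applies), and leaves 5% headroom. It only computes the ranges; cutting the audio itself needs a separate audio tool and is not shown.

```python
import math
from typing import List, Tuple

MAX_BYTES = 25 * 1000 * 1000  # 25 MB limit from the FAQ, read conservatively


def plan_segments(size_bytes: int, duration_s: float,
                  limit: int = MAX_BYTES,
                  headroom: float = 0.95) -> List[Tuple[float, float]]:
    """Return (start, end) pairs in seconds, one per upload.

    Assumes file size grows linearly with duration (constant bitrate),
    which is only approximately true for variable-bitrate encodings.
    """
    if size_bytes <= limit:
        return [(0.0, duration_s)]
    # Number of equal-length parts needed to keep each under the padded limit.
    count = math.ceil(size_bytes / (limit * headroom))
    step = duration_s / count
    return [(i * step, min((i + 1) * step, duration_s)) for i in range(count)]
```

A 60,000,000-byte, one-hour recording would be planned as three 20-minute segments under these assumptions. Verify each output file's real size before upload, since variable-bitrate audio can exceed the estimate.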

Can I clone a voice for TTS?

Voice cloning is available through ElevenLabs models. Upload a reference audio sample through the dashboard to create a custom voice ID, then reference that ID in your TTS requests. Cloned voice IDs are scoped to your account.

How are STT and TTS priced differently?

Speech-to-text is priced per minute of audio input. Text-to-speech is priced per thousand characters of input text. Music generation is priced per generated second. All audio usage is consolidated on your single monthly invoice.
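
The three billing units combine into a simple estimate. In this sketch the rate values are placeholders supplied by the caller, not ImaRouter prices, and pro-rata billing of partial minutes is an assumption; check the dashboard for actual rates and rounding rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRates:
    """Per-unit prices in your billing currency (caller-supplied placeholders)."""
    stt_per_minute: float      # speech-to-text, per minute of input audio
    tts_per_1k_chars: float    # text-to-speech, per 1,000 input characters
    music_per_second: float    # music generation, per generated second


def estimate_cost(rates: AudioRates, stt_seconds: float = 0.0,
                  tts_chars: int = 0, music_seconds: float = 0.0) -> float:
    """Sum the three usage types, billing fractions of a unit pro rata."""
    stt = (stt_seconds / 60.0) * rates.stt_per_minute
    tts = (tts_chars / 1000.0) * rates.tts_per_1k_chars
    music = music_seconds * rates.music_per_second
    return stt + tts + music
```

Calling `estimate_cost(rates, stt_seconds=120, tts_chars=500, music_seconds=10)` adds two minutes of transcription, half a thousand characters of synthesis, and ten seconds of music at whatever rates you pass in.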

Can I use audio alongside video generation in the same pipeline?

Yes. Since all modalities share the same API key and billing layer, you can call video generation and audio synthesis in the same application without managing separate accounts. A common pattern is generating a video clip and a voice-over track in parallel, then combining them in post.
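
The parallel pattern mentioned above might look like the following sketch. `make_video` and `make_voiceover` are hypothetical caller-supplied functions that wrap the respective API calls; nothing here is an ImaRouter SDK function.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple, TypeVar

V = TypeVar("V")
A = TypeVar("A")


def generate_in_parallel(
    make_video: Callable[[str], V],
    make_voiceover: Callable[[str], A],
    script: str,
) -> Tuple[V, A]:
    """Run both generation calls concurrently and return (video, audio).

    Threads suit this case because each call spends its time waiting on
    the network. An exception in either call is re-raised by .result().
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        video_future = pool.submit(make_video, script)
        audio_future = pool.submit(make_voiceover, script)
        return video_future.result(), audio_future.result()
```

Both results come back together, so the combining step in post can start as soon as the slower of the two jobs finishes.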

Launch paths

Explore more routes from the same stack

Text-to-Image API

Generate stunning visuals through a unified API layer. Top models: GPT Image 2, Midjourney, and Nano Banana Pro, all via one clean interface.

Text-to-Video API

Ship cinematic clips from prompts with one contract. Top models: Seedance 2.5, Seedance 2.0 mini, Happy Horse 1.1, Wan 2.7 — routing across providers for latency and cost.

LLM Routing API

One key for every frontier and open-weight LLM. Route Claude Fable 5, Claude 4.8, Claude 4.7, GPT, DeepSeek, and Qwen with automatic fallback, cost caps, and unified billing, with zero per-provider wiring.

