Speech & Audio API
Transcribe speech, synthesize voice, and generate audio through one unified API. Access leading STT, TTS, and audio generation models — one key, unified billing, no per-provider wiring.


Speech, voice, and generated audio on one billing layer.
Transcriptions, voice synthesis, sound effects, and music generation all plug into the same API key and observability surface.
Task families
STT, TTS, and generative audio in a single modality layer.
Request modes
Short clips stay synchronous while long jobs use the same async flow as video.
Auth layer
Reuse the same key across image, video, routing, and audio workloads.
Overview
Why one API for speech and audio
Audio tasks span a wide surface: converting speech to text, generating natural-sounding voice from copy, producing sound effects or music beds, and processing audio for downstream analysis. Each category has its own leading models — and each model has its own SDK, key, and billing structure.
ImaRouter unifies all of them behind one endpoint. You pick the task type and model, we handle provider auth, format normalization, and delivery. The same API key that routes your video and image generation also handles your audio workload.
- One key for STT, TTS, and audio generation across all providers
- Unified request schema — no per-provider adapter or SDK
- Audio output delivered as URL or base64 depending on model
- Consistent error handling and retry across all audio providers
- All usage logged and billed on a single invoice
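Because output arrives as either a URL or a base64 string depending on the model, client code typically normalizes both into raw bytes. The following is a minimal sketch, assuming the response is a JSON object that carries either an `audio_url` or an `audio_base64` field. Both field names are hypothetical and are not taken from the ImaRouter reference.

```python
import base64
from typing import Callable
from urllib.request import urlopen


def _download(url: str) -> bytes:
    """Fetch the audio bytes from a delivery URL."""
    with urlopen(url) as resp:
        return resp.read()


def audio_bytes(response: dict, download: Callable[[str], bytes] = _download) -> bytes:
    """Return raw audio bytes from a response that uses either delivery form.

    `audio_base64` and `audio_url` are assumed field names for illustration.
    """
    inline = response.get("audio_base64")
    if inline is not None:
        # Inline delivery: decode the base64 string directly.
        return base64.b64decode(inline)
    url = response.get("audio_url")
    if url is not None:
        # URL delivery: fetch the file from the CDN.
        return download(url)
    raise ValueError("response carries neither inline audio nor a URL")
```

The `download` parameter is injectable so the function can be exercised without network access.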
Supported models and capabilities
ImaRouter's audio catalog covers three core task types. Each model has a distinct strength — picking the right model for the job reduces cost and improves output quality.
- Whisper large-v3 — Accurate multilingual speech-to-text, 99 languages, low word error rate
- GPT-4o Audio — Real-time speech understanding with contextual response generation
- ElevenLabs Turbo — Natural TTS with emotion control, low latency, voice cloning support
- OpenAI TTS-1 HD — High-fidelity voice synthesis, 6 built-in voices, OpenAI-native
- ElevenLabs Sound Effects — Generative audio from text descriptions, ideal for UX and games
- Suno v4 — Music generation from text prompts, full instrumental tracks
Capabilities
What you can build
At scale, the audio API supports product patterns that need high-throughput processing or real-time interaction. The integration patterns below are ones teams run in production.
- Meeting and call transcription: batch-process audio files with Whisper for CRM notes and summaries
- Voice interfaces: add speech input/output to web or mobile apps with sub-second TTS latency
- Content localization: synthesize narration in multiple languages from the same script
- Podcast and video production: generate voice-over tracks from script copy at scale
- Customer support automation: transcribe support calls and feed structured text to LLM pipelines
- Game and app sound design: generate custom sound effects from text descriptions on demand
How audio requests work
Speech-to-text and short TTS requests are synchronous — submit audio or text and receive the result directly in the response. For longer TTS or music generation tasks, ImaRouter uses the same async job pattern as video generation.
Submit a generation job and receive a job ID. Poll for completion or set a webhook callback. Audio output is delivered as an MP3 or WAV file via CDN URL, or as a base64 string for inline use.
- POST /v1/audio/transcriptions — synchronous STT, returns transcript text
- POST /v1/audio/speech — synchronous TTS for short clips, returns audio URL
- POST /v1/audio/generate/async — async for music and long-form synthesis
- Webhook: receive completed audio payload at your endpoint without polling
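The submit-then-poll flow described above can be sketched as a small loop. This page does not specify the status endpoint or the job payload, so the sketch takes the status lookup as a callable and assumes a hypothetical `status` field with the values `queued`, `processing`, `succeeded`, and `failed`.

```python
import time
from typing import Callable


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state, then return the final payload.

    The job shape ({"status": ..., "error": ...}) is an assumption for
    illustration, not the documented ImaRouter schema.
    """
    waited = 0.0
    while True:
        job = fetch_status(job_id)
        status = job.get("status")
        if status == "succeeded":
            return job
        if status == "failed":
            raise RuntimeError(f"job {job_id} failed: {job.get('error')}")
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} not finished after {timeout_s}s")
        sleep(interval_s)
        waited += interval_s
```

In practice `fetch_status` would wrap an authenticated GET against whatever job-status route the API exposes. A webhook callback removes the need for this loop entirely.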
Getting started
Send a POST request to the relevant audio endpoint with your input (audio file URL or text), model ID, and output preferences. Use your ImaRouter API key — the same one used for image and video generation.
For TTS, the request schema is compatible with the OpenAI audio API. Swap the base URL to immediately access ElevenLabs, GPT-4o Audio, and other models without changing your request code.
- STT: POST /v1/audio/transcriptions — params: file (URL), model, language
- TTS: POST /v1/audio/speech — params: input (text), model, voice, response_format
- Music: POST /v1/audio/generate — params: prompt, model, duration_seconds
- Base URL: https://api.imarouter.com/v1 — compatible with OpenAI audio client shape
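As a rough illustration of the three request shapes listed above, the sketch below builds (but does not send) each POST with the Python standard library. The endpoint paths and parameter names come from this page. The Bearer authorization header, the JSON body encoding, the model ID strings, and the `alloy` voice name are assumptions made for the example.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.imarouter.com/v1"


def build_request(path: str, params: dict, api_key: str) -> Request:
    """Build a JSON POST for an audio endpoint without sending it.

    Bearer auth and a JSON body are assumptions for illustration.
    """
    return Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(params).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Speech-to-text: the audio file is referenced by URL.
stt = build_request(
    "/audio/transcriptions",
    {"file": "https://example.com/call.mp3", "model": "whisper-large-v3", "language": "en"},
    api_key="YOUR_API_KEY",
)

# Text-to-speech: model and voice identifiers here are hypothetical.
tts = build_request(
    "/audio/speech",
    {"input": "Hello there.", "model": "tts-1-hd", "voice": "alloy", "response_format": "mp3"},
    api_key="YOUR_API_KEY",
)

# Music generation: duration is given in seconds.
music = build_request(
    "/audio/generate",
    {"prompt": "calm piano intro", "model": "suno-v4", "duration_seconds": 30},
    api_key="YOUR_API_KEY",
)

# To send one of these: urllib.request.urlopen(stt), then parse the JSON response.
```

Keeping request construction separate from sending makes the payloads easy to inspect and log before any usage is billed.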
FAQ
What audio file formats are supported for transcription?
ImaRouter accepts MP3, MP4, MPEG, MPGA, M4A, WAV, and WEBM for speech-to-text. Pass the file as a URL in the request body. The maximum file size is 25 MB per request, so split larger files before submission.
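A client-side pre-check can catch unsupported formats and oversized files before a request is made. The sketch below is one way to do that, assuming the format can be inferred from the URL's file extension (a heuristic, not a guarantee) and that the 25 MB limit means 25 × 1024 × 1024 bytes, which this page does not state.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
# Assumption: the limit is interpreted as 25 MiB.
MAX_BYTES = 25 * 1024 * 1024


def check_transcription_input(file_url: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks submittable."""
    problems = []
    # Take the extension from the URL path only, ignoring any query string.
    suffix = PurePosixPath(urlparse(file_url).path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    return problems
```

A file that fails the size check would need to be split into smaller segments before submission.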
Can I clone a voice for TTS?
Voice cloning is available through ElevenLabs models. Upload a reference audio sample through the dashboard to create a custom voice ID, then reference that ID in your TTS requests. Cloned voice IDs are scoped to your account.
How are STT and TTS priced differently?
Speech-to-text is priced per minute of audio input. Text-to-speech is priced per thousand characters of input text. Music generation is priced per generated second. All audio usage is consolidated on your single monthly invoice.
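The three billing units combine into a simple linear estimate. The sketch below assumes usage is pro-rated rather than rounded up to whole minutes or whole thousands of characters, which this page does not confirm. The rates are supplied by the caller; none of the numbers shown are real ImaRouter prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioRates:
    """Per-unit prices supplied by the caller; no values come from this page."""
    stt_per_minute: float
    tts_per_1k_chars: float
    music_per_second: float


def estimate_cost(
    rates: AudioRates,
    stt_minutes: float = 0.0,
    tts_chars: int = 0,
    music_seconds: float = 0.0,
) -> float:
    """Sum the three usage types, each billed in its own unit."""
    return (
        stt_minutes * rates.stt_per_minute
        + (tts_chars / 1000) * rates.tts_per_1k_chars
        + music_seconds * rates.music_per_second
    )


# Placeholder rates for illustration only, not real prices.
placeholder = AudioRates(stt_per_minute=0.5, tts_per_1k_chars=2.0, music_per_second=0.25)
total = estimate_cost(placeholder, stt_minutes=10, tts_chars=1500, music_seconds=4)
```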
Can I use audio alongside video generation in the same pipeline?
Yes. Since all modalities share the same API key and billing layer, you can call video generation and audio synthesis in the same application without managing separate accounts. A common pattern is generating a video clip and a voice-over track in parallel, then combining them in post.
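The parallel pattern can be sketched with a thread pool. The two generation functions are passed in as callables because their exact request shapes are outside the scope of this answer; `make_video` and `make_voiceover` are hypothetical names for wrappers around the video and TTS calls.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def generate_clip_and_voiceover(
    make_video: Callable[[str], str],
    make_voiceover: Callable[[str], str],
    video_prompt: str,
    script: str,
) -> tuple[str, str]:
    """Run both generation calls concurrently and return (video_url, audio_url)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        video_future = pool.submit(make_video, video_prompt)
        audio_future = pool.submit(make_voiceover, script)
        # .result() blocks until done and re-raises any worker exception.
        return video_future.result(), audio_future.result()
```

Threads suit this case because both calls spend their time waiting on network I/O. The returned URLs can then be handed to whatever tool combines the tracks in post.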
Launch paths
Related links and launch paths
More modalities
Explore more routes from the same stack

Text-to-Image API
Generate stunning visuals through a unified API layer. Top models: GPT Image 2, Midjourney, Nano Banana Pro, all via one clean interface.
Text-to-Video API
Ship cinematic clips from prompts with one contract. Top models: Seedance 2.0, Happy Horse, Wan 2.7 — routing across providers for latency and cost.
LLM Routing API
One key for every frontier and open-weight LLM. Route Claude, GPT, DeepSeek, and Qwen with automatic fallback, cost caps, and unified billing, with zero per-provider wiring.