Video Generation API is now live!

Models

Explore the active model market,from a local OpenRouter snapshot.

This page reads from a local JSON snapshot synced from OpenRouter, so the catalog stays fast, indexable, and stable. Use it to browse current model coverage by provider, modality, reasoning support, context window, and pricing metadata.

Reset

Results

Showing 48 of 226 matching models

Snapshot source: OpenRouter. Synced April 21, 2026 at 8:00 AM. Page 3 of 5.

This route is built from local JSON so the catalog stays stable for browsing and SEO. If you need a specific model on ImaRouter, treat this page as a discovery reference and then contact the team for availability.

TextReasoning

NovitaAI

Baidu: ERNIE 4.5 VL 424B A47B

ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE 4.5 series, featuring 424B total parameters with 47B active per token. It is trained jointly on text and image data using a heterogeneous MoE architecture and modality-isolated routing to enable high-fidelity cross-modal reasoning, image understanding, and long-context generation (up to 131k tokens). Fine-tuned with techniques like SFT, DPO, UPO, and RLVR, this model supports both “thinking” and non-thinking inference modes. Designed for vision-language tasks in English and Chinese, it is optimized for efficient scaling and can operate under 4-bit/8-bit quantization.

TextImage

Context

123K

Group

Other

Pricing preview

Input Price: $0.42 /M tokens

Output Price: $1.25 /M tokens

Slug

baidu/ernie-4.5-vl-424b-a47b

Text

DeepInfra

Mistral: Mistral Small 3.2 24B

Mistral-Small-3.2-24B-Instruct-2506 is an updated 24B parameter model from Mistral optimized for instruction following, repetition reduction, and improved function calling. Compared to the 3.1 release, version 3.2 significantly improves accuracy on WildBench and Arena Hard, reduces infinite generations, and delivers gains in tool use and structured output tasks. It supports image and text inputs with structured outputs, function/tool calling, and strong performance across coding (HumanEval+, MBPP), STEM (MMLU, MATH, GPQA), and vision benchmarks (ChartQA, DocVQA).

TextImage

Context

128K

Group

Mistral

Pricing preview

Input Price: $0.075 /M tokens

Output Price: $0.2 /M tokens

Slug

mistralai/mistral-small-3.2-24b-instruct

TextReasoning

Google Vertex (Global)

Google: Gemini 2.5 Flash

Gemini 2.5 Flash is Google's state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks. It includes built-in "thinking" capabilities, enabling it to provide responses with greater accuracy and nuanced context handling. Additionally, Gemini 2.5 Flash is configurable through the "max tokens for reasoning" parameter, as described in the documentation (https://openrouter.ai/docs/use-cases/reasoning-tokens#max-tokens-for-reasoning).

TextFileImageAudioVideo

Context

1M

Group

Gemini

Pricing preview

Input Price: $0.3 /M tokens

Output Price: $2.5 /M tokens

Slug

google/gemini-2.5-flash

TextReasoning

Google Vertex

Google: Gemini 2.5 Pro Preview 06-05

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.

TextFileImageAudio

Context

1M

Group

Gemini

Pricing preview

Input Price: $1.25 /M tokens

Output Price: $10 /M tokens

Slug

google/gemini-2.5-pro-preview

Text

Mistral

Mistral: Mistral Medium 3

Mistral Medium 3 is a high-performance enterprise-grade language model designed to deliver frontier-level capabilities at significantly reduced operational cost. It balances state-of-the-art reasoning and multimodal performance with 8× lower cost compared to traditional large models, making it suitable for scalable deployments across professional and industrial use cases. The model excels in domains such as coding, STEM reasoning, and enterprise adaptation. It supports hybrid, on-prem, and in-VPC deployments and is optimized for integration into custom workflows. Mistral Medium 3 offers competitive accuracy relative to larger models like Claude Sonnet 3.5/3.7, Llama 4 Maverick, and Command R+, while maintaining broad compatibility across cloud environments.

TextImage

Context

131.1K

Group

Mistral

Pricing preview

Input Price: $0.4 /M tokens

Output Price: $2 /M tokens

Slug

mistralai/mistral-medium-3

TextReasoning

Google Vertex

Google: Gemini 2.5 Pro Preview 05-06

Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.

TextImageFileAudioVideo

Context

1M

Group

Gemini

Pricing preview

Input Price: $1.25 /M tokens

Output Price: $10 /M tokens

Slug

google/gemini-2.5-pro-preview-05-06

Text

Together

Arcee AI: Spotlight

Spotlight is a 7‑billion‑parameter vision‑language model derived from Qwen 2.5‑VL and fine‑tuned by Arcee AI for tight image‑text grounding tasks. It offers a 32 k‑token context window, enabling rich multimodal conversations that combine lengthy documents with one or more images. Training emphasized fast inference on consumer GPUs while retaining strong captioning, visual‐question‑answering, and diagram‑analysis accuracy. As a result, Spotlight slots neatly into agent workflows where screenshots, charts or UI mock‑ups need to be interpreted on the fly. Early benchmarks show it matching or out‑scoring larger VLMs such as LLaVA‑1.6 13 B on popular VQA and POPE alignment tests.

TextImage

Context

131.1K

Group

Other

Pricing preview

Input Price: $0.18 /M tokens

Output Price: $0.18 /M tokens

Slug

arcee-ai/spotlight

Text

Unknown provider

OpenGVLab: InternVL3 14B

The 14b version of the InternVL3 series. An advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more.

TextImage

Context

32K

Group

Other

Pricing preview

No display pricing published in the current snapshot.

Slug

opengvlab/internvl3-14b

Text

Unknown provider

OpenGVLab: InternVL3 2B

The 2b version of the InternVL3 series, for an even higher inference speed and very reasonable performance. An advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more.

TextImage

Context

32K

Group

Other

Pricing preview

No display pricing published in the current snapshot.

Slug

opengvlab/internvl3-2b

Text

DeepInfra

Meta: Llama Guard 4 12B

Llama Guard 4 is a Llama 4 Scout-derived multimodal pretrained model, fine-tuned for content safety classification. Similar to previous versions, it can be used to classify content in both LLM inputs (prompt classification) and in LLM responses (response classification). It acts as an LLM—generating text in its output that indicates whether a given prompt or response is safe or unsafe, and if unsafe, it also lists the content categories violated. Llama Guard 4 was aligned to safeguard against the standardized MLCommons hazards taxonomy and designed to support multimodal Llama 4 capabilities. Specifically, it combines features from previous Llama Guard models, providing content moderation for English and multiple supported languages, along with enhanced capabilities to handle mixed text-and-image prompts, including multiple images. Additionally, Llama Guard 4 is integrated into the Llama Moderations API, extending robust safety classification to text and images.

TextImage

Context

163.8K

Group

Other

Pricing preview

Input Price: $0.18 /M tokens

Output Price: $0.18 /M tokens

Slug

meta-llama/llama-guard-4-12b

TextReasoning

Unknown provider

MoonshotAI: Kimi VL A3B Thinking

Kimi-VL is a lightweight Mixture-of-Experts vision-language model that activates only 2.8B parameters per step while delivering strong performance on multimodal reasoning and long-context tasks. The Kimi-VL-A3B-Thinking variant, fine-tuned with chain-of-thought and reinforcement learning, excels in math and visual reasoning benchmarks like MathVision, MMMU, and MathVista, rivaling much larger models such as Qwen2.5-VL-7B and Gemma-3-12B. It supports 128K context and high-resolution input via its MoonViT encoder.

TextImage

Context

131.1K

Group

Other

Pricing preview

No display pricing published in the current snapshot.

Slug

moonshotai/kimi-vl-a3b-thinking

Text

DeepInfra

Meta: Llama 4 Maverick

Llama 4 Maverick 17B Instruct (128E) is a high-capacity multimodal language model from Meta, built on a mixture-of-experts (MoE) architecture with 128 experts and 17 billion active parameters per forward pass (400B total). It supports multilingual text and image input, and produces multilingual text and code output across 12 supported languages. Optimized for vision-language tasks, Maverick is instruction-tuned for assistant-like behavior, image reasoning, and general-purpose multimodal interaction. Maverick features early fusion for native multimodality and a 1 million token context window. It was trained on a curated mixture of public, licensed, and Meta-platform data, covering ~22 trillion tokens, with a knowledge cutoff in August 2024. Released on April 5, 2025 under the Llama 4 Community License, Maverick is suited for research and commercial applications requiring advanced multimodal understanding and high model throughput.

TextImage

Context

1M

Group

Llama4

Pricing preview

Input Price: $0.15 /M tokens

Output Price: $0.6 /M tokens

Slug

meta-llama/llama-4-maverick

Text

DeepInfra

Meta: Llama 4 Scout

Llama 4 Scout 17B Instruct (16E) is a mixture-of-experts (MoE) language model developed by Meta, activating 17 billion parameters out of a total of 109B. It supports native multimodal input (text and image) and multilingual output (text and code) across 12 supported languages. Designed for assistant-style interaction and visual reasoning, Scout uses 16 experts per forward pass and features a context length of 10 million tokens, with a training corpus of ~40 trillion tokens. Built for high efficiency and local or commercial deployment, Llama 4 Scout incorporates early fusion for seamless modality integration. It is instruction-tuned for use in multilingual chat, captioning, and image understanding tasks. Released under the Llama 4 Community License, it was last trained on data up to August 2024 and launched publicly on April 5, 2025.

TextImage

Context

327.7K

Group

Llama4

Pricing preview

Input Price: $0.08 /M tokens

Output Price: $0.3 /M tokens

Slug

meta-llama/llama-4-scout

Text

Unknown provider

Bytedance: UI-TARS 72B

UI-TARS 72B is an open-source multimodal AI model designed specifically for automating browser and desktop tasks through visual interaction and control. The model is built with a specialized vision architecture enabling accurate interpretation and manipulation of on-screen visual data. It supports automation tasks within web browsers as well as desktop applications, including Microsoft Office and VS Code. Core capabilities include intelligent screen detection, predictive action modeling, and efficient handling of repetitive interactions. UI-TARS employs supervised fine-tuning (SFT) tailored explicitly for computer control scenarios. It can be deployed locally or accessed via Hugging Face for demonstration purposes. Intended use cases encompass workflow automation, task scripting, and interactive desktop control applications.

TextImage

Context

32.8K

Group

Other

Pricing preview

No display pricing published in the current snapshot.

Slug

bytedance-research/ui-tars-72b

Text

Unknown provider

Qwen: Qwen2.5 VL 3B Instruct

Qwen2.5 VL 3B is a multimodal LLM from the Qwen Team with the following key enhancements: - SoTA understanding of images of various resolution & ratio: Qwen2.5-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. - Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2.5-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. - Multilingual Support: to serve global users, besides English and Chinese, Qwen2.5-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

TextImage

Context

64K

Group

Qwen

Pricing preview

No display pricing published in the current snapshot.

Slug

qwen/qwen2.5-vl-3b-instruct

Text

Unknown provider

Google: Gemini 2.5 Pro Experimental

This model has been deprecated by Google in favor of the (paid Preview model)[google/gemini-2.5-pro-preview] Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.

TextImageFile

Context

1M

Group

Gemini

Pricing preview

No display pricing published in the current snapshot.

Slug

google/gemini-2.5-pro-exp-03-25

Text

Unknown provider

Qwen: Qwen2.5 VL 32B Instruct

Qwen2.5-VL-32B is a multimodal vision-language model fine-tuned through reinforcement learning for enhanced mathematical reasoning, structured outputs, and visual problem-solving capabilities. It excels at visual analysis tasks, including object recognition, textual interpretation within images, and precise event localization in extended videos. Qwen2.5-VL-32B demonstrates state-of-the-art performance across multimodal benchmarks such as MMMU, MathVista, and VideoMME, while maintaining strong reasoning and clarity in text-based tasks like MMLU, mathematical problem-solving, and code generation.

TextImage

Context

32.8K

Group

Qwen

Pricing preview

No display pricing published in the current snapshot.

Slug

qwen/qwen2.5-vl-32b-instruct

Text

Cloudflare

Mistral: Mistral Small 3.1 24B

Mistral Small 3.1 24B Instruct is an upgraded variant of Mistral Small 3 (2501), featuring 24 billion parameters with advanced multimodal capabilities. It provides state-of-the-art performance in text-based reasoning and vision tasks, including image analysis, programming, mathematical reasoning, and multilingual support across dozens of languages. Equipped with an extensive 128k token context window and optimized for efficient local inference, it supports use cases such as conversational agents, function calling, long-document comprehension, and privacy-sensitive deployments. The updated version is [Mistral Small 3.2](mistralai/mistral-small-3.2-24b-instruct)

TextImage

Context

128K

Group

Mistral

Pricing preview

Input Price: $0.35 /M tokens

Output Price: $0.56 /M tokens

Slug

mistralai/mistral-small-3.1-24b-instruct

Text

Unknown provider

Google: Gemma 3 1B

Gemma 3 1B is the smallest of the new Gemma 3 family. It handles context windows up to 32k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Note: Gemma 3 1B is not multimodal. For the smallest multimodal Gemma 3 model, please see [Gemma 3 4B](google/gemma-3-4b-it)

TextImage

Context

32K

Group

Gemini

Pricing preview

No display pricing published in the current snapshot.

Slug

google/gemma-3-1b-it

Text

Google AI Studio

Google: Gemma 3 4B (free)

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.

TextImage

Context

32.8K

Group

Gemini

Pricing preview

Input Price: $0 /M tokens

Output Price: $0 /M tokens

Slug

google/gemma-3-4b-it

Text

DeepInfra

Google: Gemma 3 4B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling.

TextImage

Context

131.1K

Group

Gemini

Pricing preview

Input Price: $0.04 /M tokens

Output Price: $0.08 /M tokens

Slug

google/gemma-3-4b-it

Text

Google AI Studio

Google: Gemma 3 12B (free)

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is the second largest in the family of Gemma 3 models after [Gemma 3 27B](google/gemma-3-27b-it)

TextImage

Context

32.8K

Group

Gemini

Pricing preview

Input Price: $0 /M tokens

Output Price: $0 /M tokens

Slug

google/gemma-3-12b-it

Text

DeepInfra

Google: Gemma 3 12B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 12B is the second largest in the family of Gemma 3 models after [Gemma 3 27B](google/gemma-3-27b-it)

TextImage

Context

131.1K

Group

Gemini

Pricing preview

Input Price: $0.04 /M tokens

Output Price: $0.13 /M tokens

Slug

google/gemma-3-12b-it

Text

Google AI Studio

Google: Gemma 3 27B (free)

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open source model, successor to [Gemma 2](google/gemma-2-27b-it)

TextImage

Context

131.1K

Group

Gemini

Pricing preview

Input Price: $0 /M tokens

Output Price: $0 /M tokens

Slug

google/gemma-3-27b-it

Text

DeepInfra

Google: Gemma 3 27B

Gemma 3 introduces multimodality, supporting vision-language input and text outputs. It handles context windows up to 128k tokens, understands over 140 languages, and offers improved math, reasoning, and chat capabilities, including structured outputs and function calling. Gemma 3 27B is Google's latest open source model, successor to [Gemma 2](google/gemma-2-27b-it)

TextImage

Context

131.1K

Group

Gemini

Pricing preview

Input Price: $0.08 /M tokens

Output Price: $0.16 /M tokens

Slug

google/gemma-3-27b-it

Text

Unknown provider

Microsoft: Phi 4 Multimodal Instruct

Phi-4 Multimodal Instruct is a versatile 5.6B parameter foundation model that combines advanced reasoning and instruction-following capabilities across both text and visual inputs, providing accurate text outputs. The unified architecture enables efficient, low-latency inference, suitable for edge and mobile deployments. Phi-4 Multimodal Instruct supports text inputs in multiple languages including Arabic, Chinese, English, French, German, Japanese, Spanish, and more, with visual input optimized primarily for English. It delivers impressive performance on multimodal tasks involving mathematical, scientific, and document reasoning, providing developers and enterprises a powerful yet compact model for sophisticated interactive applications. For more information, see the [Phi-4 Multimodal blog post](https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/).

TextImage

Context

131.1K

Group

Other

Pricing preview

No display pricing published in the current snapshot.

Slug

microsoft/phi-4-multimodal-instruct

Text

Google Vertex

Google: Gemini 2.0 Flash Lite

Gemini 2.0 Flash Lite offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5), all at extremely economical token prices.

TextImageFileAudioVideo

Context

1M

Group

Gemini

Pricing preview

Input Price: $0.075 /M tokens

Output Price: $0.3 /M tokens

Slug

google/gemini-2.0-flash-lite-001

Text

Google Vertex

Google: Gemini 2.0 Flash

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.

TextImageFileAudioVideo

Context

1M

Group

Gemini

Pricing preview

Input Price: $0.1 /M tokens

Output Price: $0.4 /M tokens

Slug

google/gemini-2.0-flash-001

Text

Alibaba Cloud Int.

Qwen: Qwen VL Plus

Qwen's Enhanced Large Visual Language Model. Significantly upgraded for detailed recognition capabilities and text recognition abilities, supporting ultra-high pixel resolutions up to millions of pixels and extreme aspect ratios for image input. It delivers significant performance across a broad range of visual tasks.

TextImage

Context

131.1K

Group

Qwen

Pricing preview

Input Price: $0.1365 /M tokens

Output Price: $0.4095 /M tokens

Slug

qwen/qwen-vl-plus

Text

Alibaba Cloud Int.

Qwen: Qwen VL Max

Qwen VL Max is a visual understanding model with 7500 tokens context length. It excels in delivering optimal performance for a broader spectrum of complex tasks.

TextImage

Context

131.1K

Group

Qwen

Pricing preview

Input Price: $0.52 /M tokens

Output Price: $2.08 /M tokens

Slug

qwen/qwen-vl-max

Text

Nebius Token Factory

Qwen: Qwen2.5 VL 72B Instruct

Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.

TextImage

Context

32K

Group

Qwen

Pricing preview

Input Price: $0.25 /M tokens

Output Price: $0.75 /M tokens

Slug

qwen/qwen2.5-vl-72b-instruct

Text

MiniMax

MiniMax: MiniMax-01

MiniMax-01 is a combines MiniMax-Text-01 for text generation and MiniMax-VL-01 for image understanding. It has 456 billion parameters, with 45.9 billion parameters activated per inference, and can handle a context of up to 4 million tokens. The text model adopts a hybrid architecture that combines Lightning Attention, Softmax Attention, and Mixture-of-Experts (MoE). The image model adopts the “ViT-MLP-LLM” framework and is trained on top of the text model. To read more about the release, see: https://www.minimaxi.com/en/news/minimax-01-series-2

TextImage

Context

1M

Group

Other

Pricing preview

Input Price: $0.2 /M tokens

Output Price: $1.1 /M tokens

Slug

minimax/minimax-01

Text

Unknown provider

xAI: Grok 2 Vision 1212

Grok 2 Vision 1212 advances image-based AI with stronger visual comprehension, refined instruction-following, and multilingual support. From object recognition to style analysis, it empowers developers to build more intuitive, visually aware applications. Its enhanced steerability and reasoning establish a robust foundation for next-generation image solutions. To read more about this model, check out [xAI's announcement](https://x.ai/blog/grok-1212).

TextImage

Context

32.8K

Group

Grok

Pricing preview

No display pricing published in the current snapshot.

Slug

x-ai/grok-2-vision-1212

Text

Unknown provider

Google: Gemini 2.0 Flash Experimental

Gemini Flash 2.0 offers a significantly faster time to first token (TTFT) compared to [Gemini Flash 1.5](/google/gemini-flash-1.5), while maintaining quality on par with larger models like [Gemini Pro 1.5](/google/gemini-pro-1.5). It introduces notable enhancements in multimodal understanding, coding capabilities, complex instruction following, and function calling. These advancements come together to deliver more seamless and robust agentic experiences.

TextImage

Context

1M

Group

Gemini

Pricing preview

No display pricing published in the current snapshot.

Slug

google/gemini-2.0-flash-exp

Text

Amazon Bedrock

Amazon: Nova Lite 1.0

Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon that focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy. With an input context of 300K tokens, it can analyze multiple images or up to 30 minutes of video in a single input.

TextImage

Context

300K

Group

Nova

Pricing preview

Input Price: $0.06 /M tokens

Output Price: $0.24 /M tokens

Slug

amazon/nova-lite-v1

Text

Amazon Bedrock

Amazon: Nova Pro 1.0

Amazon Nova Pro 1.0 is a capable multimodal model from Amazon focused on providing a combination of accuracy, speed, and cost for a wide range of tasks. As of December 2024, it achieves state-of-the-art performance on key benchmarks including visual question answering (TextVQA) and video understanding (VATEX). Amazon Nova Pro demonstrates strong capabilities in processing both visual and textual information and at analyzing financial documents. **NOTE**: Video input is not supported at this time.

TextImage

Context

300K

Group

Nova

Pricing preview

Input Price: $0.8 /M tokens

Output Price: $3.2 /M tokens

Slug

amazon/nova-pro-v1

Text

Unknown provider

Google: Gemini Experimental 1121

Experimental release (November 21st, 2024) of Gemini.

TextImage

Context

41K

Group

Gemini

Pricing preview

No display pricing published in the current snapshot.

Slug

google/gemini-exp-1121

Text

Mistral

Mistral: Pixtral Large 2411

Pixtral Large is a 124B parameter, open-weight, multimodal model built on top of [Mistral Large 2](/mistralai/mistral-large-2411). The model is able to understand documents, charts and natural images. The model is available under the Mistral Research License (MRL) for research and educational use, and the Mistral Commercial License for experimentation, testing, and production for commercial purposes.

TextImage

Context

131.1K

Group

Mistral

Pricing preview

Input Price: $2 /M tokens

Output Price: $6 /M tokens

Slug

mistralai/pixtral-large-2411

Text

Unknown provider

xAI: Grok Vision Beta

Grok Vision Beta is xAI's experimental language model with vision capability.

TextImage

Context

8.2K

Group

Grok

Pricing preview

No display pricing published in the current snapshot.

Slug

x-ai/grok-vision-beta

Text

Unknown provider

Google: Gemini Experimental 1114

Gemini 11-14 (2024) experimental model features "quality" improvements.

TextImage

Context

41K

Group

Gemini

Pricing preview

No display pricing published in the current snapshot.

Slug

google/gemini-exp-1114

Text

Amazon Bedrock (US-WEST)

Anthropic: Claude 3.5 Haiku

Claude 3.5 Haiku features offers enhanced capabilities in speed, coding accuracy, and tool use. Engineered to excel in real-time applications, it delivers quick response times that are essential for dynamic tasks such as chat interactions and immediate coding suggestions. This makes it highly suitable for environments that demand both speed and precision, such as software development, customer service bots, and data management systems. This model is currently pointing to [Claude 3.5 Haiku (2024-10-22)](/anthropic/claude-3-5-haiku-20241022).

TextImage

Context

200K

Group

Claude

Pricing preview

Input Price: $0.8 /M tokens

Output Price: $4 /M tokens

Slug

anthropic/claude-3.5-haiku

Text

Unknown provider

Anthropic: Claude 3.5 Haiku (2024-10-22)

Claude 3.5 Haiku features enhancements across all skill sets including coding, tool use, and reasoning. As the fastest model in the Anthropic lineup, it offers rapid response times suitable for applications that require high interactivity and low latency, such as user-facing chatbots and on-the-fly code completions. It also excels in specialized tasks like data extraction and real-time content moderation, making it a versatile tool for a broad range of industries. It does not support image inputs. See the launch announcement and benchmark results [here](https://www.anthropic.com/news/3-5-models-and-computer-use)

TextImageFile

Context

200K

Group

Claude

Pricing preview

No display pricing published in the current snapshot.

Slug

anthropic/claude-3.5-haiku-20241022

Text

Unknown provider

Google: Gemini 1.5 Flash 8B

Gemini Flash 1.5 8B is optimized for speed and efficiency, offering enhanced performance in small prompt tasks like chat, transcription, and translation. With reduced latency, it is highly effective for real-time and large-scale operations. This model focuses on cost-effective solutions while maintaining high-quality results. [Click here to learn more about this model](https://developers.googleblog.com/en/gemini-15-flash-8b-is-now-generally-available-for-use/). Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms).

TextImage

Context

1M

Group

Gemini

Pricing preview

No display pricing published in the current snapshot.

Slug

google/gemini-flash-1.5-8b

Text

Unknown provider

Meta: Llama 3.2 90B Vision Instruct

The Llama 90B Vision model is a top-tier, 90-billion-parameter multimodal model designed for the most challenging visual reasoning and language tasks. It offers unparalleled accuracy in image captioning, visual question answering, and advanced image-text comprehension. Pre-trained on vast multimodal datasets and fine-tuned with human feedback, the Llama 90B Vision is engineered to handle the most demanding image-based AI tasks. This model is perfect for industries requiring cutting-edge multimodal AI capabilities, particularly those dealing with complex, real-time visual and textual analysis. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

TextImage

Context

131.1K

Group

Llama3

Pricing preview

No display pricing published in the current snapshot.

Slug

meta-llama/llama-3.2-90b-vision-instruct

Text

DeepInfra

Meta: Llama 3.2 11B Vision Instruct

Llama 3.2 11B Vision is a multimodal model with 11 billion parameters, designed to handle tasks combining visual and textual data. It excels in tasks such as image captioning and visual question answering, bridging the gap between language generation and visual reasoning. Pre-trained on a massive dataset of image-text pairs, it performs well in complex, high-accuracy image analysis. Its ability to integrate visual understanding with language processing makes it an ideal solution for industries requiring comprehensive visual-linguistic AI applications, such as content creation, AI-driven customer service, and research. Click here for the [original model card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD_VISION.md). Usage of this model is subject to [Meta's Acceptable Use Policy](https://www.llama.com/llama3/use-policy/).

TextImage

Context

131.1K

Group

Llama3

Pricing preview

Input Price: $0.245 /M tokens

Output Price: $0.245 /M tokens

Slug

meta-llama/llama-3.2-11b-vision-instruct

Text

Unknown provider

Mistral: Pixtral 12B

The first multi-modal, text+image-to-text model from Mistral AI. Its weights were launched via torrent: https://x.com/mistralai/status/1833758285167722836.

TextImage

Context

4.1K

Group

Mistral

Pricing preview

No display pricing published in the current snapshot.

Slug

mistralai/pixtral-12b

Text

Unknown provider

Qwen: Qwen2.5-VL 7B Instruct

Qwen2.5 VL 7B is a multimodal LLM from the Qwen Team with the following key enhancements: - SoTA understanding of images of various resolution & ratio: Qwen2.5-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. - Understanding videos of 20min+: Qwen2.5-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc. - Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2.5-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions. - Multilingual Support: to serve global users, besides English and Chinese, Qwen2.5-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc. For more details, see this [blog post](https://qwenlm.github.io/blog/qwen2-vl/) and [GitHub repo](https://github.com/QwenLM/Qwen2-VL). Usage of this model is subject to [Tongyi Qianwen LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen1.5-110B-Chat/blob/main/LICENSE).

TextImage

Context

32.8K

Group

Qwen

Pricing preview

No display pricing published in the current snapshot.

Slug

qwen/qwen-2.5-vl-7b-instruct

Text

Unknown provider

Google: Gemini 1.5 Flash Experimental

Gemini 1.5 Flash Experimental is an experimental version of the [Gemini 1.5 Flash](/models/google/gemini-flash-1.5) model. Usage of Gemini is subject to Google's [Gemini Terms of Use](https://ai.google.dev/terms). #multimodal Note: This model is experimental and not suited for production use-cases. It may be removed or redirected to another model in the future.

TextImage

Context

1M

Group

Gemini

Pricing preview

No display pricing published in the current snapshot.

Slug

google/gemini-flash-1.5-exp

Page 3 of 5

Need a model request?

Use the market snapshot for discovery, then ask ImaRouter for rollout.

If a model matters for your product, send the slug, expected traffic, target region, and latency expectations. The team can confirm support status, onboarding priority, or a migration path to an equivalent route on ImaRouter.

Contact

support@imarouter.com

Best for model availability questions, onboarding priority, routing strategy, and enterprise rollout planning.

Models | ImaRouter