Xiaomi releases MiMo-V2.5 and MiMo-V2.5-Pro open-source AI models with MoE architecture

Xiaomi has officially open-sourced the MiMo-V2.5 model series under the MIT License. The release enables commercial use, continued training, and fine-tuning without additional authorization.

The series includes two models designed for different workloads:

MiMo-V2.5-Pro: Focused on complex agentic tasks, coding, and long-horizon reasoning
MiMo-V2.5: A native omnimodal model built for multimodal understanding and agent capabilities

Both models support a 1M-token context window and are based on a Sparse Mixture-of-Experts (MoE) architecture with a hybrid attention design.

MiMo-V2.5

MiMo-V2.5 is a native omnimodal model designed to process text, image, video, and audio within a unified system. It targets multimodal reasoning, perception, and agent-style workflows.

The model has 310B total parameters with 15B active parameters and is trained on approximately 48 trillion tokens. It is built on the MiMo-V2-Flash hybrid sliding-window attention backbone and integrates dedicated vision and audio encoders through lightweight projection layers.

Core Capabilities

MiMo-V2.5 supports unified multimodal understanding across multiple input types:

Text, image, video, and audio processing in a single model
Long-context reasoning up to 1M tokens
Multimodal agent execution and tool interaction
Unified perception-to-action workflow

Architecture Overview

The model integrates multiple components into a single architecture:

Hybrid attention combining Sliding Window Attention and Global Attention
729M-parameter Vision Transformer with hybrid window design
Audio encoder derived from MiMo-Audio tokenizer weights
Multi-Token Prediction (MTP) modules for improved inference efficiency

Training Pipeline

MiMo-V2.5 is trained through five structured stages:

Large-scale text pre-training for foundational language capability
Projector warmup to align vision and audio encoders
Multimodal pre-training on cross-modal datasets
Supervised fine-tuning with agentic training and progressive context scaling (32K → 256K → 1M)
Reinforcement learning and Multi-Teacher On-Policy Distillation (MOPD)

MiMo-V2.5-Pro

MiMo-V2.5-Pro is the most capable model in the series, designed for advanced coding, software engineering, and long-horizon autonomous agent tasks.

It is a 1.02T-parameter Mixture-of-Experts model with 42B active parameters, supporting a 1M-token context window. It is built using a hybrid attention architecture optimized for long-context efficiency and stability.

Architecture Design

Hybrid attention with 6:1 Sliding Window to Global Attention ratio
Approximately 7× KV-cache reduction
Native Multi-Token Prediction (3-layer MTP)
Support for 1M-token context processing

Training Strategy

The model is trained through a multi-stage pipeline:

Pre-training on 27T tokens using FP8 mixed precision
Supervised Fine-Tuning (SFT) for instruction learning
Domain-specific reinforcement learning (coding, math, safety, agent tasks)
Multi-Teacher On-Policy Distillation (MOPD)
Progressive context extension up to 1M tokens

Agentic and Coding Performance

MiMo-V2.5-Pro is optimized for long-running autonomous workflows and structured reasoning tasks. It is capable of:

Sustaining 1000+ tool-call sequences in a single task
Maintaining coherence across extended reasoning chains
Handling complex software engineering workflows
Executing multi-step agent operations

It has been evaluated on large-scale tasks including compiler construction, full application generation, and circuit design optimization.

Model Specifications

Both MiMo-V2.5 and MiMo-V2.5-Pro are built on Sparse Mixture-of-Experts architecture with hybrid attention and long-context support.

Model	Total Params	Active Params	Context Length	Precision
MiMo-V2.5-Base	310B	15B	256K	FP8 (E4M3) Mixed
MiMo-V2.5	310B	15B	1M	FP8 (E4M3) Mixed
MiMo-V2.5-Pro-Base	1.02T	42B	256K	FP8 (E4M3) Mixed
MiMo-V2.5-Pro	1.02T	42B	1M	FP8 (E4M3) Mixed

Token Plan Update

The token system has been simplified across the model series:

MiMo-V2.5 — 1x (1 token = 1 credit)
MiMo-V2.5-Pro — 2x (1 token = 2 credits)

Availability

The MiMo-V2.5 series is fully open-sourced under the MIT License, allowing commercial use, continued training, and fine-tuning without additional authorization.

Both models are available for download and access via Hugging Face, along with model weights, tokenizer, and full model cards.

Deployment Support

The models are compatible with modern inference frameworks:

SGLang (recommended for long-context optimization)
vLLM (community-supported deployment)

Both support efficient inference for workloads up to 1M tokens.