Run Xiaomi MiMo-V2-Flash Locally — Full Installation & Setup Guide

By LightNode

Xiaomi MiMo-V2-Flash is one of the most impressive open-source Mixture-of-Experts models currently available.
Many developers want to run it locally for privacy, full feature access, offline capability, and freedom from API rate limits, but because the model is quite large, setting it up properly can feel overwhelming.

This guide walks you through everything needed to download, install, configure, and run MiMo-V2-Flash locally, with different deployment approaches depending on your hardware and technical preferences. The tutorial is written for practical use — real commands, real steps, fewer buzzwords.

What Makes MiMo-V2-Flash Special?

MiMo-V2-Flash follows an advanced Mixture-of-Experts (MoE) design: only part of the full parameter space is activated per inference, which means you get strong reasoning performance with much better efficiency than traditional giant dense models.
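
To make the idea concrete, here is a minimal, illustrative sketch of top-k expert routing in plain PyTorch. It is not MiMo's actual implementation (the real router, expert count, and top-k value are internal to the model); it only shows why an MoE layer touches a fraction of its parameters per token.

import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE layer: each token is routed to only k of the experts."""
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                           # x: (tokens, dim)
        scores = self.router(x)                     # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                 # only k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out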

Key highlights:

  • High-speed inference thanks to Flash-optimized architecture
  • Designed for real tasks like reasoning, coding, agent workflows
  • Open-source model weights available
  • Supports long context scenarios
  • Suitable for local research, development, and production workloads

System Requirements

Running a large MoE model is not “lightweight”. Make sure your machine is ready.

Minimum Hardware

Component   Requirement
GPU         12GB VRAM or higher
RAM         32GB
Storage     100GB+
CPU         Multi-core modern processor

Recommended Hardware

Component   Recommended
GPU         24GB+ VRAM
RAM         64GB
Storage     200GB NVMe SSD
Multi-GPU   Supported and useful

💡 Tip: If your GPU memory is limited, you can use 8-bit or 4-bit loading to reduce VRAM consumption.

Software Prerequisites

Before beginning, ensure the following are installed:

  • Python 3.10+
  • CUDA toolkit (version compatible with your GPU driver)
  • Git
  • Latest NVIDIA GPU driver

Verify GPU visibility:

nvidia-smi

If this returns GPU information, you’re good to proceed.

Method 1 — SGLang Inference Server

SGLang provides excellent MoE support and is optimized for Flash-style models. If your goal is performance and stability, use this approach.

1. Create and Activate Environment

python -m venv mimo
source mimo/bin/activate   # Windows: mimo\Scripts\activate

2. Install Dependencies

Install PyTorch first (CUDA version may differ depending on your system):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Then install SGLang:

pip install sglang
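
Before pulling tens of gigabytes of model weights, it is worth a quick sanity check that the PyTorch build you just installed can actually see your GPU:

import torch

# Should print True and your GPU name; False means the CUDA wheel or driver is mismatched
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))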

3. Download Model Files

huggingface-cli login
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./mimo-model

4. Start the Inference Server

python -m sglang.launch_server \
  --model-path ./mimo-model \
  --host 0.0.0.0 \
  --port 30000 \
  --trust-remote-code \
  --dtype float16 \
  --context-length 262144 \
  --mem-fraction-static 0.9

You should now see logs indicating the model is loading. Once the logs report that the server is ready, the API is live.

5. Test Using API

Create a quick test script and save it as test.py:

import requests

url = "http://localhost:30000/v1/chat/completions"

payload = {
    "model": "MiMo-V2-Flash",
    "messages": [
        {"role": "user", "content": "Explain cloud computing in simple terms."}
    ],
    "max_tokens": 150
}

# requests sets the JSON headers automatically when json= is used
response = requests.post(url, json=payload)
print(response.json())

Run it:

python test.py

If you receive a proper response… congratulations 🎉 MiMo-V2-Flash is successfully running locally.
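
Because the endpoint follows the OpenAI chat-completions format, you can also talk to it with the official openai Python client instead of raw requests. A minimal sketch, assuming the openai package (v1+) is installed; the local server does not check the key unless you started it with one, so a placeholder string is fine:

from openai import OpenAI

# Point the client at the local SGLang server instead of api.openai.com
client = OpenAI(base_url="http://localhost:30000/v1", api_key="local")

response = client.chat.completions.create(
    model="MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Explain cloud computing in simple terms."}],
    max_tokens=150,
)
print(response.choices[0].message.content)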

Method 2 — Transformers + Accelerate

If you prefer running the model directly inside Python with full control:

Install Required Libraries

pip install transformers accelerate bitsandbytes
pip install torch --index-url https://download.pytorch.org/whl/cu124

Run the Model

Create a file named:

mimo_run.py

Paste:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "XiaomiMiMo/MiMo-V2-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# 8-bit loading via bitsandbytes keeps VRAM usage down; remove quantization_config
# to load the full float16 weights instead
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True
)

prompt = "What is the difference between VPS and dedicated servers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Run:

python mimo_run.py

This runs inference directly, without launching a server.
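
The snippet above feeds the model a raw prompt string. If the downloaded repository ships a chat template (instruction-tuned releases usually do), wrapping your prompt with the tokenizer's chat template generally produces better-behaved answers. A small sketch of that variant, reusing the model and tokenizer loaded above:

messages = [
    {"role": "user", "content": "What is the difference between VPS and dedicated servers?"}
]

# apply_chat_template inserts the special tokens the model was fine-tuned with
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))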

Method 3 — Docker Deployment (Optional)

If you prefer isolated environments or production-grade setups, Docker is great.

Build Image

Create a Dockerfile

# Ubuntu 22.04 ships Python 3.10, matching the prerequisite above
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip git

WORKDIR /app

RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install sglang transformers accelerate

# Copy the weights downloaded earlier (./mimo-model) into the image
COPY mimo-model /app/models

EXPOSE 30000

CMD ["python3","-m","sglang.launch_server","--model-path","/app/models","--host","0.0.0.0","--port","30000","--trust-remote-code"]

Build:

docker build -t mimo-server .

Run:

docker run --gpus all -p 30000:30000 mimo-server

Performance Optimization Tips

To improve speed, memory usage, and stability:

  • Prefer FP16 over FP32 for GPU efficiency

  • Use quantization (8-bit / 4-bit) if VRAM is limited (see the sketch after this list)

  • Enable Flash Attention where possible

  • Avoid running background GPU workloads

  • Monitor hardware:

watch -n 1 nvidia-smi
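
As an example of the quantization tip above, here is a minimal sketch of 4-bit loading with bitsandbytes through Transformers. It assumes bitsandbytes is installed (Method 2 installs it); whether a given quantization scheme plays well with this particular MoE architecture is something to verify on your own hardware:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "XiaomiMiMo/MiMo-V2-Flash"

# NF4 4-bit weights with float16 compute roughly quarters the VRAM needed for weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)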

Basic Troubleshooting

Issue                Possible Cause            Fix
CUDA error           Driver / CUDA mismatch    Reinstall CUDA or update drivers
Out of Memory        GPU too small             Enable 8-bit loading
Slow performance     CPU fallback occurring    Ensure GPU recognized
Model not loading    Wrong path                Verify local directory
API not responding   Server crashed            Check logs

FAQ

Can I run MiMo-V2-Flash without a GPU?

Technically yes, but realistically no. Inference on CPU will be extremely slow and may even fail on larger prompts due to memory pressure.

Does this model require internet after installation?

No. Once weights are downloaded, all inference can happen fully offline.
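
If you want to be certain nothing is fetched from the Hugging Face Hub at load time, point Transformers at the local directory and pass local_files_only=True (a small sketch reusing the ./mimo-model directory downloaded earlier):

from transformers import AutoTokenizer, AutoModelForCausalLM

# local_files_only=True makes loading fail fast instead of reaching out to the Hub
tokenizer = AutoTokenizer.from_pretrained("./mimo-model", trust_remote_code=True, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    "./mimo-model", device_map="auto", trust_remote_code=True, local_files_only=True
)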

Can I deploy this model on a VPS?

Yes — as long as the VPS provides NVIDIA GPU resources. (For example, many users prefer GPU cloud VPS providers like LightNode because of flexible billing and global locations.)

How large is the model download?

Expect tens of gigabytes depending on version and precision. Ensure you have free disk space before downloading.

Is there Windows support?

Yes. All commands work on Windows with slight path adjustments. Use PowerShell or WSL2 if preferred.

Does MiMo-V2-Flash support long context?

Yes — it is designed with strong long-sequence handling capability and works well in agent and reasoning tasks.