Run Xiaomi MiMo-V2-Flash Locally — Full Installation & Setup Guide
Xiaomi MiMo-V2-Flash is one of the most impressive open-source Mixture-of-Experts models currently available.
Many developers want to run it locally for privacy, full feature access, offline capability, and freedom from API rate limits. Because the model is quite large, however, setting it up properly can feel overwhelming.
This guide walks you through everything needed to download, install, configure, and run MiMo-V2-Flash locally, with different deployment approaches depending on your hardware and technical preferences. The tutorial is written for practical use — real commands, real steps, fewer buzzwords.
What Makes MiMo-V2-Flash Special?
MiMo-V2-Flash follows an advanced Mixture-of-Experts (MoE) design: only part of the full parameter space is activated per inference, which means you get strong reasoning performance with much better efficiency than traditional giant dense models.
Key highlights:
- High-speed inference thanks to Flash-optimized architecture
- Designed for real tasks like reasoning, coding, agent workflows
- Open-source model weights available
- Supports long context scenarios
- Suitable for local research, development, and production workloads
System Requirements
Running a large MoE model is not “lightweight”. Make sure your machine is ready.
Minimum Hardware
| Component | Requirement |
|---|---|
| GPU | 12GB VRAM or higher |
| RAM | 32GB |
| Storage | 100GB+ |
| CPU | Multi-core modern processor |
Recommended for Smoother Experience
| Component | Recommended |
|---|---|
| GPU | 24GB+ VRAM |
| RAM | 64GB |
| Storage | 200GB NVMe SSD |
| Multi-GPU | Supported and useful |
💡 Tip: If your GPU memory is limited, you can use 8-bit or 4-bit loading to reduce VRAM consumption.
Software Prerequisites
Before beginning, ensure the following are installed:
- Python 3.10+
- A recent NVIDIA GPU driver
- A CUDA toolkit compatible with that driver and with the PyTorch build you install
- Git
Verify GPU visibility:
nvidia-smi
If this returns GPU information, you’re good to proceed.
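Since the minimum spec above calls for 12GB of VRAM, it is also worth confirming exactly how much memory your card has and which driver it is running. nvidia-smi's query mode prints just those fields:
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv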
Method 1 – Deploy with SGLang (Recommended)
SGLang provides excellent MoE support and is optimized for Flash-style models. If your goal is performance + stability, use this approach.
1. Create and Activate Environment
python -m venv mimo
source mimo/bin/activate # Windows: mimo\Scripts\activate
2. Install Dependencies
Install PyTorch first (CUDA version may differ depending on your system):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Then install SGLang:
pip install sglang
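Before downloading tens of gigabytes of weights, run a one-line sanity check that the PyTorch build you just installed can actually see the GPU:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'no GPU')"
If this prints False, the CUDA wheel and your driver likely do not match; revisit the PyTorch install command above.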
3. Download Model Files
huggingface-cli login
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./mimo-model
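If you prefer to script the download (or resume an interrupted one), the same thing can be done from Python with huggingface_hub, the library the CLI is built on. A minimal sketch, assuming you have already run huggingface-cli login:
from huggingface_hub import snapshot_download

# Downloads (or resumes) all model files into ./mimo-model
snapshot_download(repo_id="XiaomiMiMo/MiMo-V2-Flash", local_dir="./mimo-model")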
4. Start the Inference Server
python -m sglang.launch_server \
--model-path ./mimo-model \
--host 0.0.0.0 \
--port 30000 \
--trust-remote-code \
--dtype float16 \
--context-length 262144 \
--mem-fraction-static 0.9
You should now see logs indicating the model is loading. Once the logs report that the server is ready, the API is live.
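Before writing any client code, you can confirm the server is reachable from another terminal. SGLang exposes an OpenAI-compatible HTTP API, so listing the served models works as a quick smoke test (the endpoint path is assumed from that OpenAI-style interface):
curl http://localhost:30000/v1/models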
5. Test Using API
Create a quick test script and save it as test.py:
import json
import requests

url = "http://localhost:30000/v1/chat/completions"
payload = {
    "model": "MiMo-V2-Flash",
    "messages": [
        {"role": "user", "content": "Explain cloud computing in simple terms."}
    ],
    "max_tokens": 150
}

response = requests.post(url, headers={"Content-Type": "application/json"}, data=json.dumps(payload))
print(response.json())
Run it:
python test.py
If you receive a proper response… congratulations 🎉 MiMo-V2-Flash is successfully running locally.
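Because the endpoint is OpenAI-compatible, you can also talk to it with the official openai Python client instead of raw requests. A sketch assuming openai >= 1.0 is installed (pip install openai), no API key is enforced locally, and the model name matches the test script above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Explain cloud computing in simple terms."}],
    max_tokens=150,
)
print(response.choices[0].message.content)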
Method 2 — Transformers + Accelerate
If you prefer running the model directly inside Python with full control:
Install Required Libraries
pip install transformers accelerate bitsandbytes
pip install torch --index-url https://download.pytorch.org/whl/cu124
Run the Model
Create a file named:
mimo_run.py
Paste:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "XiaomiMiMo/MiMo-V2-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# 8-bit loading via bitsandbytes keeps VRAM usage down;
# drop quantization_config to load in plain FP16 instead
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True,
)

prompt = "What is the difference between VPS and dedicated servers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Run:
python mimo_run.py
This runs inference directly, without launching a server.
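If 8-bit loading still does not fit your VRAM, the 4-bit option mentioned in the tip earlier roughly halves the weight footprint again, usually at some quality cost; how much quality MiMo-V2-Flash loses at 4-bit is an assumption you should test on your own prompts. A minimal sketch using the same bitsandbytes stack:
from transformers import BitsAndBytesConfig
import torch

# NF4 4-bit quantization with FP16 compute for the non-quantized parts
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
# Pass quantization_config=bnb_config to AutoModelForCausalLM.from_pretrained(...)
# in place of the 8-bit config used in mimo_run.py above.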
Method 3 — Docker Deployment (Optional)
If you prefer isolated environments or production-grade setups, Docker is great.
Build Image
Create a Dockerfile
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip git
WORKDIR /app
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install sglang transformers accelerate
COPY models/MiMo-V2-Flash /app/models
EXPOSE 30000
CMD ["python3","-m","sglang.launch_server","--model-path","/app/models","--host","0.0.0.0","--port","30000","--trust-remote-code"]
Build:
docker build -t mimo-server .
Run:
docker run --gpus all -p 30000:30000 mimo-server
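Baking the weights into the image (the COPY line above) produces a very large image. An alternative is to keep the weights on the host and mount them at runtime. A sketch assuming your weights live in ./mimo-model on the host (remove the COPY line from the Dockerfile if you go this route):
docker run --gpus all -p 30000:30000 \
  -v $(pwd)/mimo-model:/app/models \
  mimo-server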
Performance Optimization Tips
To improve speed, memory usage, and stability:
- Prefer FP16 over FP32 for GPU efficiency
- Use quantization (8-bit / 4-bit) if VRAM is limited
- Enable Flash Attention where possible
- Avoid running background GPU workloads
- Monitor hardware while the model runs:
watch -n 1 nvidia-smi
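watch is a Linux utility; if you are on Windows or simply prefer a single command, nvidia-smi has a built-in loop flag that refreshes the readout on its own:
nvidia-smi -l 1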
Basic Troubleshooting
| Issue | Possible Cause | Fix |
|---|---|---|
| CUDA error | Driver / CUDA mismatch | Reinstall CUDA or update drivers |
| Out of Memory | GPU too small | Enable 8-bit loading |
| Slow performance | CPU fallback occurring | Ensure GPU recognized |
| Model not loading | Wrong path | Verify local directory |
| API not responding | Server crashed | Check logs |
FAQ
Can I run MiMo-V2-Flash without a GPU?
Technically yes, but realistically no. Inference on CPU will be extremely slow and may even fail on larger prompts due to memory pressure.
Does this model require internet after installation?
No. Once weights are downloaded, all inference can happen fully offline.
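To make sure nothing accidentally reaches out to the Hugging Face Hub at load time, you can force offline mode with environment variables that the transformers / huggingface_hub stack honors, for example:
export HF_HUB_OFFLINE=1
export TRANSFORMERS_OFFLINE=1
python mimo_run.py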
Can I deploy this model on a VPS?
Yes — as long as the VPS provides NVIDIA GPU resources. (For example many users prefer GPU cloud VPS providers like LightNode because of flexible billing and global locations.)
How large is the model download?
Expect tens of gigabytes depending on version and precision. Ensure you have free disk space before downloading.
Is there Windows support?
Yes. All commands work on Windows with slight path adjustments. Use PowerShell or WSL2 if preferred.
Does MiMo-V2-Flash support long context?
Yes — it is designed with strong long-sequence handling capability and works well in agent and reasoning tasks.
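Keep in mind that long context comes with a memory cost, since the server budgets KV-cache space for the longest allowed sequence. If your workloads do not need the full window, launching with a smaller --context-length (the Method 1 flag) is an easy way to reclaim VRAM; a hedged example:
python -m sglang.launch_server \
--model-path ./mimo-model \
--trust-remote-code \
--context-length 32768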
