Run Xiaomi MiMo-V2-Flash Locally — Full Installation & Setup Guide

By LightNode

Xiaomi MiMo-V2-Flash is one of the most impressive open-source Mixture-of-Experts models currently available.
Many developers want to run it locally for privacy, full feature access, offline capability, and freedom from API rate limits, but because the model is quite large, setting it up properly can feel overwhelming.

This guide walks you through everything needed to download, install, configure, and run MiMo-V2-Flash locally, with different deployment approaches depending on your hardware and technical preferences. The tutorial is written for practical use — real commands, real steps, fewer buzzwords.

What Makes MiMo-V2-Flash Special?

MiMo-V2-Flash follows an advanced Mixture-of-Experts (MoE) design: only part of the full parameter space is activated per inference, which means you get strong reasoning performance with much better efficiency than traditional giant dense models.
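
To make the idea concrete, here is a minimal, illustrative sketch of top-k expert routing in plain PyTorch. It is not MiMo's actual implementation (the real router, expert count, and top-k value are internal to the model); it only shows why an MoE layer touches a fraction of its parameters per token.

import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE layer: each token is routed to only k of the experts."""
    def __init__(self, dim=64, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):                           # x: (tokens, dim)
        scores = self.router(x)                     # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                 # only k experts run per token
            for e in range(len(self.experts)):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out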

Key highlights:

  • High-speed inference thanks to Flash-optimized architecture
  • Designed for real tasks like reasoning, coding, agent workflows
  • Open-source model weights available
  • Supports long context scenarios
  • Suitable for local research, development, and production workloads

System Requirements

Running a large MoE model is not “lightweight”. Make sure your machine is ready.

Minimum Hardware

Component   Requirement
GPU         12GB VRAM or higher
RAM         32GB
Storage     100GB+
CPU         Multi-core modern processor

Recommended Hardware

Component   Recommended
GPU         24GB+ VRAM
RAM         64GB
Storage     200GB NVMe SSD
Multi-GPU   Supported and useful

💡 Tip: If your GPU memory is limited, you can use 8-bit or 4-bit loading to reduce VRAM consumption.

Software Prerequisites

Before beginning, ensure the following are installed:

  • Python 3.10+
  • CUDA toolkit (version compatible with your GPU driver)
  • Git
  • Latest NVIDIA GPU driver

Verify GPU visibility:

nvidia-smi

If this returns GPU information, you’re good to proceed.

Method 1 — SGLang Inference Server

SGLang provides excellent MoE support and is optimized for Flash-style models. If your goal is performance and stability, use this approach.

1. Create and Activate Environment

python -m venv mimo
source mimo/bin/activate   # Windows: mimo\Scripts\activate

2. Install Dependencies

Install PyTorch first (CUDA version may differ depending on your system):

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

Then install SGLang:

pip install sglang
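
Before pulling tens of gigabytes of model weights, it is worth a quick sanity check that the PyTorch build you just installed can actually see your GPU:

import torch

# Should print True and your GPU name; False means the CUDA wheel or driver is mismatched
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))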

3. Download Model Files

huggingface-cli login
huggingface-cli download XiaomiMiMo/MiMo-V2-Flash --local-dir ./mimo-model

4. Start the Inference Server

python -m sglang.launch_server \
  --model-path ./mimo-model \
  --host 0.0.0.0 \
  --port 30000 \
  --trust-remote-code \
  --dtype float16 \
  --context-length 262144 \
  --mem-fraction-static 0.9

You should now see logs indicating the model is loading. Once the logs report that the server is ready, the API is live.

5. Test Using API

Create a quick test script and save it as test.py:

import requests

url = "http://localhost:30000/v1/chat/completions"

payload = {
    "model": "MiMo-V2-Flash",
    "messages": [
        {"role": "user", "content": "Explain cloud computing in simple terms."}
    ],
    "max_tokens": 150
}

# requests sets the JSON headers automatically when json= is used
response = requests.post(url, json=payload)
print(response.json())

Run it:

python test.py

If you receive a proper response… congratulations 🎉 MiMo-V2-Flash is successfully running locally.
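
Because the endpoint follows the OpenAI chat-completions format, you can also talk to it with the official openai Python client instead of raw requests. A minimal sketch, assuming the openai package (v1+) is installed; the local server does not check the key unless you started it with one, so a placeholder string is fine:

from openai import OpenAI

# Point the client at the local SGLang server instead of api.openai.com
client = OpenAI(base_url="http://localhost:30000/v1", api_key="local")

response = client.chat.completions.create(
    model="MiMo-V2-Flash",
    messages=[{"role": "user", "content": "Explain cloud computing in simple terms."}],
    max_tokens=150,
)
print(response.choices[0].message.content)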

Method 2 — Transformers + Accelerate

If you prefer running the model directly inside Python with full control:

Install Required Libraries

pip install transformers accelerate bitsandbytes
pip install torch --index-url https://download.pytorch.org/whl/cu124

Run the Model

Create a file named:

mimo_run.py

Paste:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = "XiaomiMiMo/MiMo-V2-Flash"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# 8-bit loading via bitsandbytes keeps VRAM usage down; remove quantization_config
# to load the full float16 weights instead
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    trust_remote_code=True
)

prompt = "What is the difference between VPS and dedicated servers?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(output[0], skip_special_tokens=True))

Run:

python mimo_run.py

This runs inference directly, without launching a server.
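
The snippet above feeds the model a raw prompt string. If the downloaded repository ships a chat template (instruction-tuned releases usually do), wrapping your prompt with the tokenizer's chat template generally produces better-behaved answers. A small sketch of that variant, reusing the model and tokenizer loaded above:

messages = [
    {"role": "user", "content": "What is the difference between VPS and dedicated servers?"}
]

# apply_chat_template inserts the special tokens the model was fine-tuned with
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))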

Method 3 — Docker Deployment (Optional)

If you prefer isolated environments or production-grade setups, Docker is great.

Build Image

Create a Dockerfile

# Ubuntu 22.04 ships Python 3.10, matching the prerequisite above
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04

RUN apt-get update && apt-get install -y python3 python3-pip git

WORKDIR /app

RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
RUN pip3 install sglang transformers accelerate

# Copy the weights downloaded earlier (./mimo-model) into the image
COPY mimo-model /app/models

EXPOSE 30000

CMD ["python3","-m","sglang.launch_server","--model-path","/app/models","--host","0.0.0.0","--port","30000","--trust-remote-code"]

Build:

docker build -t mimo-server .

Run:

docker run --gpus all -p 30000:30000 mimo-server

Performance Optimization Tips

To improve speed, memory usage, and stability:

  • Prefer FP16 over FP32 for GPU efficiency

  • Use quantization (8-bit / 4-bit) if VRAM is limited (see the sketch after this list)

  • Enable Flash Attention where possible

  • Avoid running background GPU workloads

  • Monitor hardware:

watch -n 1 nvidia-smi
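
As an example of the quantization tip above, here is a minimal sketch of 4-bit loading with bitsandbytes through Transformers. It assumes bitsandbytes is installed (Method 2 installs it); whether a given quantization scheme plays well with this particular MoE architecture is something to verify on your own hardware:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "XiaomiMiMo/MiMo-V2-Flash"

# NF4 4-bit weights with float16 compute roughly quarters the VRAM needed for weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)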

Basic Troubleshooting

Issue                Possible Cause            Fix
CUDA error           Driver / CUDA mismatch    Reinstall CUDA or update drivers
Out of Memory        GPU too small             Enable 8-bit loading
Slow performance     CPU fallback occurring    Ensure GPU recognized
Model not loading    Wrong path                Verify local directory
API not responding   Server crashed            Check logs

FAQ

Can I run MiMo-V2-Flash without a GPU?

Technically yes, but realistically no. Inference on CPU will be extremely slow and may even fail on larger prompts due to memory pressure.

Does this model require internet after installation?

No. Once weights are downloaded, all inference can happen fully offline.
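
If you want to be certain nothing is fetched from the Hugging Face Hub at load time, point Transformers at the local directory and pass local_files_only=True (a small sketch reusing the ./mimo-model directory downloaded earlier):

from transformers import AutoTokenizer, AutoModelForCausalLM

# local_files_only=True makes loading fail fast instead of reaching out to the Hub
tokenizer = AutoTokenizer.from_pretrained("./mimo-model", trust_remote_code=True, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    "./mimo-model", device_map="auto", trust_remote_code=True, local_files_only=True
)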

Can I deploy this model on a VPS?

Yes — as long as the VPS provides NVIDIA GPU resources. (For example, many users prefer GPU cloud VPS providers like LightNode because of flexible billing and global locations.)

How large is the model download?

Expect tens of gigabytes depending on version and precision. Ensure you have free disk space before downloading.

Is there Windows support?

Yes. All commands work on Windows with slight path adjustments. Use PowerShell or WSL2 if preferred.

Does MiMo-V2-Flash support long context?

Yes — it is designed with strong long-sequence handling capability and works well in agent and reasoning tasks.