# Efficient MoE-based LLM

Part of the **Efficient MoE-based LLM** collection: Mixture-of-Experts Large Language Models with Advanced Quantization.
We developed a weight-only quantization method specialized for the Mixture-of-Experts (MoE) architecture, and we release Qwen3-30B-A3B quantized with our algorithm. The quantized weights are packed using an AutoRound-based quantization format.
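To illustrate the general idea of weight-only INT4 quantization (this is a minimal generic sketch, not the NotaMoEQuant or AutoRound algorithm, which additionally tune rounding and packing), here is a per-group symmetric quantize/dequantize round trip:

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric per-group INT4 quantization of a 1-D weight vector.
    Illustrative only; production methods (e.g. AutoRound) also learn
    rounding offsets and clipping ranges per group."""
    w = w.reshape(-1, group_size)
    # scale each group so its max-magnitude weight maps to 7 (INT4 range: -8..7)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    # recover approximate FP weights from the INT4 codes and per-group scales
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scales = quantize_int4_groupwise(w)
w_hat = dequantize(q, scales)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Storing 4-bit codes plus one scale per group is what drives the large memory reduction reported in the benchmark table below.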
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nota-ai/Qwen3-30B-A3B-NotaMoEQuant-Int4"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# prepare the model input
prompt = "What is a large language model?"
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # let the model emit its reasoning before the answer
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100,
)

# decode only the newly generated tokens, skipping the echoed prompt
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
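With `enable_thinking=True`, Qwen3-style models emit their reasoning inside `<think>...</think>` tags before the final answer. A minimal sketch for separating the two parts of the decoded string (the tag name follows Qwen3's convention; adjust if the model's output format differs):

```python
def split_thinking(decoded: str):
    """Split a decoded Qwen3-style completion into (thinking, answer).
    Returns empty thinking content if no </think> tag is present."""
    marker = "</think>"
    if marker in decoded:
        thinking, answer = decoded.split(marker, 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", decoded.strip()

thinking, answer = split_thinking(
    "<think>Define the term first.</think>A large language model is ..."
)
print(answer)  # -> A large language model is ...
```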
| Model | PPL (WikiText2) | MMLU-Pro | AIME25 | LiveCodeBench v6 | Total TPS (Tokens/Sec.) | Memory (GB) |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B (BF16) | 10.8955 | 75.47 | 70.00 | 55.70 | 1136.56 | 58.23 |
| Nota MoEQuant (INT4) | 11.3046 | 74.84 | 70.00 | 60.18 | 1262.21 | 16.01 |
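As a quick sanity check on the headline gains, the memory reduction and throughput improvement can be computed directly from the numbers reported in the table:

```python
# figures taken from the benchmark table above
bf16_mem, int4_mem = 58.23, 16.01        # memory footprint, GB
bf16_tps, int4_tps = 1136.56, 1262.21    # total tokens/sec

print(f"memory reduction: {bf16_mem / int4_mem:.2f}x")              # -> 3.64x
print(f"throughput gain:  {(int4_tps / bf16_tps - 1) * 100:.1f}%")  # -> 11.1%
```

So the INT4 model needs roughly 3.6x less memory while generating about 11% faster, at the cost of a small perplexity increase (10.90 to 11.30) and under one point of MMLU-Pro.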