# Efficient MoE-based LLM

Part of the **Efficient MoE-based LLM** collection: Mixture-of-Experts Large Language Models with Advanced Quantization.
We developed a weight-only quantization method specialized for the Mixture-of-Experts (MoE) architecture, and we release Qwen3-30B-A3B quantized with our algorithm. The quantized weights are packed using an AutoRound-based quantization format.
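To illustrate the general idea of weight-only INT4 quantization (this is a minimal generic sketch, not the NotaMoEQuant or AutoRound algorithm, which additionally tune rounding and packing), here is a per-group symmetric quantize/dequantize round trip:

```python
import numpy as np

def quantize_int4_groupwise(w, group_size=128):
    """Symmetric per-group INT4 quantization of a 1-D weight vector.
    Illustrative only; production methods (e.g. AutoRound) also learn
    rounding offsets and clipping ranges per group."""
    w = w.reshape(-1, group_size)
    # scale each group so its max-magnitude weight maps to 7 (INT4 range: -8..7)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    # recover approximate FP weights from the INT4 codes and per-group scales
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scales = quantize_int4_groupwise(w)
w_hat = dequantize(q, scales)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Storing 4-bit codes plus one scale per group is what drives the large memory reduction reported in the benchmark table below.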
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nota-ai/Qwen3-30B-A3B-NotaMoEQuant-Int4"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# prepare the model input
prompt = "What is a large language model?"
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # let the model emit its reasoning before the answer
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=100,
)

# decode only the newly generated tokens, skipping the echoed prompt
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
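With `enable_thinking=True`, Qwen3-style models emit their reasoning inside `<think>...</think>` tags before the final answer. A minimal sketch for separating the two parts of the decoded string (the tag name follows Qwen3's convention; adjust if the model's output format differs):

```python
def split_thinking(decoded: str):
    """Split a decoded Qwen3-style completion into (thinking, answer).
    Returns empty thinking content if no </think> tag is present."""
    marker = "</think>"
    if marker in decoded:
        thinking, answer = decoded.split(marker, 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", decoded.strip()

thinking, answer = split_thinking(
    "<think>Define the term first.</think>A large language model is ..."
)
print(answer)  # -> A large language model is ...
```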
| Model | PPL (WikiText2) | MMLU-Pro | AIME25 | LiveCodeBench v6 | Total TPS (Tokens/Sec.) | Memory (GB) |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B (BF16) | 10.8955 | 75.47 | 70.00 | 55.70 | 1136.56 | 58.23 |
| Nota MoEQuant (INT4) | 11.3046 | 74.84 | 70.00 | 60.18 | 1262.21 | 16.01 |
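As a quick sanity check on the headline gains, the memory reduction and throughput improvement can be computed directly from the numbers reported in the table:

```python
# figures taken from the benchmark table above
bf16_mem, int4_mem = 58.23, 16.01        # memory footprint, GB
bf16_tps, int4_tps = 1136.56, 1262.21    # total tokens/sec

print(f"memory reduction: {bf16_mem / int4_mem:.2f}x")              # -> 3.64x
print(f"throughput gain:  {(int4_tps / bf16_tps - 1) * 100:.1f}%")  # -> 11.1%
```

So the INT4 model needs roughly 3.6x less memory while generating about 11% faster, at the cost of a small perplexity increase (10.90 to 11.30) and under one point of MMLU-Pro.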