Instructions for using pytorch/Phi-4-mini-instruct-INT4 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use pytorch/Phi-4-mini-instruct-INT4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="pytorch/Phi-4-mini-instruct-INT4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("pytorch/Phi-4-mini-instruct-INT4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("pytorch/Phi-4-mini-instruct-INT4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use pytorch/Phi-4-mini-instruct-INT4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "pytorch/Phi-4-mini-instruct-INT4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pytorch/Phi-4-mini-instruct-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
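
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started by `vllm serve` above is listening on localhost:8000, and the helper names `build_chat_payload` and `chat` and the `max_tokens` value are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "pytorch/Phi-4-mini-instruct-INT4"
# Assumption: the vLLM server from the step above is running on its default port.
BASE_URL = "http://localhost:8000/v1"


def build_chat_payload(prompt, model=MODEL_ID, max_tokens=64):
    """Build the JSON body for an OpenAI-compatible chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=60):
    """POST the prompt to /chat/completions and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # The response follows the OpenAI chat completion schema.
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
#   print(chat("What is the capital of France?"))
```

Pointing `base_url` at http://localhost:30000/v1 should work the same way against the SGLang server described further down, since both expose an OpenAI-compatible API.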

Use Docker

docker model run hf.co/pytorch/Phi-4-mini-instruct-INT4

SGLang

How to use pytorch/Phi-4-mini-instruct-INT4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "pytorch/Phi-4-mini-instruct-INT4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pytorch/Phi-4-mini-instruct-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "pytorch/Phi-4-mini-instruct-INT4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pytorch/Phi-4-mini-instruct-INT4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use pytorch/Phi-4-mini-instruct-INT4 with Docker Model Runner:
```
docker model run hf.co/pytorch/Phi-4-mini-instruct-INT4
```

Model Card Feedback

by marksaroufim - opened May 1, 2025

Discussion

marksaroufim

pytorch org May 1, 2025

edited May 1, 2025

There are a lot of frontloaded installation instructions, which will delay people getting their first inference out. Can you separate them out until they're needed?
Eval without the baseline is not super meaningful
Curious how come we focused only on bs=1, we had marlin and hqq kernels to help with bringing that up a bit
You show speedup numbers but likely people will be interested more in peak VRAM savings
For vLLM, can you push serving instructions up and benchmarking instructions down? The overall flow should be: here's how to play with this model, and then here's how to benchmark it
A100 is a bit of a strange benchmarking choice, why not a consumer GPU (which you can rent on vast) or a newer enterprise GPU like an H100

jerryzh168

pytorch org May 1, 2025

Eval without the baseline is not super meaningful

yeah we are still running baseline eval

Curious how come we focused only on bs=1, we had marlin and hqq kernels to help with bringing that up a bit

We also tried gemlite int4wo, and it doesn't seem to work well with batch size 128, though batch sizes 4, 8, or 16 might be OK. There were also some issues with compile that need to be resolved (Hicham is working on it). In general, the int4wo quant method is not optimized for large batch sizes, so we decided to just release this for now, and we can release an int8 dynamic quantized model for large batch size serving.

You show speedup numbers but likely people will be interested more in peak VRAM savings

good point, I'll add memory saving results.
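
As a rough illustration of the weight-memory savings being discussed, here is a back-of-envelope sketch. It is not a measurement: the parameter count (about 3.8B), the group size of 128, and the 4 bytes of scale/zero-point metadata per group are all assumptions made for the example, and `weight_memory_gb` is a hypothetical helper. It also treats every parameter as quantized, so the INT4 figure is a lower bound if some layers such as embeddings stay in higher precision.

```python
def weight_memory_gb(num_params, bits_per_weight, group_size=None, bytes_per_group_meta=0):
    """Estimate weight storage in GB (1e9 bytes), ignoring activations and KV cache."""
    weight_bytes = num_params * bits_per_weight / 8
    meta_bytes = 0.0
    if group_size:
        # One set of quantization metadata (e.g. scale and zero point) per group.
        meta_bytes = num_params / group_size * bytes_per_group_meta
    return (weight_bytes + meta_bytes) / 1e9


# Assumption: roughly 3.8B parameters for Phi-4-mini-instruct.
NUM_PARAMS = 3.8e9

bf16_gb = weight_memory_gb(NUM_PARAMS, bits_per_weight=16)
# Assumption: group size 128 with a 2-byte scale and 2-byte zero point per group.
int4_gb = weight_memory_gb(NUM_PARAMS, bits_per_weight=4, group_size=128, bytes_per_group_meta=4)

print(f"bf16 weights: {bf16_gb:.2f} GB")
print(f"int4 weights: {int4_gb:.2f} GB")
print(f"ratio:        {bf16_gb / int4_gb:.2f}x")
```

Actual peak VRAM also includes activations, the KV cache, and allocator overhead, so it has to be measured on the target GPU, for example with `torch.cuda.max_memory_allocated()` after a generation run.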

A100 is a bit of a strange benchmarking choice, why not a consumer GPU (which you can rent on vast) or a newer enterprise GPU like an H100

Not aware of which consumer GPU we should use; tinygemm is optimized for A100, I think. We do have H100 benchmarks for float8dq as well.

jerryzh168 changed discussion status to closed May 3, 2025
