Model Card Feedback
There are a lot of front-loaded installation instructions, which will delay people getting their first inference out. Can you separate them out and hold each one back until it's needed?
Eval without the baseline is not super meaningful.
Curious why we focused only on bs=1; we had Marlin and HQQ kernels to help with bringing that up a bit.
You show speedup numbers, but people will likely be more interested in peak VRAM savings.
For vLLM, can you push the serving instructions up and the benchmarking instructions down? The overall flow should be: here's how to play with this model, and then here's how to benchmark it.
A100 is a bit of a strange benchmarking choice. Why not a consumer GPU (which you can rent on vast) or a newer enterprise GPU like an H100?
Eval without the baseline is not super meaningful.
Yeah, we are still running the baseline eval.
Curious why we focused only on bs=1; we had Marlin and HQQ kernels to help with bringing that up a bit.
We also tried gemlite int4wo, and it doesn't seem to work well with batch size 128, though batch sizes 4, 8, or 16 might be OK. There are also some issues with compile that need to be resolved right now (Hicham is working on it). In general, though, the int4wo quant method is not optimized for large batch sizes, so we decided to just release this for now, and we can release an int8 dynamic quantized model for large-batch-size serving.
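For anyone wondering why weight-only int4 tends to shine at bs=1 and fade at large batch sizes, here is a rough roofline-style sketch. It is not taken from the thread or from any benchmark: the layer shape is hypothetical, the hardware numbers are assumed A100-class ballpark figures, and it ignores dequantization cost, activation traffic, and kernel quality entirely.

```python
# Back-of-the-envelope roofline estimate for one linear layer.
# All constants below are illustrative assumptions, not measured values.
PEAK_FLOPS = 312e12     # assumed bf16 tensor-core peak, FLOP/s
MEM_BANDWIDTH = 2.0e12  # assumed memory bandwidth, bytes/s
K, N = 4096, 4096       # hypothetical weight matrix shape


def layer_times(batch, bytes_per_weight):
    """Return (compute_seconds, weight_load_seconds) for (batch x K) @ (K x N)."""
    flops = 2.0 * batch * K * N               # one multiply-add per weight per row
    weight_bytes = K * N * bytes_per_weight   # weights are read once per matmul
    return flops / PEAK_FLOPS, weight_bytes / MEM_BANDWIDTH


def is_memory_bound(batch, bytes_per_weight):
    """True when loading the weights takes longer than the arithmetic."""
    compute_s, load_s = layer_times(batch, bytes_per_weight)
    return load_s > compute_s


def ideal_int4_speedup(batch):
    """Upper bound on the bf16 -> int4 weight-only speedup under this toy model."""
    bf16_time = max(layer_times(batch, 2.0))  # bf16 weights: 2 bytes each
    int4_time = max(layer_times(batch, 0.5))  # int4 weights: 0.5 bytes each
    return bf16_time / int4_time


for batch in (1, 4, 16, 128):
    print(batch, is_memory_bound(batch, 0.5), round(ideal_int4_speedup(batch), 2))
```

Under these assumptions the layer is bound by weight loading at bs=1, so shrinking the weights pays off almost fully. By batch size 128 the arithmetic dominates, and smaller weights buy much less.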
You show speedup numbers, but people will likely be more interested in peak VRAM savings.
Good point, I'll add memory-saving results.
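Until measured numbers are available, a rough estimate of weight memory alone gives a sense of scale. The sketch below assumes about 3.8 billion parameters, a quantization group size of 128, and 4 bytes of scale/zero-point metadata per group. None of those values come from this thread. It also assumes every parameter is quantized, which may not hold for embeddings, and it ignores activations and the KV cache.

```python
# Rough weight-memory estimate; every constant here is an assumption.
NUM_PARAMS = 3.8e9         # assumed parameter count for the model
GROUP_SIZE = 128           # hypothetical int4 quantization group size
BYTES_PER_GROUP_META = 4   # hypothetical: one 2-byte scale + one 2-byte zero point


def weight_memory_gib(num_params, bits_per_weight, group_size=None, meta_bytes=0):
    """Estimate weight storage in GiB, including per-group metadata if any."""
    weight_bytes = num_params * bits_per_weight / 8
    group_bytes = (num_params / group_size) * meta_bytes if group_size else 0
    return (weight_bytes + group_bytes) / 2**30


bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4, GROUP_SIZE, BYTES_PER_GROUP_META)
print(f"bf16 weights: {bf16_gib:.2f} GiB")
print(f"int4 weights: {int4_gib:.2f} GiB")
print(f"ratio: {bf16_gib / int4_gib:.2f}x")
```

This is only an estimate of static weight storage. Actual peak VRAM would need to be measured on the GPU, for example with `torch.cuda.max_memory_allocated()` after a generation run.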
A100 is a bit of a strange benchmarking choice. Why not a consumer GPU (which you can rent on vast) or a newer enterprise GPU like an H100?
I'm not aware of which consumer GPU we should use; tinygemm is optimized for A100, I think. We do have H100 benchmarks for float8dq as well.