Instructions for using AI-MO/NuminaMath-7B-TIR with libraries and local apps.

Libraries

How to use AI-MO/NuminaMath-7B-TIR with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AI-MO/NuminaMath-7B-TIR")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so it is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AI-MO/NuminaMath-7B-TIR")
model = AutoModelForCausalLM.from_pretrained("AI-MO/NuminaMath-7B-TIR")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AI-MO/NuminaMath-7B-TIR with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AI-MO/NuminaMath-7B-TIR"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AI-MO/NuminaMath-7B-TIR",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
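
The same request can be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, model="AI-MO/NuminaMath-7B-TIR",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice.

    Assumes the server is already running and returns an
    OpenAI-style response body.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, once the server is up:
#   print(chat("What is the capital of France?"))
```

For the SGLang server, passing `base_url="http://localhost:30000"` should be the only change needed, since both servers expose the same route.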

Use Docker

docker model run hf.co/AI-MO/NuminaMath-7B-TIR

SGLang

How to use AI-MO/NuminaMath-7B-TIR with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AI-MO/NuminaMath-7B-TIR" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AI-MO/NuminaMath-7B-TIR",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AI-MO/NuminaMath-7B-TIR" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AI-MO/NuminaMath-7B-TIR",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AI-MO/NuminaMath-7B-TIR with Docker Model Runner:
```
docker model run hf.co/AI-MO/NuminaMath-7B-TIR
```

My alternative quantizations.

by ZeroWw - opened Jul 12, 2024

Discussion

ZeroWw

Jul 12, 2024 • edited Jul 23, 2024

These are my own quantizations (updated almost daily).

The difference from normal quantizations is that I keep the output and embedding tensors at f16 and quantize the other tensors to q5_k, q6_k or q8_0.
This produces models with little or no degradation and a smaller size.
They run at about 3-6 t/sec on CPU only using llama.cpp, and obviously faster on computers with powerful GPUs.

ZeroWw/NuminaMath-7B-TIR-GGUF
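
A mixed quantization like the one described above can be sketched with llama.cpp's `llama-quantize` tool, which accepts `--output-tensor-type` and `--token-embedding-type` to override the type of those two tensors. The snippet below only builds and prints the command. The file names and the helper `build_quantize_cmd` are hypothetical, and this is an assumption about how such files could be produced, not a record of the exact command the poster used.

```python
import shlex


def build_quantize_cmd(src_gguf, dst_gguf, quant_type="Q6_K",
                       keep_type="f16", binary="llama-quantize"):
    """Return an argument list that keeps the output and token-embedding
    tensors at `keep_type` and quantizes the remaining tensors to `quant_type`."""
    return [
        binary,
        "--output-tensor-type", keep_type,
        "--token-embedding-type", keep_type,
        src_gguf,
        dst_gguf,
        quant_type,
    ]


# Hypothetical file names, for illustration only.
cmd = build_quantize_cmd("model.f16.gguf", "model.f16.q6.gguf")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, provided a llama.cpp build is on the `PATH`.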

J22

Jul 14, 2024

My tests show that F16 performs better than Q8_0:

https://github.com/foldl/chatllm.cpp/blob/master/docs/tool_calling.md#numinamath

ZeroWw

Jul 15, 2024 • edited Jul 15, 2024

@J22
that's normal.. but check how q6_k and q5_k perform compared to q8_p and q8_0 :D

and then check how q6_k and q5_k perform compared to f16.

you'll be surprised.

Haleshot

Jul 27, 2024

> @J22
> that's normal.. but check how q6_k and q5_k perform compared to q8_p and q8_0 :D
>
> and then check how q6_k and q5_k perform compared to f16.
>
> you'll be surprised.

Would you say that F16 has the best output out of all the quantized formats?

ZeroWw

Jul 27, 2024

Yes, indeed. But there is very little degradation in the others, because they also keep the most important tensors at f16 while the rest of the tensors are at the specified quantization.

Haleshot

Jul 27, 2024

> yes. indeed. but there is very little degradation in the others because also the others have the most important tensors still at f16 while the rest of the tensors are at the specified quantization.

Got it. I have been trying to use the f16 model with open-webui but have been unable to run inference on it (I was able to do it for q6_k). I'm guessing this is an open-webui issue where it either fails or is extremely slow when dealing with LLMs that are a bit large (more than 8GB, I think).
Thanks for the quick reply!
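
As a rough check on the file sizes mentioned in this thread, the size of a GGUF file can be estimated from the parameter count and the bits per weight of each format. This is a back-of-the-envelope sketch: the parameter count of 7e9 is a round-number assumption, the bits-per-weight values follow the ggml block layouts (for example, q8_0 stores 32 weights in 34 bytes), and the estimate ignores the extra space taken by tensors kept at f16 in the mixed quantizations discussed above.

```python
# Approximate bits per weight for each format, from the ggml block layouts.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,     # 34 bytes per 32 weights
    "q6_k": 6.5625,  # 210 bytes per 256 weights
    "q5_k": 5.5,     # 176 bytes per 256 weights
}


def estimate_gb(n_params, quant):
    """Estimate file size in GB (10^9 bytes) for a uniformly quantized model."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


N_PARAMS = 7e9  # assumption: round figure for a "7B" model

for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{estimate_gb(N_PARAMS, quant):.1f} GB")
```

Under these assumptions the f16 file comes out at about 14 GB and the q6_k file at under 6 GB, which is consistent with f16 being the only variant well above the 8GB mark mentioned above.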
