Instructions to use AI-MO/NuminaMath-7B-TIR with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use AI-MO/NuminaMath-7B-TIR with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="AI-MO/NuminaMath-7B-TIR")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("AI-MO/NuminaMath-7B-TIR")
model = AutoModelForCausalLM.from_pretrained("AI-MO/NuminaMath-7B-TIR")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use AI-MO/NuminaMath-7B-TIR with vLLM:
Install from pip and serve model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "AI-MO/NuminaMath-7B-TIR"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AI-MO/NuminaMath-7B-TIR",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
docker model run hf.co/AI-MO/NuminaMath-7B-TIR
- SGLang
How to use AI-MO/NuminaMath-7B-TIR with SGLang:
Install from pip and serve model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "AI-MO/NuminaMath-7B-TIR" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AI-MO/NuminaMath-7B-TIR",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "AI-MO/NuminaMath-7B-TIR" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "AI-MO/NuminaMath-7B-TIR",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- Docker Model Runner
How to use AI-MO/NuminaMath-7B-TIR with Docker Model Runner:
docker model run hf.co/AI-MO/NuminaMath-7B-TIR
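The curl calls above can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, the default URL assumes the vLLM server from the snippet above is listening on port 8000, and the response layout (`choices[0].message.content`) is assumed to follow the usual OpenAI-compatible chat format.

```python
import json
import urllib.request

MODEL = "AI-MO/NuminaMath-7B-TIR"


def build_chat_request(prompt, model=MODEL, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes a server is already running and that it answers in the
    OpenAI-compatible response format.
    """
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

For the SGLang server, the same helper would be called with `base_url="http://localhost:30000"`, for example `ask("What is the capital of France?", base_url="http://localhost:30000")`.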
My alternative quantizations.
These are my own quantizations (updated almost daily).
The difference from normal quantizations is that I quantize the output and embed tensors to f16 and the other tensors to q5_k, q6_k or q8_0.
This creates models that are degraded little or not at all and have a smaller size.
They run at about 3-6 t/sec on CPU only using llama.cpp, and obviously faster on computers with potent GPUs.
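As an illustration of the scheme described above, here is a minimal sketch that assembles a llama.cpp quantization command keeping the output and token-embedding tensors at f16. It assumes the `llama-quantize` tool with its `--output-tensor-type` and `--token-embedding-type` options; the file names and the helper `build_quantize_command` are hypothetical, and this is not necessarily the exact command used to produce these files.

```python
import shlex

# Assumed llama.cpp type names for the base quantizations named in the text.
BASE_TYPES = ("Q5_K", "Q6_K", "Q8_0")


def build_quantize_command(src_gguf, dst_gguf, base_type, binary="llama-quantize"):
    """Return the argument list for a mixed-precision quantization run.

    The output and token-embedding tensors stay at f16; every other
    tensor is quantized to `base_type`.
    """
    if base_type not in BASE_TYPES:
        raise ValueError(f"unsupported base type: {base_type}")
    return [
        binary,
        "--output-tensor-type", "f16",
        "--token-embedding-type", "f16",
        src_gguf,
        dst_gguf,
        base_type,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    cmd = build_quantize_command("model.f16.gguf", "model.f16.q6.gguf", "Q6_K")
    print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.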
My tests show that F16 performs better than Q8_0:
https://github.com/foldl/chatllm.cpp/blob/master/docs/tool_calling.md#numinamath
@J22
That's normal, but check how q6_k and q5_k perform compared to q8_p and q8_0 :D
Then check how q6_k and q5_k perform compared to f16.
You'll be surprised.
Yes, indeed. But there is very little degradation in the others, because they also keep the most important tensors at f16 while the rest of the tensors use the specified quantization.
Got it. I have been trying to use the f16 model with open-webui but have been unable to run inference on it (I was able to do it for q6_k). I'm guessing this is an open-webui issue where it either fails or is extremely slow when dealing with LLMs that are a bit large (more than 8 GB, I think).
Thanks for the quick reply!
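The size guess in the post above can be sanity-checked with a rough estimate of file size from bits per weight. This is only a sketch: the parameter count of 7e9 is a nominal assumption for a "7B" model, the bits-per-weight figures are the llama.cpp block sizes for each type, and metadata overhead is ignored.

```python
# Approximate bits per weight for each tensor type (llama.cpp block layouts).
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q6_k": 6.5625,
    "q5_k": 5.5,
}


def estimate_size_gb(n_params, quant):
    """Estimate file size in GB (10**9 bytes) if all tensors use `quant`."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def estimate_mixed_size_gb(n_params, n_f16_params, quant):
    """Estimate size when `n_f16_params` weights stay at f16 and the rest use `quant`."""
    rest = n_params - n_f16_params
    bits = n_f16_params * BITS_PER_WEIGHT["f16"] + rest * BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9


if __name__ == "__main__":
    n_params = 7e9  # nominal assumption for a "7B" model
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {estimate_size_gb(n_params, quant):.1f} GB")
```

Under these assumptions a pure f16 file comes to roughly 14 GB, well above the 8 GB mark mentioned in the post, while a pure q6_k file stays under 6 GB. `estimate_mixed_size_gb` covers the case where the output and embed tensors are kept at f16; the number of weights in those tensors depends on the model's vocabulary and hidden size and has to be supplied by the caller.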