HuggingFaceH4/zephyr-7b-beta

Text Generation

Generated from Trainer

Eval Results (legacy)

text-generation-inference


Instructions for using HuggingFaceH4/zephyr-7b-beta with libraries and local apps.

Libraries

How to use HuggingFaceH4/zephyr-7b-beta with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
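# Generate a reply and print the returned conversation
outputs = pipe(messages, max_new_tokens=40)
print(outputs[0]["generated_text"])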

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat with the model's template and tokenize it in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
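
To make the `apply_chat_template` step above less opaque, here is a minimal sketch of the kind of prompt text it is expected to produce for this model. It assumes the Zephyr format of a `<|role|>` line, the message content, and a `</s>` end-of-sequence token; the helper name `format_zephyr_prompt` is hypothetical, and the authoritative template (including its exact whitespace) is the one stored in the tokenizer configuration.

```python
# Hypothetical helper: approximates the prompt text built by the chat template.
# The real template ships with the tokenizer and may differ in whitespace.
EOS_TOKEN = "</s>"  # assumed end-of-sequence token


def format_zephyr_prompt(messages, add_generation_prompt=True):
    """Render chat messages as Zephyr-style role-tagged segments."""
    parts = []
    for message in messages:
        # Each turn: role tag on its own line, then the content and EOS.
        parts.append(f"<|{message['role']}|>\n{message['content']}{EOS_TOKEN}\n")
    if add_generation_prompt:
        # Cue the model to begin writing the assistant turn.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Who are you?"}]
prompt = format_zephyr_prompt(messages)
print(prompt)
```

In practice, prefer `tokenizer.apply_chat_template`, since it always reflects the template published with the model; the sketch is only meant for reading and debugging prompts.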

Local Apps

How to use HuggingFaceH4/zephyr-7b-beta with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "HuggingFaceH4/zephyr-7b-beta"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceH4/zephyr-7b-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
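
The same request can be sent from Python without extra dependencies. The following is a sketch using only the standard library, assuming the server started above is listening on `localhost:8000`; the function names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request equivalent to the curl call (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and decode the JSON body; requires a running server."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request(
    "http://localhost:8000",  # assumed address of the local server
    "HuggingFaceH4/zephyr-7b-beta",
    "What is the capital of France?",
)
print(request.full_url)
```

Calling `send_chat_request(request)` performs the actual HTTP call, so it only succeeds once the server has finished loading the model.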

Use Docker

docker model run hf.co/HuggingFaceH4/zephyr-7b-beta

How to use HuggingFaceH4/zephyr-7b-beta with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "HuggingFaceH4/zephyr-7b-beta" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceH4/zephyr-7b-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
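
The curl call returns a JSON document. Assuming it follows the OpenAI-compatible chat completion layout, the assistant text sits under `choices[0].message.content`. The sketch below shows one way to extract it; `extract_reply` is a hypothetical helper and the sample payload is hand-written for illustration, not real model output.

```python
def extract_reply(response):
    """Return the assistant text from a chat completion response dict."""
    choices = response.get("choices") or []
    if not choices:
        # An empty or missing list usually means the request failed upstream.
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written sample shaped like a chat completion response (not model output).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "(reply text)"},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample_response))
```

Raising on an empty `choices` list makes a failed request visible immediately instead of surfacing later as a missing or null reply.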

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "HuggingFaceH4/zephyr-7b-beta" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "HuggingFaceH4/zephyr-7b-beta",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
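
A container that has to download and load a 7B model can take a while before it accepts requests, so calling it right after `docker run` may fail. Below is a minimal sketch of a wait loop that polls the TCP port with the standard library. The helper name `wait_for_port` and its defaults are assumptions, and an open port is only a rough readiness signal: with Docker port publishing, the port may accept connections before the server inside is ready.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Poll until a TCP connection succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For the setup above, `wait_for_port("localhost", 30000)` would block until the published port answers or five minutes pass, after which the chat completion request can be attempted.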

Docker Model Runner
How to use HuggingFaceH4/zephyr-7b-beta with Docker Model Runner:
```
docker model run hf.co/HuggingFaceH4/zephyr-7b-beta
```

Resources


try

#77 opened 2 months ago by

Install & run HuggingFaceH4/zephyr-7b-beta easily using llmpm

#76 opened 2 months ago by

Update README.md

#75 opened 7 months ago by

Trying to 'Use this model', getting: ' The requested model 'HuggingFaceH4/zephyr-7b-beta' is not supported by any provider you have enabled'

#74 opened 9 months ago by

For restart endpoint

#73 opened 11 months ago by

Model's End Point paused

#72 opened 11 months ago by

Potential Inconsistencies Model and Base Model License

#71 opened 12 months ago by

Update

#70 opened 12 months ago by

function calling

#69 opened about 1 year ago by

Zephyr maximum token limit

#68 opened about 1 year ago by

Getting NULL response using curl or PHP

#67 opened over 1 year ago by

Adding Evaluation Results

#66 opened over 1 year ago by

Update README.md

#65 opened over 1 year ago by

Some no safe political output for China mainland network use.

#64 opened over 1 year ago by

My_duplictemodel

#63 opened almost 2 years ago by

Model is not generating an answer or it takes a really long time

#62 opened about 2 years ago by

Update README.md

#61 opened about 2 years ago by

🚩Still receiving 'Fetch Failed' Error

#60 opened about 2 years ago by

🚩 Report: Chat Not Working

#59 opened about 2 years ago by

Internal Server Error

#58 opened about 2 years ago by

Demo different from API

#57 opened about 2 years ago by

Nodejs?Nextjs streaming

#56 opened about 2 years ago by

Zephyr is off

#55 opened about 2 years ago by

System prompts and settings from HF's Zephyr 7b-beta?

#53 opened over 2 years ago by ParanoidPosition

Truncating Response

#52 opened over 2 years ago by

getting CUDA out of memory

#51 opened over 2 years ago by

Weird Responses?

#50 opened over 2 years ago by

zephyr-7b-beta with VLLM

#49 opened over 2 years ago by

100k is converted into $100,00

#48 opened over 2 years ago by

response is weird

#47 opened over 2 years ago by

Taking way to long to generate a response

#46 opened over 2 years ago by

Answering in Spanish

#45 opened over 2 years ago by

Update Ruined Inference

#44 opened over 2 years ago by

Error Message when the number of input tokens exceeds 2000. I am using ml.g4dn.8xlarge instance (128 GiB).

#43 opened over 2 years ago by

What EC2 configuration/instance should I use?

#41 opened over 2 years ago by

Zephyr hallucinations with conversational memory

#39 opened over 2 years ago by

Context length?

#38 opened over 2 years ago by

[AUTOMATED] Model Memory Requirements

#37 opened over 2 years ago by model-sizer-bot

What's the difference between zephyr-7b-beta and zephyr-7b-alpha?

#36 opened over 2 years ago by

[AUTOMATED] Model Memory Requirements

#35 opened over 2 years ago by model-sizer-bot

Can zephyr-7b support YARN 128K context window ?

#33 opened over 2 years ago by

Why isn't the `model_max_length` set to 2048?

#32 opened over 2 years ago by

Did the LoRa finetuned model end up performing the same compared to full-finetuning?

#30 opened over 2 years ago by

How do I achieve streaming output

#29 opened over 2 years ago by

BFloat16 is not supported on MPS

#27 opened over 2 years ago by

Optimize Response Length and Quality

#26 opened over 2 years ago by

Add widget examples

#25 opened over 2 years ago by

Understand reward metrics

#22 opened over 2 years ago by

Question on License given use of Ultrachat

#21 opened over 2 years ago by

Very Nice Work, But It Can't Be Prompted To Tell Stories

#19 opened over 2 years ago by deleted