Instructions to use arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and move the tensors to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer
- SGLang
How to use arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer with Docker Model Runner:
docker model run hf.co/arcee-ai/Mistral-7B-Instruct-v0.2-sliced-24-layer
Quick Summary
This model is a layer-pruned adaptation of mistralai/Mistral-7B-Instruct-v0.2, following the technique described in the paper "The Unreasonable Ineffectiveness of the Deeper Layers." It was produced with the MergeKit and PruneMe repositories, which were used to remove redundant deeper layers with the goal of not compromising the model's ability to generate coherent text. The model is maintained by Arcee-ai and is a practical example of improving computational efficiency in Large Language Models (LLMs) by balancing performance against resource usage.
Model Description
This model is a variant of mistralai/Mistral-7B-Instruct-v0.2 made more efficient through selective layer pruning. Developed by Arcee-ai, it applies the findings of "The Unreasonable Ineffectiveness of the Deeper Layers." The PruneMe and MergeKit tools were used to identify and remove redundant layers, yielding a leaner model that is still intended to produce high-quality text.
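To make the idea of "redundant layers" concrete, here is a minimal sketch of a similarity-based selection step in the spirit of the cited paper: measure the angular distance between the representation entering a block of n consecutive layers and the representation leaving it, then pick the block where that distance is smallest. The function names and the one-vector-per-layer input are assumptions made for illustration; this is not the PruneMe code, and a real run would average over many tokens and samples.

```python
import math


def angular_distance(x, y):
    """Angular distance in [0, 1] between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cos) / math.pi


def best_block_to_prune(layer_states, n):
    """Pick the start index of the n-layer block whose removal changes the
    representation least.

    layer_states[l] is the hidden state entering layer l, and the final entry
    is the output of the last layer, so a model with L layers supplies L + 1
    vectors. Removing layers l .. l+n-1 connects layer_states[l] directly to
    layer_states[l + n].
    """
    num_layers = len(layer_states) - 1
    if not 0 < n <= num_layers:
        raise ValueError("n must be between 1 and the number of layers")
    candidates = [
        (angular_distance(layer_states[l], layer_states[l + n]), l)
        for l in range(num_layers - n + 1)
    ]
    distance, start = min(candidates)
    return start, distance


# Toy example with made-up 2-d states for a 4-layer model (illustrative only).
toy_states = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [1.0, 1.0]]
start, distance = best_block_to_prune(toy_states, n=2)
print(f"prune layers {start}..{start + 1}, angular distance {distance:.3f}")
```

In the toy data, layers 1 and 2 barely rotate the representation, so that block is selected.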
Model Sources
- Pruning: PruneMe GitHub (unofficial)
- Paper: "The Unreasonable Ineffectiveness of the Deeper Layers"
- Merging Repository: MergeKit GitHub
Uses
This pruned model is intended for a range of NLP tasks, with a focus on maintaining, or even enhancing, the original model's ability to generate coherent text despite its reduced size. It illustrates that layer pruning can preserve a model's essential capabilities, and it offers a template for reducing computational resource requirements.
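As a rough illustration of the size reduction, the sketch below counts parameters for a Mistral-7B-style decoder at two depths. The dimensions are those of the published Mistral-7B configuration (hidden size 4096, MLP size 14336, 32 attention heads, 8 key/value heads, vocabulary 32000, untied embeddings). That the pruned model keeps 24 of the original 32 layers is an assumption inferred from the model name, not something stated in this card.

```python
def mistral_param_count(
    num_layers,
    hidden=4096,
    intermediate=14336,
    num_heads=32,
    num_kv_heads=8,
    vocab=32000,
):
    """Parameter count for a Mistral-style decoder (no biases, untied embeddings)."""
    head_dim = hidden // num_heads
    kv_dim = num_kv_heads * head_dim
    attention = 2 * hidden * hidden + 2 * hidden * kv_dim  # q, o and k, v projections
    mlp = 3 * hidden * intermediate  # gate, up and down projections
    norms = 2 * hidden  # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden  # input embedding plus separate LM head
    final_norm = hidden
    return num_layers * per_layer + embeddings + final_norm


full = mistral_param_count(32)
pruned = mistral_param_count(24)  # assumed depth, inferred from the model name
print(f"32 layers: {full / 1e9:.2f}B parameters")
print(f"24 layers: {pruned / 1e9:.2f}B parameters")
print(f"reduction: {100 * (1 - pruned / full):.1f}%")
```

Because the embedding and output matrices are untouched by layer pruning, removing a quarter of the layers removes somewhat less than a quarter of the parameters.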
Downstream Use
The pruned model is a suitable foundation for fine-tuning on specific tasks and a good candidate for continued pre-training. It directly applies the principles outlined in "The Unreasonable Ineffectiveness of the Deeper Layers," using the MergeKit and PruneMe repositories for the pruning itself. The model demonstrates the potential for significant reductions in computational resource requirements without detrimental effects on performance.
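For readers who want to see what the pruning step can look like in practice, here is a hedged sketch that builds a MergeKit-style passthrough configuration keeping every layer except one contiguous block. The helper name and the example block (start 20, length 8) are hypothetical; this card does not state which layers were actually removed.

```python
import json


def passthrough_prune_config(model, total_layers, start, n, dtype="bfloat16"):
    """Build a passthrough merge config that drops layers start .. start+n-1.

    Layer ranges are half-open, [begin, end), matching MergeKit's layer_range.
    """
    if start < 0 or n <= 0 or start + n > total_layers:
        raise ValueError("pruned block must lie within the model's layers")
    slices = []
    if start > 0:
        # Layers before the pruned block.
        slices.append({"sources": [{"model": model, "layer_range": [0, start]}]})
    if start + n < total_layers:
        # Layers after the pruned block.
        slices.append(
            {"sources": [{"model": model, "layer_range": [start + n, total_layers]}]}
        )
    return {"slices": slices, "merge_method": "passthrough", "dtype": dtype}


# Hypothetical example: drop 8 of 32 layers, leaving 24.
config = passthrough_prune_config(
    "mistralai/Mistral-7B-Instruct-v0.2", total_layers=32, start=20, n=8
)
print(json.dumps(config, indent=2))
```

The printed structure mirrors the slices/sources layout of a MergeKit YAML configuration; in a real workflow the block start would come from a similarity analysis rather than being chosen by hand.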