shavera committed · Commit 345aa79 · verified · 1 Parent(s): 3d05d7b

Update README.md

Files changed (1):
  1. README.md +4 -50
README.md CHANGED
@@ -13,56 +13,10 @@ tags:
  This is an int4_awq quantized checkpoint of bigcode/starcoder2-15b. It takes about 10GB of VRAM.

  ## Running this Model
- vLLM does not natively support autoawq currently (or any a4w8 as of writing this), so one can serve directly from the autoawq backend.
-
- Note: if you want to start this in a container, then:
- `docker run --gpus all -it --name=starcoder2-15b-int4-awq -p 8000:8000 -v ~/.cache:/root/.cache nvcr.io/nvidia/pytorch:24.12-py3 bash`
-
- `pip install fastapi[all] torch transformers autoawq`
-
- Then in python3:
+ Run via docker with text-generation-inference:
  ```
- from fastapi import FastAPI, HTTPException
- from pydantic import BaseModel
- import torch
- from awq import AutoAWQForCausalLM
- from transformers import AutoTokenizer
- import uvicorn
-
- # Define the FastAPI app
- app = FastAPI()
-
- # Define the request body model
- class TextRequest(BaseModel):
-     text: str
-
- # Load the quantized model and tokenizer
- model_path = '/root/.cache/huggingface/hub/models--shavera--starcoder2-15b-w4-autoawq-gemm/snapshots/13fab46ef237de327397549f427106890e0dec67'
- device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-
- model = AutoAWQForCausalLM.from_quantized(model_path, device_map="auto")
- tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
-
- # Ensure the model is in evaluation mode
- model.eval()
-
- # Create the inference function
- def generate_text(prompt: str) -> str:
-     inputs = tokenizer(prompt, return_tensors="pt").to(device)
-     outputs = model.generate(**inputs)
-     return tokenizer.decode(outputs[0], skip_special_tokens=True)
-
- # Define the API endpoint for text generation
- @app.post("/generate")
- async def generate(request: TextRequest):
-     try:
-         generated_text = generate_text(request.text)
-         return {"generated_text": generated_text}
-     except Exception as e:
-         raise HTTPException(status_code=500, detail=str(e))
-
- # Run the server (port 8000)
- if __name__ == "__main__":
-     uvicorn.run(app, host="0.0.0.0", port=8000)
+ docker run --gpus all --shm-size 64g -p 8080:80 -v ~/.cache/huggingface:/data \
+   ghcr.io/huggingface/text-generation-inference:3.1.0 \
+   --model-id shavera/starcoder2-15b-w4-autoawq-gemm
  ```
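
Once the text-generation-inference container is up, it serves a `/generate` HTTP endpoint that accepts a JSON body with `inputs` and optional `parameters`. A minimal client sketch (the prompt string and sampling parameters here are illustrative, and it assumes the container above is listening on localhost:8080):

```python
import json
import urllib.request

# TGI's /generate endpoint takes JSON with "inputs" and optional "parameters".
payload = {
    "inputs": "def fibonacci(n):",  # illustrative prompt
    "parameters": {"max_new_tokens": 64, "temperature": 0.2},
}
body = json.dumps(payload).encode()

# Uncomment to send the request once the container is running:
# req = urllib.request.Request(
#     "http://localhost:8080/generate",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])

print(body.decode())
```

The response is a JSON object whose `generated_text` field holds the completion.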