Thanks again! Here is my review for an 8x RTX 5090 setup.

#2 by crystech

On the RTX 5090 setup, I observed slower TPS with MTP enabled:
--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
Avg 25 TPS
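
For context, a minimal sketch of how these flags attach to a `vllm serve` launch; the model path, TP size, and port are placeholders, not my exact command:

```bash
# Hypothetical launch sketch -- model path and TP size are placeholders.
vllm serve /models/your-quantized-model \
    --tensor-parallel-size 8 \
    --speculative-config.method mtp \
    --speculative-config.num_speculative_tokens 1
# To disable MTP, drop the two --speculative-config.* flags.
```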

With the limited VRAM of 256 GB, I think RTX 5090 users should turn MTP off to free memory for KV cache and get better performance?
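
To make the KV-cache tradeoff concrete, here is a back-of-the-envelope for the per-token cache cost; the layer/head numbers below are hypothetical placeholders, not this model's actual config:

```bash
# Rough KV-cache cost per token: 2 (K and V) * layers * kv_heads * head_dim * dtype bytes.
# All numbers below are hypothetical placeholders, not this model's real config.
LAYERS=60; KV_HEADS=8; HEAD_DIM=128; DTYPE_BYTES=2   # fp16/bf16 cache
BYTES_PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES ))
echo "~${BYTES_PER_TOKEN} bytes/token of KV cache"   # 245760 bytes, ~0.23 MiB/token
# At that rate a 128K-token context would need roughly 30 GiB of cache.
```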

Once I disabled MTP I got 35-41 TPS (Ubuntu with GDM running; I'd expect a bit more TPS on a headless setup).

---- Without MTP enabled ----
(APIServer pid=21036) INFO: 127.0.0.1:50526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=21036) INFO 12-25 02:01:12 [loggers.py:257] Engine 000: Avg prompt throughput: 1383.7 tokens/s, Avg generation throughput: 13.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:22 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 0.0%

---- With MTP enabled ----
(APIServer pid=19990) INFO: Application startup complete.
(APIServer pid=19990) INFO: 127.0.0.1:35818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=19990) INFO 12-25 01:51:58 [loggers.py:257] Engine 000: Avg prompt throughput: 1260.1 tokens/s, Avg generation throughput: 13.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:51:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.89, Accepted throughput: 0.77 tokens/s, Drafted throughput: 0.86 tokens/s, Accepted: 62 tokens, Drafted: 70 tokens, Per-position acceptance rate: 0.886, Avg Draft acceptance rate: 88.6%
(APIServer pid=19990) INFO 12-25 01:52:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.6%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.30 tokens/s, Accepted: 119 tokens, Drafted: 153 tokens, Per-position acceptance rate: 0.778, Avg Draft acceptance rate: 77.8%
(APIServer pid=19990) INFO 12-25 01:52:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.20 tokens/s, Accepted: 119 tokens, Drafted: 152 tokens, Per-position acceptance rate: 0.783, Avg Draft acceptance rate: 78.3%
(APIServer pid=19990) INFO 12-25 01:52:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.79, Accepted throughput: 12.60 tokens/s, Drafted throughput: 15.90 tokens/s, Accepted: 126 tokens, Drafted: 159 tokens, Per-position acceptance rate: 0.792, Avg Draft acceptance rate: 79.2%
(APIServer pid=19990) INFO 12-25 01:52:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.00, Accepted throughput: 0.30 tokens/s, Drafted throughput: 0.30 tokens/s, Accepted: 3 tokens, Drafted: 3 tokens, Per-position acceptance rate: 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=19990) INFO 12-25 01:52:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Am I doing anything wrong? Just checking.

QuantTrio org

Hmm, I'd guess you would have to drop `export VLLM_USE_DEEP_GEMM=0` and give it a try. But I'm not sure.
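
In case it helps, that just means clearing the variable in the serving shell before relaunching, e.g.:

```bash
# Remove the override so vLLM falls back to its default DeepGEMM behavior,
# then relaunch the server from the same shell with the same flags as before.
unset VLLM_USE_DEEP_GEMM
```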

QuantTrio org

I ran a test on my 8x 4090 (48 GB) rig:

---- Without MTP enabled ----

(APIServer pid=143220) INFO 12-25 15:24:40 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=143220) INFO 12-25 15:24:50 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=143220) INFO 12-25 15:25:00 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%

---- With MTP enabled ----

(APIServer pid=136891) INFO 12-25 15:17:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:17:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.66, Accepted throughput: 29.30 tokens/s, Drafted throughput: 44.39 tokens/s, Accepted: 293 tokens, Drafted: 444 tokens, Per-position acceptance rate: 0.660, Avg Draft acceptance rate: 66.0%
(APIServer pid=136891) INFO 12-25 15:18:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 66.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:18:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.63, Accepted throughput: 25.80 tokens/s, Drafted throughput: 40.80 tokens/s, Accepted: 258 tokens, Drafted: 408 tokens, Per-position acceptance rate: 0.632, Avg Draft acceptance rate: 63.2%
(APIServer pid=136891) INFO 12-25 15:18:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:18:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 24.60 tokens/s, Drafted throughput: 38.10 tokens/s, Accepted: 246 tokens, Drafted: 381 tokens, Per-position acceptance rate: 0.646, Avg Draft acceptance rate: 64.6%
(APIServer pid=136891) INFO 12-25 15:18:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%

My results look reasonable, for reference.
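
As a rough sanity check on those numbers, plain arithmetic on the log values above gives the observed MTP speedup and the ceiling implied by the mean acceptance length:

```bash
# Implied MTP speedup from the 8x4090 logs above, plus the rough upper bound
# set by the mean acceptance length (tokens emitted per decode step).
awk 'BEGIN {
  base = 53.9; mtp = 73.7; accept_len = 1.66   # values taken from the logs
  printf "observed speedup:          %.2fx\n", mtp / base     # ~1.37x
  printf "acceptance-length ceiling: %.2fx\n", accept_len
}'
```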

Thanks, tried it and it's still worse. I'll disable MTP on my setup; not sure what's causing the slowdown.

> Thanks, tried it and it's still worse. I'll disable MTP on my setup; not sure what's causing the slowdown.

I had the same issue with speculative decoding on GLM-4.5V-FP8 (the official quant) on an RTX Pro 6000 (https://github.com/vllm-project/vllm/issues/26838#issuecomment-3563172299), so I'm not sure whether it's the quant itself that's problematic.

Edit: Ah, but in this specific unofficial-quant case, the MTP layer would predict what the unquantized or FP8 model would output; it would need to be retrained for the quant, which is obviously quite complex. So for any quant, the MTP layer should probably be stripped, or mispredictions would lead to more work.

QuantTrio org

> Thanks, tried it and it's still worse. I'll disable MTP on my setup; not sure what's causing the slowdown.

In your case, the most prominent issue is that your 8x5090 runs slower than my 8x4090 even without MTP enabled...

My rig does 53.9 tokens/s, but yours only 40.0 tokens/s.

QuantTrio org

> Thanks, tried it and it's still worse. I'll disable MTP on my setup; not sure what's causing the slowdown.

> I had the same issue with speculative decoding on GLM-4.5V-FP8 (the official quant) on an RTX Pro 6000 (https://github.com/vllm-project/vllm/issues/26838#issuecomment-3563172299), so I'm not sure whether it's the quant itself that's problematic.
>
> Edit: Ah, but in this specific unofficial-quant case, the MTP layer would predict what the unquantized or FP8 model would output; it would need to be retrained for the quant, which is obviously quite complex. So for any quant, the MTP layer should probably be stripped, or mispredictions would lead to more work.

"mispredictions would lead to more work." at the least, this is not the case, as you can see from Avg Draft acceptance rate, which is comparable to that of the bf16 version.

"mispredictions would lead to more work." at the least, this is not the case, as you can see from Avg Draft acceptance rate, which is comparable to that of the bf16 version.

Ah, good catch.

> I ran a test on my 8x 4090 (48 GB) rig:

Will I be able to run it on a 2x RTX Pro 6000 (192 GB total VRAM) rig?

How do I quantize it to fit in that amount of VRAM?
