JohannesGaessler
CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 (llama/7681)
d4c0faf