Commit History
CUDA: skip masked KV slices for all FA kernels (llama/14924)
0c60f80
HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (llama/14945)
e37eff3
CUDA: fix overflow in FA, tune performance (llama/14840)
10ac92f
CUDA: fix quantized KV cache + multiple sequences (llama/14822)
88864af
llama : add high-throughput mode (llama/14363)
b2d73a2
CUDA: broadcasting for FlashAttention mask (llama/14500)
47e02a8
CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)
d9bd7ce
CUDA: faster Deepseek FA, add Turing support (llama/13435)
ace16dc
CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)
507d30c
CUDA: fix bad asserts for partial offload (llama/13337)
23e676b
musa: fix compilation warnings in mp_22/31 (llama/12780)
090ad80
R0CKSTAR
committed on
musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (llama/12611)
12bb60d
R0CKSTAR
committed on
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
3a7ca19
CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315)
2adc060
uvos
committed on
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)
a027c1d
David Huang
committed on
CUDA: optimize FA for GQA + large batches (llama/12014)
6662d54
CUDA: use async data loading for FlashAttention (llama/11894)
5b9980d
HIP: fix flash_attn_stream_k_fixup warning (llama/11604)
acfd94f
CUDA: use mma PTX instructions for FlashAttention (llama/11583)
f328957
ggml : build backends as libraries (llama/10256)
3dc93f3
CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)
fb8ae8b
ggml : reduce hash table reset cost (llama/8698)
9808fbf
slaren
committed on