Commit History
CUDA: skip masked KV slices for all FA kernels (llama/14924)
0c60f80
HIP: remove the use of __HIP_PLATFORM_AMD__, explicitly support only AMD targets (llama/14945)
e37eff3
CUDA: fix overflow in FA, tune performance (llama/14840)
10ac92f
CUDA: fix quantized KV cache + multiple sequences (llama/14822)
88864af
llama : add high-throughput mode (llama/14363)
b2d73a2
CUDA: broadcasting for FlashAttention mask (llama/14500)
47e02a8
CUDA: fix FA tg at long context for CC >= 8.9 (llama/13852)
d9bd7ce
CUDA: faster Deepseek FA, add Turing support (llama/13435)
ace16dc
CUDA: FA support for Deepseek (Ampere or newer) (llama/13306)
507d30c
CUDA: fix bad asserts for partial offload (llama/13337)
23e676b
musa: fix compilation warnings in mp_22/31 (llama/12780)
090ad80
R0CKSTAR
committed on
musa: fix all warnings, re-enable `-DLLAMA_FATAL_WARNINGS=ON` in ci and update doc (llama/12611)
12bb60d
R0CKSTAR
committed on
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (llama/12183)
3a7ca19
CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (llama/12315)
2adc060
uvos
committed on
HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (llama/12032)
a027c1d
David Huang
committed on
CUDA: optimize FA for GQA + large batches (llama/12014)
6662d54
CUDA: use async data loading for FlashAttention (llama/11894)
5b9980d
HIP: fix flash_attn_stream_k_fixup warning (llama/11604)
acfd94f
CUDA: use mma PTX instructions for FlashAttention (llama/11583)
f328957
ggml : build backends as libraries (llama/10256)
3dc93f3
CPU/CUDA: Gemma 2 FlashAttention support (llama/8542)
fb8ae8b
ggml : reduce hash table reset cost (llama/8698)
9808fbf
slaren
committed on