RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling Paper • 2507.04416 • Published Jul 6, 2025
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference Paper • 2602.18196 • Published 26 days ago
Lost in Backpropagation: The LM Head is a Gradient Bottleneck Paper • 2603.10145 • Published 8 days ago