---
library_name: transformers
tags:
- pytorch
- neollm
- hybrid-attention
- fanformer
- gated-delta-networks
- polynomial-activations
- fineweb-edu
- ademamix
- custom-scheduler
- flash-attention
- torch-compile
pipeline_tag: text-generation
model-index:
- name: NeoLLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: ARC-Easy
    metrics:
    - type: accuracy
      value: 39.14
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: HellaSwag
    metrics:
    - type: accuracy
      value: 26.55
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: MMLU
    metrics:
    - type: accuracy
      value: 24.25
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: ARC-Challenge
    metrics:
    - type: accuracy
      value: 17.24
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
---

# NeoLLM

NeoLLM is a hybrid-architecture language model that combines multiple state-of-the-art techniques for efficient and effective language modeling. This 110M-parameter model demonstrates novel architectural innovations, including Fourier Analysis Networks, hybrid attention mechanisms, and advanced normalization techniques.
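As a quick orientation, the hybrid attention layout used by this model (one full-attention layer per three linear-attention layers; see the Layer Pattern section below) can be sketched in a few lines. This is an illustrative sketch, not code from the NeoLLM repository, and the function name is an assumption:

```python
# Illustrative sketch (not from the NeoLLM source): with 1 full-attention
# layer per 3 linear-attention layers, every 4th layer (1-indexed) uses
# full attention and the rest use Gated Delta Network linear attention.

def attention_pattern(num_layers: int = 12, full_every: int = 4) -> list:
    """Return the attention type of each layer, 1-indexed."""
    return [
        "full_attention" if layer % full_every == 0 else "linear_attention"
        for layer in range(1, num_layers + 1)
    ]

pattern = attention_pattern()
full_layers = [i + 1 for i, t in enumerate(pattern) if t == "full_attention"]
print(full_layers)  # [4, 8, 12]
```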
## Model Description

NeoLLM incorporates several cutting-edge components:

- **FANformer Integration**: Fourier Analysis Network (FAN) layers for effective periodicity modeling, with a `fan_ratio` of 0.125
- **Hybrid Attention Architecture**: follows Qwen3-Next's approach of 1 full-attention layer per 3 linear-attention layers
- **Polynomial Composition Activations**: PolyNorm activation functions in the MLP layers for enhanced dynamics
- **Advanced Normalization**: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
- **Efficient Linear Attention**: Gated Delta Networks for improved computational efficiency

### Architecture Details

- **Model Size**: 110M parameters (77M embedding + 33M non-embedding)
- **Hidden Size**: 512
- **Layers**: 12, with a hybrid attention pattern
- **Attention Heads**: 8 (2 KV heads via Grouped Query Attention)
- **Intermediate Size**: 1024
- **Sequence Length**: 512 tokens
- **Vocabulary**: 151,665 tokens (Qwen3 tokenizer)

### Layer Pattern

The model uses a hybrid attention pattern where layers alternate between:

- **Linear Attention**: layers 1, 2, 3, 5, 6, 7, 9, 10, 11 (Gated Delta Networks)
- **Full Attention**: layers 4, 8, 12 (Flash Attention 2)

## Training Details

### Dataset

- **Source**: FineWeb-Edu (sample-10BT subset)
- **Training Samples**: 4 million examples
- **Validation Split**: 1% (40,000 samples)
- **Text Processing**: dynamic truncation to 4x `block_size` during tokenization
- **Tokenizer**: Qwen3 fast tokenizer, with weight tying enabled

### Training Configuration

- **Hardware**: NVIDIA RTX 5090
- **Training Time**: 3 hours
- **Loss Function**: Cut Cross-Entropy (from "Cut Your Losses in Large-Vocabulary Language Models"), not standard cross-entropy
- **Optimizer**: AdEMAMix with:
  - Betas: (0.9, 0.999, 0.999)
  - Alpha: 5.0
  - t_alpha: 5000, t_beta3: 5000
  - Weight decay: 0.1
- **Learning Rate Schedule**: custom cosine with linear warmup
  - Start LR: 3e-4
  - Peak LR: 6e-4 (reached after 5,000 warmup steps)
  - Min LR: 6e-5
- **Batch Size**: 64 per device
- **Precision**: BF16 with torch.compile optimization
- **Hardware Optimizations**: Flash Attention 2
- **Epochs**: 1

### Framework Versions

- **PyTorch**: 2.8.0+cu129
- **Transformers**: 4.57.0.dev0
- **Flash Attention**: 2.x
- **CUDA**: 12.9

## Evaluation Results

### Benchmark Performance (1-shot evaluation)

| Task | Score |
|------|-------|
| ARC-Easy | 39.14% |
| HellaSwag | 26.55% |
| MMLU | 24.25% |
| ARC-Challenge | 17.24% |

*All evaluations were performed in a 1-shot setting.*

## Model Architecture Components

### Fourier Analysis Network (FANLayer)

Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":

```
FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W_p X + B_p)]
```

### LayerNorm Scaling (LNS)

Implements a 1/√ℓ scaling factor, as described in "The Curse of Depth in Large Language Models":

```
h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)
```

### Gradient-Preserving Activation Scaling (GPAS)

Scales activations without attenuating gradients, using stop-gradient operations.

### Polynomial Composition Activations (PolyNorm)

Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".

### Gated Delta Networks

Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
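The FANLayer formula above can be sketched as a small PyTorch module. This is a minimal illustration, not the released implementation: the class/attribute names and the exact branch-width split (each periodic branch taking a `fan_ratio` fraction of the output width) are assumptions.

```python
import torch
import torch.nn as nn

class FANLayer(nn.Module):
    """Sketch of FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W_p X + B_p)].

    With fan_ratio r, a fraction r of the output width goes to each of the
    cos/sin branches; the remaining width is a plain affine projection.
    """

    def __init__(self, dim_in: int, dim_out: int, fan_ratio: float = 0.125):
        super().__init__()
        p_dim = int(dim_out * fan_ratio)  # width of each periodic branch
        g_dim = dim_out - 2 * p_dim       # width of the affine branch
        self.p_proj = nn.Linear(dim_in, p_dim, bias=False)  # W_p
        self.g_proj = nn.Linear(dim_in, g_dim)              # W_p, B_p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.p_proj(x)
        return torch.cat([torch.cos(p), torch.sin(p), self.g_proj(x)], dim=-1)

layer = FANLayer(512, 512)                 # matches the 512 hidden size above
out = layer(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```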
## Intended Uses & Limitations

### Intended Uses

- Research into hybrid attention architectures
- Educational purposes: understanding advanced LLM components
- Small-scale language modeling experiments
- Benchmarking novel architectural components

### Limitations

- The relatively small model size (110M parameters) limits capability compared to larger models
- Training was limited to 4M samples from a single dataset
- Performance is below state-of-the-art models on standard benchmarks
- The experimental architecture may have stability considerations in production

### Recommendations

- Best suited for research and educational applications
- Consider fine-tuning for specific downstream tasks
- Monitor performance carefully if adapting for production use

## Training Infrastructure

- **Mixed Precision**: BF16 for numerical stability
- **Compilation**: torch.compile with max-autotune mode
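The learning rate schedule listed under Training Configuration (linear warmup from 3e-4 to 6e-4 over 5,000 steps, then cosine decay to 6e-5) can be sketched as follows. The function name and the total-step count are illustrative assumptions, not values from the training script:

```python
import math

# Sketch of the schedule in Training Configuration above: linear warmup
# from start_lr to peak_lr over warmup_steps, then cosine decay from
# peak_lr down to min_lr over the remaining steps.

def lr_at(step: int, total_steps: int, warmup_steps: int = 5000,
          start_lr: float = 3e-4, peak_lr: float = 6e-4,
          min_lr: float = 6e-5) -> float:
    if step < warmup_steps:  # linear warmup
        return start_lr + (peak_lr - start_lr) * (step / warmup_steps)
    # cosine decay from peak_lr to min_lr
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * frac))

total = 60_000  # assumed total optimizer steps, for illustration only
print(lr_at(0, total))      # 0.0003 (start LR)
print(lr_at(5000, total))   # peak LR (~6e-4)
print(lr_at(total, total))  # min LR (~6e-5)
```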