---
library_name: transformers
tags:
- pytorch
- neollm
- hybrid-attention
- fanformer
- gated-delta-networks
- polynomial-activations
- fineweb-edu
- ademamix
- custom-scheduler
- flash-attention
- torch-compile
pipeline_tag: text-generation
model-index:
- name: NeoLLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: ARC-Easy
metrics:
- type: accuracy
value: 39.14
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: HellaSwag
metrics:
- type: accuracy
value: 26.55
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: MMLU
metrics:
- type: accuracy
value: 24.25
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: ARC-Challenge
metrics:
- type: accuracy
value: 17.24
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
---

# NeoLLM

NeoLLM is a hybrid-architecture language model that combines multiple state-of-the-art techniques for efficient and effective language modeling. This 110M-parameter model demonstrates architectural innovations including Fourier Analysis Networks, a hybrid attention mechanism, and advanced normalization techniques.
## Model Description
NeoLLM incorporates several cutting-edge components:
- FANformer Integration: Fourier Analysis Network (FAN) layers for effective periodicity modeling, with a fan_ratio of 0.125
- Hybrid Attention Architecture: follows Qwen3-Next's approach of one full-attention layer per three linear-attention layers
- Polynomial Composition Activations: PolyNorm activation functions in the MLP layers for richer activation dynamics
- Advanced Normalization: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
- Efficient Linear Attention: Gated Delta Networks for improved computational efficiency
## Architecture Details
- Model Size: 110M parameters (77M embeddings + 33M non-embeddings)
- Hidden Size: 512
- Layers: 12 layers with hybrid attention pattern
- Attention Heads: 8 (with 2 KV heads using Grouped Query Attention)
- Intermediate Size: 1024
- Sequence Length: 512 tokens
- Vocabulary: 151,665 tokens (Qwen3 tokenizer)
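The hyperparameters above can be summarized in a small sketch with their derived quantities (head dimension, GQA group size); the dict keys below are illustrative, not the model's actual config class:

```python
# Illustrative hyperparameter summary (key names are hypothetical,
# not the actual NeoLLM config class).
config = {
    "hidden_size": 512,
    "num_layers": 12,
    "num_attention_heads": 8,
    "num_kv_heads": 2,        # Grouped Query Attention
    "intermediate_size": 1024,
    "max_seq_len": 512,
    "vocab_size": 151_665,
    "fan_ratio": 0.125,
}

head_dim = config["hidden_size"] // config["num_attention_heads"]     # 512 / 8 = 64
gqa_groups = config["num_attention_heads"] // config["num_kv_heads"]  # 4 query heads share each KV head
```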
### Layer Pattern
The model repeats a 3:1 pattern of linear and full attention across its 12 layers:
- Linear Attention: layers 1–3, 5–7, 9–11 (Gated Delta Networks)
- Full Attention: layers 4, 8, 12 (Flash Attention 2)
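The 3:1 pattern above reduces to a simple rule: every fourth layer uses full attention. A small helper (hypothetical name, not from the released code) makes this concrete:

```python
def attention_kind(layer_idx: int) -> str:
    """Return the attention type for a 1-indexed layer in the 3:1 hybrid pattern:
    every fourth layer is full attention, the rest are linear attention."""
    return "full" if layer_idx % 4 == 0 else "linear"

pattern = [attention_kind(i) for i in range(1, 13)]
# layers 4, 8, 12 -> "full"; all others -> "linear"
```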
## Training Details

### Dataset
- Source: FineWeb-Edu (sample-10BT subset)
- Training Samples: 4 million examples
- Validation Split: 1% (40,000 samples)
- Text Processing: Dynamic truncation to 4x block_size during tokenization
- Tokenizer: Qwen3 fast tokenizer (input/output embedding weight tying enabled in the model)
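The card's "dynamic truncation to 4x block_size" is not spelled out; one plausible reading — capping each tokenized example at 4 × block_size tokens before packing into training blocks — can be sketched as follows (function name is hypothetical):

```python
BLOCK_SIZE = 512
MAX_TOKENS = 4 * BLOCK_SIZE  # 2048

def truncate_ids(token_ids, max_tokens=MAX_TOKENS):
    """Cap a tokenized example at 4x block_size. This is one reading of the
    card's 'dynamic truncation'; the actual preprocessing may differ."""
    return token_ids[:max_tokens]
```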
### Training Configuration
- Hardware: NVIDIA RTX 5090
- Training Time: 3 hours
- Loss Function: Cut Cross-Entropy (CCE), from "Cut Your Losses in Large-Vocabulary Language Models" — a memory-efficient cross-entropy that avoids materializing the full logit matrix, rather than the standard implementation
- Optimizer: AdEMAMix with parameters:
  - Betas: (0.9, 0.999, 0.999)
  - Alpha: 5.0
  - t_alpha: 5000, t_beta3: 5000
  - Weight decay: 0.1
- Learning Rate Schedule: Custom cosine with linear warmup
  - Start LR: 3e-4
  - Peak LR: 6e-4 (at 5000 warmup steps)
  - Min LR: 6e-5
- Batch Size: 64 per device
- Precision: BF16 with torch.compile optimization
- Hardware Optimizations: Flash Attention 2
- Epochs: 1
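The learning-rate schedule above (linear warmup from 3e-4 to 6e-4 over 5000 steps, then cosine decay to 6e-5) can be sketched as follows; the exact scheduler code is not published, so this is a reconstruction under the stated assumptions:

```python
import math

START_LR, PEAK_LR, MIN_LR = 3e-4, 6e-4, 6e-5
WARMUP_STEPS = 5_000

def lr_at(step: int, total_steps: int) -> float:
    """Custom cosine schedule with linear warmup (sketch of the card's
    description; hypothetical reconstruction, not the released code)."""
    if step < WARMUP_STEPS:
        # linear warmup from START_LR to PEAK_LR
        return START_LR + (PEAK_LR - START_LR) * step / WARMUP_STEPS
    # cosine decay from PEAK_LR down to MIN_LR
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```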
### Framework Versions
- PyTorch: 2.8.0+cu129
- Transformers: 4.57.0.dev0
- Flash Attention: 2.x
- CUDA: 12.9
## Evaluation Results

### Benchmark Performance (1-shot evaluation)
| Task | Score |
|---|---|
| ARC-Easy | 39.14% |
| HellaSwag | 26.55% |
| MMLU | 24.25% |
| ARC-Challenge | 17.24% |
All evaluations were performed in a 1-shot setting.
## Model Architecture Components

### Fourier Analysis Network (FANLayer)

Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":

`FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W̄_p X + B̄_p)]`

where `W_p` projects into the periodic (cos/sin) component and `W̄_p`, `B̄_p` form the ordinary affine component.
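A numpy sketch of this split, assuming fan_ratio allocates `d_p = fan_ratio × d_out` dimensions to each of the cos and sin parts (weight names and the exact dimension split are illustrative; the released code may differ):

```python
import numpy as np

def fan_layer(X, W_p, W_g, b_g):
    """FANLayer' sketch: periodic features via cos/sin of W_p X, concatenated
    with an ordinary affine part (no activation, per the FANformer variant)."""
    p = X @ W_p
    return np.concatenate([np.cos(p), np.sin(p), X @ W_g + b_g], axis=-1)

rng = np.random.default_rng(0)
d_in, d_out, fan_ratio = 512, 512, 0.125
d_p = int(fan_ratio * d_out)                 # 64 dims each for cos and sin (assumed split)
W_p = rng.normal(size=(d_in, d_p)) * 0.02
W_g = rng.normal(size=(d_in, d_out - 2 * d_p)) * 0.02
b_g = np.zeros(d_out - 2 * d_p)
out = fan_layer(rng.normal(size=(4, d_in)), W_p, W_g, b_g)
```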
### LayerNorm Scaling (LNS)

Implements the scaling factor 1/√ℓ, where ℓ is the 1-indexed layer depth, as described in "The Curse of Depth in Large Language Models":

`h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)`
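The per-layer factor is straightforward to compute; a minimal sketch for this 12-layer model:

```python
import math

def lns_scale(layer_depth: int) -> float:
    """LayerNorm Scaling factor 1/sqrt(l) for 1-indexed layer depth l."""
    return 1.0 / math.sqrt(layer_depth)

# deeper layers get progressively smaller scaling factors
scales = [lns_scale(l) for l in range(1, 13)]
```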
### Gradient-Preserving Activation Scaling (GPAS)

Scales activations in the forward pass while placing stop-gradient operations so that the backward pass is left unpenalized.
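One common straight-through way to get "scaled forward, identity backward" can be sketched as follows; `sg` stands in for a stop-gradient such as torch's `.detach()` (numerically an identity), and the GPAS paper's exact gating is not reproduced here:

```python
def sg(x):
    """Stand-in for a stop-gradient (e.g. torch's .detach()); numerically identity."""
    return x

def gpas_forward(x, s):
    """Straight-through scaling sketch: the forward value equals s * x, while
    the stop-gradient placement would make the backward pass an identity
    (d out / d x = 1). Assumption-laden; not the paper's exact formulation."""
    return s * sg(x) + x - sg(x)
```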
### Polynomial Composition Activations (PolyNorm)
Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".
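A numpy sketch of a PolyNorm-style activation — a weighted sum of normalized elementwise powers of the input. The coefficients here are illustrative placeholders (in the model they would be learned), and the paper's exact normalization may differ:

```python
import numpy as np

def polynorm(x, weights=(1.0, 0.5, 0.25), bias=0.0, eps=1e-6):
    """PolyNorm-style sketch: sum of L2-normalized elementwise powers x^i,
    each scaled by a (here hard-coded, normally learned) coefficient."""
    out = np.full_like(x, bias, dtype=float)
    for i, w in enumerate(weights, start=1):
        p = x ** i
        out += w * p / (np.linalg.norm(p, axis=-1, keepdims=True) + eps)
    return out

y = polynorm(np.ones((2, 4)))
```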
### Gated Delta Networks
Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
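A single recurrent step of a gated delta rule can be sketched in numpy: decay the state by a gate `alpha`, apply a `beta`-weighted delta-rule update toward the new key/value pair, then read out with the query. Signs and ordering follow one common formulation and are not verified against the released kernels:

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One gated delta-rule step (sketch after Gated DeltaNet).
    Shapes: S (d_v, d_k); k, q (d_k,); v (d_v,).
    S_t = alpha * S_{t-1}(I - beta * k k^T) + beta * v k^T ; output o_t = S_t q."""
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q

# With an empty state, alpha = beta = 1, and q = k (unit norm),
# the readout recovers the stored value exactly.
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 2.0, 3.0, 4.0])
S1, o = gated_delta_step(np.zeros((4, 4)), k, v, q=k, alpha=1.0, beta=1.0)
```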
## Intended Uses & Limitations

### Intended Uses
- Research into hybrid attention architectures
- Educational purposes for understanding advanced LLM components
- Small-scale language modeling experiments
- Benchmarking novel architectural components
### Limitations
- Relatively small model size (110M parameters) limits capability compared to larger models
- Training limited to 4M samples from single dataset
- Performance below state-of-the-art models on standard benchmarks
- Experimental architecture; stability in production settings has not been validated
### Recommendations
- Best suited for research and educational applications
- Consider fine-tuning for specific downstream tasks
- Monitor performance carefully if adapting for production use
## Training Infrastructure
- Mixed Precision: BF16 for numerical stability
- Compilation: torch.compile with max-autotune mode