---
library_name: transformers
tags:
- pytorch
- neollm
- hybrid-attention
- fanformer
- gated-delta-networks
- polynomial-activations
- fineweb-edu
- ademamix
- custom-scheduler
- flash-attention
- torch-compile
pipeline_tag: text-generation
model-index:
- name: NeoLLM
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: ARC-Easy
    metrics:
    - type: accuracy
      value: 39.14
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: HellaSwag
    metrics:
    - type: accuracy
      value: 26.55
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: MMLU
    metrics:
    - type: accuracy
      value: 24.25
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      type: multiple-choice
      name: ARC-Challenge
    metrics:
    - type: accuracy
      value: 17.24
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
---

# NeoLLM

NeoLLM is a hybrid-architecture language model that combines multiple state-of-the-art techniques for efficient and effective language modeling. This 110M-parameter model demonstrates novel architectural innovations, including Fourier Analysis Networks, hybrid attention mechanisms, and advanced normalization techniques.
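As a quick orientation, the hybrid attention layout used by this model (one full-attention layer per three linear-attention layers; see the Layer Pattern section below) can be sketched in a few lines. This is an illustrative sketch, not code from the NeoLLM repository, and the function name is an assumption:

```python
# Illustrative sketch (not from the NeoLLM source): with 1 full-attention
# layer per 3 linear-attention layers, every 4th layer (1-indexed) uses
# full attention and the rest use Gated Delta Network linear attention.

def attention_pattern(num_layers: int = 12, full_every: int = 4) -> list:
    """Return the attention type of each layer, 1-indexed."""
    return [
        "full_attention" if layer % full_every == 0 else "linear_attention"
        for layer in range(1, num_layers + 1)
    ]

pattern = attention_pattern()
full_layers = [i + 1 for i, t in enumerate(pattern) if t == "full_attention"]
print(full_layers)  # [4, 8, 12]
```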
## Model Description

NeoLLM incorporates several cutting-edge components:

- **FANformer Integration**: Fourier Analysis Network (FAN) layers for effective periodicity modeling, with a `fan_ratio` of 0.125
- **Hybrid Attention Architecture**: follows Qwen3-Next's approach of 1 full-attention layer per 3 linear-attention layers
- **Polynomial Composition Activations**: PolyNorm activation functions in the MLP layers for enhanced dynamics
- **Advanced Normalization**: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
- **Efficient Linear Attention**: Gated Delta Networks for improved computational efficiency

### Architecture Details

- **Model Size**: 110M parameters (77M embedding + 33M non-embedding)
- **Hidden Size**: 512
- **Layers**: 12, with a hybrid attention pattern
- **Attention Heads**: 8 (2 KV heads via Grouped Query Attention)
- **Intermediate Size**: 1024
- **Sequence Length**: 512 tokens
- **Vocabulary**: 151,665 tokens (Qwen3 tokenizer)

### Layer Pattern

The model uses a hybrid attention pattern where layers alternate between:

- **Linear Attention**: layers 1, 2, 3, 5, 6, 7, 9, 10, 11 (Gated Delta Networks)
- **Full Attention**: layers 4, 8, 12 (Flash Attention 2)

## Training Details

### Dataset

- **Source**: FineWeb-Edu (sample-10BT subset)
- **Training Samples**: 4 million examples
- **Validation Split**: 1% (40,000 samples)
- **Text Processing**: dynamic truncation to 4x `block_size` during tokenization
- **Tokenizer**: Qwen3 fast tokenizer, with weight tying enabled

### Training Configuration

- **Hardware**: NVIDIA RTX 5090
- **Training Time**: 3 hours
- **Loss Function**: Cut Cross-Entropy (from "Cut Your Losses in Large-Vocabulary Language Models"), not standard cross-entropy
- **Optimizer**: AdEMAMix with:
  - Betas: (0.9, 0.999, 0.999)
  - Alpha: 5.0
  - t_alpha: 5000, t_beta3: 5000
  - Weight decay: 0.1
- **Learning Rate Schedule**: custom cosine with linear warmup
  - Start LR: 3e-4
  - Peak LR: 6e-4 (reached after 5,000 warmup steps)
  - Min LR: 6e-5
- **Batch Size**: 64 per device
- **Precision**: BF16 with torch.compile optimization
- **Hardware Optimizations**: Flash Attention 2
- **Epochs**: 1

### Framework Versions

- **PyTorch**: 2.8.0+cu129
- **Transformers**: 4.57.0.dev0
- **Flash Attention**: 2.x
- **CUDA**: 12.9

## Evaluation Results

### Benchmark Performance (1-shot evaluation)

| Task | Score |
|------|-------|
| ARC-Easy | 39.14% |
| HellaSwag | 26.55% |
| MMLU | 24.25% |
| ARC-Challenge | 17.24% |

*All evaluations were performed in a 1-shot setting.*

## Model Architecture Components

### Fourier Analysis Network (FANLayer)

Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":

```
FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W_p X + B_p)]
```

### LayerNorm Scaling (LNS)

Implements a 1/√ℓ scaling factor, as described in "The Curse of Depth in Large Language Models":

```
h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)
```

### Gradient-Preserving Activation Scaling (GPAS)

Scales activations without attenuating gradients, using stop-gradient operations.

### Polynomial Composition Activations (PolyNorm)

Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".

### Gated Delta Networks

Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
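The FANLayer formula above can be sketched as a small PyTorch module. This is a minimal illustration, not the released implementation: the class/attribute names and the exact branch-width split (each periodic branch taking a `fan_ratio` fraction of the output width) are assumptions.

```python
import torch
import torch.nn as nn

class FANLayer(nn.Module):
    """Sketch of FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W_p X + B_p)].

    With fan_ratio r, a fraction r of the output width goes to each of the
    cos/sin branches; the remaining width is a plain affine projection.
    """

    def __init__(self, dim_in: int, dim_out: int, fan_ratio: float = 0.125):
        super().__init__()
        p_dim = int(dim_out * fan_ratio)  # width of each periodic branch
        g_dim = dim_out - 2 * p_dim       # width of the affine branch
        self.p_proj = nn.Linear(dim_in, p_dim, bias=False)  # W_p
        self.g_proj = nn.Linear(dim_in, g_dim)              # W_p, B_p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.p_proj(x)
        return torch.cat([torch.cos(p), torch.sin(p), self.g_proj(x)], dim=-1)

layer = FANLayer(512, 512)                 # matches the 512 hidden size above
out = layer(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```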
## Intended Uses & Limitations

### Intended Uses

- Research into hybrid attention architectures
- Educational purposes: understanding advanced LLM components
- Small-scale language modeling experiments
- Benchmarking novel architectural components

### Limitations

- The relatively small model size (110M parameters) limits capability compared to larger models
- Training was limited to 4M samples from a single dataset
- Performance is below state-of-the-art models on standard benchmarks
- The experimental architecture may have stability considerations in production

### Recommendations

- Best suited for research and educational applications
- Consider fine-tuning for specific downstream tasks
- Monitor performance carefully if adapting for production use

## Training Infrastructure

- **Mixed Precision**: BF16 for numerical stability
- **Compilation**: torch.compile with max-autotune mode
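The learning rate schedule listed under Training Configuration (linear warmup from 3e-4 to 6e-4 over 5,000 steps, then cosine decay to 6e-5) can be sketched as follows. The function name and the total-step count are illustrative assumptions, not values from the training script:

```python
import math

# Sketch of the schedule in Training Configuration above: linear warmup
# from start_lr to peak_lr over warmup_steps, then cosine decay from
# peak_lr down to min_lr over the remaining steps.

def lr_at(step: int, total_steps: int, warmup_steps: int = 5000,
          start_lr: float = 3e-4, peak_lr: float = 6e-4,
          min_lr: float = 6e-5) -> float:
    if step < warmup_steps:  # linear warmup
        return start_lr + (peak_lr - start_lr) * (step / warmup_steps)
    # cosine decay from peak_lr to min_lr
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * frac))

total = 60_000  # assumed total optimizer steps, for illustration only
print(lr_at(0, total))      # 0.0003 (start LR)
print(lr_at(5000, total))   # peak LR (~6e-4)
print(lr_at(total, total))  # min LR (~6e-5)
```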