---
library_name: transformers
tags:
- pytorch
- neollm
- hybrid-attention
- fanformer
- gated-delta-networks
- polynomial-activations
- fineweb-edu
- ademamix
- custom-scheduler
- flash-attention
- torch-compile
pipeline_tag: text-generation
model-index:
- name: NeoLLM
results:
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: ARC-Easy
metrics:
- type: accuracy
value: 39.14
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: HellaSwag
metrics:
- type: accuracy
value: 26.55
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: MMLU
metrics:
- type: accuracy
value: 24.25
- task:
type: text-generation
name: Text Generation
dataset:
type: multiple-choice
name: ARC-Challenge
metrics:
- type: accuracy
value: 17.24
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
---

# NeoLLM

NeoLLM is a hybrid-architecture language model that combines multiple state-of-the-art techniques for efficient and effective language modeling. This 110M-parameter model demonstrates architectural innovations including Fourier Analysis Networks, a hybrid attention mechanism, and advanced normalization techniques.
## Model Description
NeoLLM incorporates several cutting-edge components:
- FANformer Integration: Fourier Analysis Network (FAN) layers for effective periodicity modeling, with a fan_ratio of 0.125
- Hybrid Attention Architecture: follows Qwen3-Next's approach of one full-attention layer per three linear-attention layers
- Polynomial Composition Activations: PolyNorm activation functions in the MLP layers for richer activation dynamics
- Advanced Normalization: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
- Efficient Linear Attention: Gated Delta Networks for improved computational efficiency
## Architecture Details
- Model Size: 110M parameters (77M embeddings + 33M non-embeddings)
- Hidden Size: 512
- Layers: 12 layers with hybrid attention pattern
- Attention Heads: 8 (with 2 KV heads using Grouped Query Attention)
- Intermediate Size: 1024
- Sequence Length: 512 tokens
- Vocabulary: 151,665 tokens (Qwen3 tokenizer)
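The hyperparameters above can be summarized in a small sketch with their derived quantities (head dimension, GQA group size); the dict keys below are illustrative, not the model's actual config class:

```python
# Illustrative hyperparameter summary (key names are hypothetical,
# not the actual NeoLLM config class).
config = {
    "hidden_size": 512,
    "num_layers": 12,
    "num_attention_heads": 8,
    "num_kv_heads": 2,        # Grouped Query Attention
    "intermediate_size": 1024,
    "max_seq_len": 512,
    "vocab_size": 151_665,
    "fan_ratio": 0.125,
}

head_dim = config["hidden_size"] // config["num_attention_heads"]     # 512 / 8 = 64
gqa_groups = config["num_attention_heads"] // config["num_kv_heads"]  # 4 query heads share each KV head
```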
### Layer Pattern
The model repeats a 3:1 pattern of linear and full attention across its 12 layers:
- Linear Attention: layers 1–3, 5–7, 9–11 (Gated Delta Networks)
- Full Attention: layers 4, 8, 12 (Flash Attention 2)
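The 3:1 pattern above reduces to a simple rule: every fourth layer uses full attention. A small helper (hypothetical name, not from the released code) makes this concrete:

```python
def attention_kind(layer_idx: int) -> str:
    """Return the attention type for a 1-indexed layer in the 3:1 hybrid pattern:
    every fourth layer is full attention, the rest are linear attention."""
    return "full" if layer_idx % 4 == 0 else "linear"

pattern = [attention_kind(i) for i in range(1, 13)]
# layers 4, 8, 12 -> "full"; all others -> "linear"
```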
## Training Details

### Dataset
- Source: FineWeb-Edu (sample-10BT subset)
- Training Samples: 4 million examples
- Validation Split: 1% (40,000 samples)
- Text Processing: Dynamic truncation to 4x block_size during tokenization
- Tokenizer: Qwen3 fast tokenizer (input/output embedding weight tying enabled in the model)
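The card's "dynamic truncation to 4x block_size" is not spelled out; one plausible reading — capping each tokenized example at 4 × block_size tokens before packing into training blocks — can be sketched as follows (function name is hypothetical):

```python
BLOCK_SIZE = 512
MAX_TOKENS = 4 * BLOCK_SIZE  # 2048

def truncate_ids(token_ids, max_tokens=MAX_TOKENS):
    """Cap a tokenized example at 4x block_size. This is one reading of the
    card's 'dynamic truncation'; the actual preprocessing may differ."""
    return token_ids[:max_tokens]
```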
### Training Configuration
- Hardware: NVIDIA RTX 5090
- Training Time: 3 hours
- Loss Function: Cut Cross-Entropy (CCE), from "Cut Your Losses in Large-Vocabulary Language Models" — a memory-efficient cross-entropy that avoids materializing the full logit matrix, rather than the standard implementation
- Optimizer: AdEMAMix with parameters:
  - Betas: (0.9, 0.999, 0.999)
  - Alpha: 5.0
  - t_alpha: 5000, t_beta3: 5000
  - Weight decay: 0.1
- Learning Rate Schedule: Custom cosine with linear warmup
  - Start LR: 3e-4
  - Peak LR: 6e-4 (at 5000 warmup steps)
  - Min LR: 6e-5
- Batch Size: 64 per device
- Precision: BF16 with torch.compile optimization
- Hardware Optimizations: Flash Attention 2
- Epochs: 1
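The learning-rate schedule above (linear warmup from 3e-4 to 6e-4 over 5000 steps, then cosine decay to 6e-5) can be sketched as follows; the exact scheduler code is not published, so this is a reconstruction under the stated assumptions:

```python
import math

START_LR, PEAK_LR, MIN_LR = 3e-4, 6e-4, 6e-5
WARMUP_STEPS = 5_000

def lr_at(step: int, total_steps: int) -> float:
    """Custom cosine schedule with linear warmup (sketch of the card's
    description; hypothetical reconstruction, not the released code)."""
    if step < WARMUP_STEPS:
        # linear warmup from START_LR to PEAK_LR
        return START_LR + (PEAK_LR - START_LR) * step / WARMUP_STEPS
    # cosine decay from PEAK_LR down to MIN_LR
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```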
### Framework Versions
- PyTorch: 2.8.0+cu129
- Transformers: 4.57.0.dev0
- Flash Attention: 2.x
- CUDA: 12.9
## Evaluation Results

### Benchmark Performance (1-shot evaluation)
| Task | Score |
|---|---|
| ARC-Easy | 39.14% |
| HellaSwag | 26.55% |
| MMLU | 24.25% |
| ARC-Challenge | 17.24% |
All evaluations were performed in a 1-shot setting.
## Model Architecture Components

### Fourier Analysis Network (FANLayer)

Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":

`FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W̄_p X + B̄_p)]`

where `W_p` projects into the periodic (cos/sin) component and `W̄_p`, `B̄_p` form the ordinary affine component.
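A numpy sketch of this split, assuming fan_ratio allocates `d_p = fan_ratio × d_out` dimensions to each of the cos and sin parts (weight names and the exact dimension split are illustrative; the released code may differ):

```python
import numpy as np

def fan_layer(X, W_p, W_g, b_g):
    """FANLayer' sketch: periodic features via cos/sin of W_p X, concatenated
    with an ordinary affine part (no activation, per the FANformer variant)."""
    p = X @ W_p
    return np.concatenate([np.cos(p), np.sin(p), X @ W_g + b_g], axis=-1)

rng = np.random.default_rng(0)
d_in, d_out, fan_ratio = 512, 512, 0.125
d_p = int(fan_ratio * d_out)                 # 64 dims each for cos and sin (assumed split)
W_p = rng.normal(size=(d_in, d_p)) * 0.02
W_g = rng.normal(size=(d_in, d_out - 2 * d_p)) * 0.02
b_g = np.zeros(d_out - 2 * d_p)
out = fan_layer(rng.normal(size=(4, d_in)), W_p, W_g, b_g)
```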
### LayerNorm Scaling (LNS)

Implements the scaling factor 1/√ℓ, where ℓ is the 1-indexed layer depth, as described in "The Curse of Depth in Large Language Models":

`h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)`
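The per-layer factor is straightforward to compute; a minimal sketch for this 12-layer model:

```python
import math

def lns_scale(layer_depth: int) -> float:
    """LayerNorm Scaling factor 1/sqrt(l) for 1-indexed layer depth l."""
    return 1.0 / math.sqrt(layer_depth)

# deeper layers get progressively smaller scaling factors
scales = [lns_scale(l) for l in range(1, 13)]
```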
### Gradient-Preserving Activation Scaling (GPAS)

Scales activations in the forward pass while placing stop-gradient operations so that the backward pass is left unpenalized.
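One common straight-through way to get "scaled forward, identity backward" can be sketched as follows; `sg` stands in for a stop-gradient such as torch's `.detach()` (numerically an identity), and the GPAS paper's exact gating is not reproduced here:

```python
def sg(x):
    """Stand-in for a stop-gradient (e.g. torch's .detach()); numerically identity."""
    return x

def gpas_forward(x, s):
    """Straight-through scaling sketch: the forward value equals s * x, while
    the stop-gradient placement would make the backward pass an identity
    (d out / d x = 1). Assumption-laden; not the paper's exact formulation."""
    return s * sg(x) + x - sg(x)
```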
### Polynomial Composition Activations (PolyNorm)
Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".
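A numpy sketch of a PolyNorm-style activation — a weighted sum of normalized elementwise powers of the input. The coefficients here are illustrative placeholders (in the model they would be learned), and the paper's exact normalization may differ:

```python
import numpy as np

def polynorm(x, weights=(1.0, 0.5, 0.25), bias=0.0, eps=1e-6):
    """PolyNorm-style sketch: sum of L2-normalized elementwise powers x^i,
    each scaled by a (here hard-coded, normally learned) coefficient."""
    out = np.full_like(x, bias, dtype=float)
    for i, w in enumerate(weights, start=1):
        p = x ** i
        out += w * p / (np.linalg.norm(p, axis=-1, keepdims=True) + eps)
    return out

y = polynorm(np.ones((2, 4)))
```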
### Gated Delta Networks
Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
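A single recurrent step of a gated delta rule can be sketched in numpy: decay the state by a gate `alpha`, apply a `beta`-weighted delta-rule update toward the new key/value pair, then read out with the query. Signs and ordering follow one common formulation and are not verified against the released kernels:

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One gated delta-rule step (sketch after Gated DeltaNet).
    Shapes: S (d_v, d_k); k, q (d_k,); v (d_v,).
    S_t = alpha * S_{t-1}(I - beta * k k^T) + beta * v k^T ; output o_t = S_t q."""
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    return S, S @ q

# With an empty state, alpha = beta = 1, and q = k (unit norm),
# the readout recovers the stored value exactly.
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([1.0, 2.0, 3.0, 4.0])
S1, o = gated_delta_step(np.zeros((4, 4)), k, v, q=k, alpha=1.0, beta=1.0)
```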
## Intended Uses & Limitations

### Intended Uses
- Research into hybrid attention architectures
- Educational purposes for understanding advanced LLM components
- Small-scale language modeling experiments
- Benchmarking novel architectural components
### Limitations
- Relatively small model size (110M parameters) limits capability compared to larger models
- Training limited to 4M samples from single dataset
- Performance below state-of-the-art models on standard benchmarks
- Experimental architecture; stability in production settings has not been validated
### Recommendations
- Best suited for research and educational applications
- Consider fine-tuning for specific downstream tasks
- Monitor performance carefully if adapting for production use
## Training Infrastructure
- Mixed Precision: BF16 for numerical stability
- Compilation: torch.compile with max-autotune mode