library_name: transformers
tags:
  - pytorch
  - neollm
  - hybrid-attention
  - fanformer
  - gated-delta-networks
  - polynomial-activations
  - fineweb-edu
  - ademamix
  - custom-scheduler
  - flash-attention
  - torch-compile
pipeline_tag: text-generation
model-index:
  - name: NeoLLM
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: multiple-choice
          name: ARC-Easy
        metrics:
          - type: accuracy
            value: 39.14
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: multiple-choice
          name: HellaSwag
        metrics:
          - type: accuracy
            value: 26.55
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: multiple-choice
          name: MMLU
        metrics:
          - type: accuracy
            value: 24.25
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: multiple-choice
          name: ARC-Challenge
        metrics:
          - type: accuracy
            value: 17.24
license: apache-2.0
datasets:
  - HuggingFaceFW/fineweb-edu
language:
  - en

NeoLLM

NeoLLM is a hybrid-architecture language model that combines several recent techniques for efficient and effective language modeling. This 110M-parameter model features architectural innovations including Fourier Analysis Networks, a hybrid attention mechanism, and advanced normalization techniques.

Model Description

NeoLLM incorporates several cutting-edge components:

  • FANformer Integration: Fourier Analysis Network (FAN) layers for effective periodicity modeling, with a fan_ratio of 0.125
  • Hybrid Attention Architecture: Follows Qwen3-Next's approach of one full-attention layer for every three linear-attention layers
  • Polynomial Composition Activations: PolyNorm activation functions in the MLP layers for richer activation dynamics
  • Advanced Normalization: LayerNorm Scaling (LNS) and Gradient-Preserving Activation Scaling (GPAS)
  • Efficient Linear Attention: Gated Delta Networks for improved computational efficiency

Architecture Details

  • Model Size: 110M parameters (77M embeddings + 33M non-embeddings)
  • Hidden Size: 512
  • Layers: 12 layers with hybrid attention pattern
  • Attention Heads: 8 (with 2 KV heads using Grouped Query Attention)
  • Intermediate Size: 1024
  • Sequence Length: 512 tokens
  • Vocabulary: 151,665 tokens (Qwen3 tokenizer)

Layer Pattern

The model uses a hybrid attention pattern in which every fourth layer uses full attention and the remaining layers use linear attention:

  • Linear Attention: Layers 1, 2, 3, 5, 6, 7, 9, 10, 11 (Gated Delta Networks)
  • Full Attention: Layers 4, 8, 12 (Flash Attention 2)
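
The 1-in-4 layer pattern above can be sketched as a small helper (function and label names here are illustrative, not NeoLLM's internal identifiers):

```python
def layer_types(num_layers: int, full_every: int = 4) -> list:
    """Hybrid pattern: every `full_every`-th layer (1-indexed) uses full
    attention; the rest use linear attention (Gated Delta Networks)."""
    return [
        "full_attention" if i % full_every == 0 else "linear_attention"
        for i in range(1, num_layers + 1)
    ]

pattern = layer_types(12)  # full attention at layers 4, 8, 12
```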

Training Details

Dataset

  • Source: FineWeb-Edu (sample-10BT subset)
  • Training Samples: 4 million examples
  • Validation Split: 1% (40,000 samples)
  • Text Processing: Dynamic truncation to 4x block_size during tokenization
  • Tokenizer: Qwen3 Fast Tokenizer (the model ties input and output embedding weights)
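
One plausible reading of the dynamic-truncation step is clipping raw text before tokenization; the function name and the characters-per-token heuristic below are assumptions, not the actual preprocessing code:

```python
block_size = 512  # model sequence length (see Architecture Details)

def pretruncate(texts, block_size=block_size, chars_per_token=4):
    # Clip each document to roughly 4 * block_size tokens' worth of
    # characters, so the tokenizer never encodes far more text than the
    # training windows can use.
    limit = 4 * block_size * chars_per_token
    return [t[:limit] for t in texts]
```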

Training Configuration

  • Hardware: NVIDIA RTX 5090
  • Training Time: 3 hours
  • Loss Function: Cut Cross-Entropy from "Cut Your Losses in Large-Vocabulary Language Models", which computes the cross-entropy loss without materializing the full logit matrix, rather than the standard cross-entropy implementation
  • Optimizer: AdEMAMix with parameters:
    • Betas: (0.9, 0.999, 0.999)
    • Alpha: 5.0
    • t_alpha: 5000, t_beta3: 5000
    • Weight decay: 0.1
  • Learning Rate Schedule: Custom cosine with linear warmup
    • Start LR: 3e-4
    • Peak LR: 6e-4 (at 5000 warmup steps)
    • Min LR: 6e-5
  • Batch Size: 64 per device
  • Precision: BF16 with torch.compile optimization
  • Hardware Optimizations: Flash Attention 2
  • Epochs: 1
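
The warmup-then-cosine schedule above can be sketched as follows (the total step count is an assumption; only the start/peak/min learning rates and the 5000 warmup steps come from the configuration):

```python
import math

def lr_at(step, total_steps, warmup_steps=5000,
          start_lr=3e-4, peak_lr=6e-4, min_lr=6e-5):
    """Linear warmup from start_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return start_lr + (peak_lr - start_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```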

Framework Versions

  • PyTorch: 2.8.0+cu129
  • Transformers: 4.57.0.dev0
  • Flash Attention: 2.x
  • CUDA: 12.9

Evaluation Results

Benchmark Performance (1-shot evaluation)

| Task          | Score  |
|---------------|--------|
| ARC-Easy      | 39.14% |
| HellaSwag     | 26.55% |
| MMLU          | 24.25% |
| ARC-Challenge | 17.24% |

All evaluations were performed in a 1-shot setting.

Model Architecture Components

Fourier Analysis Network (FANLayer)

Based on "FANformer: Improving Large Language Models Through Effective Periodicity Modeling":

FANLayer'(X) = [cos(W_p X) || sin(W_p X) || (W_p X + B_p)]

where || denotes concatenation along the feature dimension.
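
A minimal PyTorch sketch of this layer (the dimension split and module names are assumptions derived from the formula and the fan_ratio of 0.125, not the exact NeoLLM implementation):

```python
import torch
import torch.nn as nn

class FANLayer(nn.Module):
    """A fraction (fan_ratio) of the output width feeds the periodic
    cos/sin branch (which shares one projection W_p); the remaining
    width is a plain affine branch (W x + B)."""
    def __init__(self, dim_in, dim_out, fan_ratio=0.125):
        super().__init__()
        p_dim = int(dim_out * fan_ratio)                # periodic branch width
        self.w_p = nn.Linear(dim_in, p_dim, bias=False) # shared by cos and sin
        self.w_g = nn.Linear(dim_in, dim_out - 2 * p_dim)

    def forward(self, x):
        p = self.w_p(x)
        return torch.cat([torch.cos(p), torch.sin(p), self.w_g(x)], dim=-1)
```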

LayerNorm Scaling (LNS)

Implements scaling factor 1/√ℓ as described in "The Curse of Depth in Large Language Models":

h^(ℓ) = LayerNorm(h^(ℓ)) × (1/√ℓ)
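
The scaling above can be sketched as a thin wrapper around a standard LayerNorm (module name is illustrative):

```python
import math
import torch
import torch.nn as nn

class ScaledLayerNorm(nn.Module):
    """LayerNorm Scaling: divide the LayerNorm output of layer ℓ
    (1-indexed) by sqrt(ℓ), per the formula above."""
    def __init__(self, dim, layer_idx):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.scale = 1.0 / math.sqrt(layer_idx)

    def forward(self, x):
        return self.norm(x) * self.scale
```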

Gradient-Preserving Activation Scaling (GPAS)

Scales activations without penalizing gradients using stop-gradient operations.
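
One way to realize this with a stop-gradient is sketched below; the learnable gate and its initialization are assumptions based on the description, not NeoLLM's exact code:

```python
import torch
import torch.nn as nn

class GPAS(nn.Module):
    """Forward pass scales activations by a learnable gate, but the scaled
    path is detached, so the gradient w.r.t. the input is the identity."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        scale = torch.sigmoid(self.gate)
        # forward value: x * scale; gradient w.r.t. x: 1 (identity path);
        # the gate still receives a gradient through the detached copy of x
        return x.detach() * scale + (x - x.detach())
```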

Polynomial Composition Activations (PolyNorm)

Custom activation function based on "Polynomial Composition Activations: Unleashing the Dynamics of Large Language Models".
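
A hedged sketch of a PolyNorm-style activation, written as a learnable weighted sum of normalized element-wise powers of the input (the order, normalization, and initialization here are assumptions, not NeoLLM's exact values):

```python
import torch
import torch.nn as nn

def _rms_norm(x, eps=1e-6):
    # RMS-normalize along the feature dimension
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

class PolyNorm(nn.Module):
    """Order-3 polynomial composition: sum_i a_i * Norm(x^i) + b."""
    def __init__(self, order=3):
        super().__init__()
        self.weights = nn.Parameter(torch.full((order,), 1.0 / order))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return sum(
            w * _rms_norm(x ** (i + 1))
            for i, w in enumerate(self.weights)
        ) + self.bias
```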

Gated Delta Networks

Linear attention mechanism from "Gated Delta Networks: Improving Mamba2 with Delta Rule" for efficient sequence modeling.
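
The gated delta rule can be written as a per-token recurrence. The naive loop below is a reference sketch of that recurrence (single head, no normalization; real implementations use a parallel/chunked kernel):

```python
import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Per-step reference:
        S_t = alpha_t * (S_{t-1} - beta_t * (S_{t-1} k_t) k_t^T) + beta_t * v_t k_t^T
        o_t = S_t q_t
    q, k, v: (T, d); alpha (decay gate), beta (write strength): (T,) in (0, 1).
    """
    T, d = q.shape
    S = torch.zeros(v.shape[-1], d)  # (d_v, d_k) associative state
    outs = []
    for t in range(T):
        kt, vt, qt = k[t], v[t], q[t]
        # decay the state and erase the old value stored under key kt,
        # then write the new key/value association
        S = alpha[t] * (S - beta[t] * torch.outer(S @ kt, kt)) \
            + beta[t] * torch.outer(vt, kt)
        outs.append(S @ qt)
    return torch.stack(outs)
```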

Intended Uses & Limitations

Intended Uses

  • Research into hybrid attention architectures
  • Educational purposes for understanding advanced LLM components
  • Small-scale language modeling experiments
  • Benchmarking novel architectural components

Limitations

  • Relatively small model size (110M parameters) limits capability compared to larger models
  • Training limited to 4M samples from single dataset
  • Performance below state-of-the-art models on standard benchmarks
  • Experimental architecture whose stability in production settings has not been validated

Recommendations

  • Best suited for research and educational applications
  • Consider fine-tuning for specific downstream tasks
  • Monitor performance carefully if adapting for production use

Training Infrastructure

  • Mixed Precision: BF16 for numerical stability
  • Compilation: torch.compile with max-autotune mode
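
A minimal sketch of this setup, combining BF16 autocast with torch.compile in max-autotune mode. A tiny Linear module stands in for NeoLLM, and CPU autocast is used here so the snippet runs anywhere; training itself used CUDA BF16 on the RTX 5090:

```python
import torch

model = torch.compile(torch.nn.Linear(8, 8), mode="max-autotune")

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(torch.randn(2, 8))
```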