KL3M 500M, 7th Gen Model, Step 0 (4x Stacked Initialization)

A 500M parameter language model created via 4x cyclic layer duplication (G_stack method, NeurIPS 2024) from the KL3M 170M Phase 2+A checkpoint. This is the initialization checkpoint before continued training.

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 500.3M (487M non-embedding)
  • Layers: 120 (4x stacked from 30)
  • Source: alea-institute/kl3m-006-170m-checkpoint-63000
  • Stacking Method: G_stack cyclic duplication
  • Training Status: Initialization only (0 steps on 120-layer architecture)
  • Precision: BF16

Model Architecture

  • Hidden Size: 576 (unchanged from source)
  • Layers: 120 (4× from 30)
  • Attention Heads: 9 (3 KV heads with GQA)
  • Intermediate Size: 1536 (unchanged from source)
  • Vocabulary: 131,072 tokens
  • RoPE Theta: 100,000
  • Parameter Growth: 181.7M → 500.3M (2.75× increase; reproduced in the sketch below)
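For reference, the 500.3M figure can be reproduced from the values above with a short back-of-the-envelope calculation. The sketch below is illustrative only; it assumes tied input/output embeddings, standard Llama projection shapes, and no attention biases.

# Approximate parameter count from the architecture above.
# Assumes tied input/output embeddings and no attention biases (not confirmed here).
hidden, intermediate, vocab, layers = 576, 1536, 131_072, 120
n_heads, n_kv_heads = 9, 3
head_dim = hidden // n_heads                # 64

attn = hidden * n_heads * head_dim          # q_proj
attn += 2 * hidden * n_kv_heads * head_dim  # k_proj + v_proj (GQA)
attn += n_heads * head_dim * hidden         # o_proj
mlp = 3 * hidden * intermediate             # gate/up/down_proj
norms = 2 * hidden                          # two RMSNorms per block

total = layers * (attn + mlp + norms) + vocab * hidden + hidden  # blocks + embedding + final norm
print(f"{total / 1e6:.1f}M")                # ≈ 500.3M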

G_stack Methodology

Cyclic Layer Duplication

Based on "Stacking Your Transformers" (NeurIPS 2024, arXiv:2405.15319):

Stacking Pattern:

Source (30 layers):  [0, 1, 2, ..., 28, 29]
Target (120 layers): [0-29, 0-29, 0-29, 0-29]
                     └─ Cyclic repetition 4 times

Preserved Components:

  • Embedding layer (wte): Copied once
  • Final layer norm (ln_f): Copied once
  • LM head: Copied once

Duplicated Components:

  • All 30 transformer blocks repeated 4 times
  • Each block contains: self-attention (q/k/v/o_proj), MLP (gate/up/down_proj), and layer norms (see the stacking sketch below)
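A minimal sketch of such a cyclic duplication using transformers, assuming the source checkpoint loads as a standard Llama model; the project's actual stacking script may differ in details such as dtype handling, buffer names, or tied-weight bookkeeping.

import re
import torch
from transformers import AutoConfig, AutoModelForCausalLM

SOURCE = "alea-institute/kl3m-006-170m-checkpoint-63000"
GROWTH = 4

# Load the 30-layer source and build an untrained 120-layer target with the same config.
src = AutoModelForCausalLM.from_pretrained(SOURCE, torch_dtype=torch.bfloat16)
cfg = AutoConfig.from_pretrained(SOURCE)
n_src = cfg.num_hidden_layers               # 30
cfg.num_hidden_layers = n_src * GROWTH      # 120
tgt = AutoModelForCausalLM.from_config(cfg).to(torch.bfloat16)

src_sd, tgt_sd = src.state_dict(), tgt.state_dict()
layer_key = re.compile(r"^model\.layers\.(\d+)\.(.+)$")

for key in tgt_sd:
    m = layer_key.match(key)
    if m is None:
        # Embeddings, final norm, LM head: copied once.
        tgt_sd[key] = src_sd[key].clone()
    else:
        # Transformer blocks: target layer t takes its weights from source layer t % 30.
        t, rest = int(m.group(1)), m.group(2)
        tgt_sd[key] = src_sd[f"model.layers.{t % n_src}.{rest}"].clone()

tgt.load_state_dict(tgt_sd)
tgt.save_pretrained("kl3m-007-500m-step0")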

Expected Training Efficiency

Per G_stack paper findings:

  • Token efficiency: ~54% fewer tokens to reach the target loss vs. training a 120-layer model from scratch
  • Computational savings: ~46% reduction in FLOPs
  • Convergence: Duplicated layers naturally diverge during training

Initialization Properties

Spectral Health (Inherited from Source)

All 120 layers start with identical conditioning (4 copies of source checkpoint):

Attention layers (480 weight matrices = 4 projections × 120 layers):

  • Max condition: 2501 (inherited from Phase A optimization)
  • Median condition: 2168
  • All duplicates start with proven stable values

MLP layers (360 weight matrices = 3 projections × 120 layers):

  • Excellent conditioning (median ~5-8)
  • Inherited from the well-conditioned source (a spot-check sketch follows this list)
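A sketch for spot-checking these condition numbers on the released checkpoint, assuming the standard Llama module layout (model.model.layers[i].self_attn.q_proj and so on); exact values may vary slightly with precision.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step0", torch_dtype=torch.float32
)

def condition_number(weight: torch.Tensor) -> float:
    # Ratio of the largest to smallest singular value of a 2-D weight matrix.
    s = torch.linalg.svdvals(weight)
    return (s.max() / s.min()).item()

# Attention projections of the first decoder layer.
attn = model.model.layers[0].self_attn
for name in ("q_proj", "k_proj", "v_proj", "o_proj"):
    print(name, round(condition_number(getattr(attn, name).weight), 1))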

Expected behavior:

  • Layers 0-29: First copy
  • Layers 30-59: Second copy (identical to 0-29 initially)
  • Layers 60-89: Third copy (identical to 0-29 initially)
  • Layers 90-119: Fourth copy (identical to 0-29 initially)

During training, these 4 copies will diverge and specialize, creating a true 120-layer deep network.
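Because the copies are exact at step 0, this can be checked directly; a short sketch, again assuming the standard Llama layer layout:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step0", torch_dtype="auto"
)

# At step 0, layer i and layer i + 30k should carry identical weights.
base = model.model.layers[0].self_attn.q_proj.weight
for idx in (30, 60, 90):
    other = model.model.layers[idx].self_attn.q_proj.weight
    print(idx, torch.equal(base, other))  # expected: True before any training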

Training Plan

Phase 2 Continued (500M Architecture)

Depth-scaled learning rates:

  • Muon LR: 3.29e-5 (depth-scaled: base × √(16/120); see the sketch after this list)
  • Aux LR: 9e-5
  • Per-layer LR multipliers: Same as 170M (0.7 for q/o_proj, 0.9 for k/v_proj)
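The Muon value follows from the depth-scaling rule quoted above, assuming the base learning rate matches the 9e-5 aux figure:

import math

# Depth-scaled Muon learning rate: base LR scaled by sqrt(reference_depth / new_depth).
base_lr = 9e-5                  # assumed base (matches the aux LR above)
ref_depth, new_depth = 16, 120
muon_lr = base_lr * math.sqrt(ref_depth / new_depth)
print(f"{muon_lr:.2e}")         # ≈ 3.29e-05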

Layer-selective spectral clamping (adjusted for depth):

  • Attention: Every 30 steps (more frequent due to 4× more layers; see the sketch after this list)
  • MLP: Every 160 steps (unchanged)
  • LM head: Every 80 steps (unchanged)
  • Max conditions: Same targets (2500/3000/2000)
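One way to implement such a clamp is to raise a weight matrix's smallest singular values until its condition number falls under the cap. The sketch below illustrates that idea for attention projections only; it is not the project's actual routine, and the module names, cadence, and cap value are assumptions taken from the figures above.

import torch

@torch.no_grad()
def clamp_condition(weight: torch.Tensor, max_condition: float) -> None:
    # Lift the smallest singular values so that sigma_max / sigma_min <= max_condition.
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    s = s.clamp(min=s.max() / max_condition)
    weight.copy_((u * s) @ vh)

def maybe_clamp_attention(model, step: int, every: int = 30, cap: float = 2500.0) -> None:
    # Periodic, layer-selective clamp of the attention projections.
    if step % every != 0:
        return
    for layer in model.model.layers:
        for name in ("q_proj", "k_proj", "v_proj", "o_proj"):
            clamp_condition(getattr(layer.self_attn, name).weight, cap)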

Batch configuration:

  • Micro batch: 4 (reduced from 6 for memory)
  • Grad accum: 3 (increased from 2)
  • Effective batch: 12 (maintained; see the arithmetic below)
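The effective batch size is simply the product of the two settings:

micro_batch = 4   # sequences per forward/backward pass
grad_accum = 3    # micro-batches accumulated per optimizer step
assert micro_batch * grad_accum == 12  # effective batch, unchanged from the 170M run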

Expected trajectory:

  • Steps 0-5K: Warmup, layers begin to diverge
  • Steps 5K-50K: Rapid improvement as copies specialize
  • Steps 50K-150K: Match/exceed 170M@63K quality
  • Steps 150K+: Push beyond 170M capabilities

Stacking Metadata

{
  "stacking_method": "cyclic_duplication",
  "paper": "Stacking Your Transformers (NeurIPS 2024, arXiv:2405.15319)",
  "source_checkpoint": "checkpoints/muon_170m_phase2/step-00063000",
  "growth_factor": 4,
  "source_layers": 30,
  "target_layers": 120,
  "stacking_pattern": "cyclic"
}

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 500M stacked initialization
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step0",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step0")

# Generate (will produce low-quality output until trained)
inputs = tokenizer("This Agreement is entered into", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

Note: This is an untrained initialization. For production use, see the trained checkpoints at alea-institute/kl3m-007-500m-checkpoint-*.

Why G_stack?

Advantages over Training from Scratch

  1. Proven warm start: All 120 layers begin with learned representations
  2. Faster convergence: 54% fewer tokens expected
  3. Stable training: Avoids cold-start instabilities
  4. Spectral inheritance: Good conditioning from Phase A source
  5. Simple implementation: Just cyclic duplication, no complex initialization

Alternative Approaches Not Used

  • Function-preserving init: More complex, similar results per paper
  • Random initialization: Much slower convergence
  • Progressive stacking: More complex training schedule
  • Width expansion first: Different scaling dimension

Model Comparison

| Model | Layers | Params | Training Status | Use Case |
|-------|--------|--------|-----------------|----------|
| kl3m-006-170m-checkpoint-63000 | 30 | 181.7M | Trained (63K steps) | Production 170M |
| kl3m-007-500m-step0 | 120 | 500.3M | Step 0 (init only) | Research/base |
| kl3m-007-500m-checkpoint-* | 120 | 500.3M | In training | Production 500M (future) |

Training Philosophy

G_stack enables efficient depth scaling:

  • Start with proven 170M model (63K steps, 15.83B tokens)
  • Stack to 4× depth (120 layers)
  • Continue training with 54% efficiency gain
  • Achieve 500M quality in ~100-150K steps (vs 250K from scratch)

This approach leverages transfer learning in the depth dimension rather than traditional width scaling or fine-tuning.

Next Steps

This initialization checkpoint will be trained with:

  • Target steps: 250,000 (conservative estimate)
  • Expected quality match: 170M@300K by step ~150K
  • Dataset: Same multi-domain legal corpus
  • Optimizer: Muon with Phase A improvements

Follow training progress in the kl3m-007-500m-checkpoint-* series.

Model Card Authors

Alea Institute

Citation

For G_stack technical details:

@inproceedings{gstack2024,
  title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
  author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
  booktitle={NeurIPS},
  year={2024},
  note={arXiv:2405.15319}
}

@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training with G_stack Depth Expansion},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={500M model via 4x cyclic duplication from 170M Phase 2+A}
}

License

Apache 2.0
