KL3M 500M, 7th Gen Model, Step 0 (4x Stacked Initialization)
A 500M parameter language model created via 4x cyclic layer duplication (G_stack method, NeurIPS 2024) from the KL3M 170M Phase 2+A checkpoint. This is the initialization checkpoint before continued training.
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 500.3M (~424.8M non-embedding)
- Layers: 120 (4x stacked from 30)
- Source: alea-institute/kl3m-006-170m-checkpoint-63000
- Stacking Method: G_stack cyclic duplication
- Training Status: Initialization only (0 steps on 120-layer architecture)
- Precision: BF16
Model Architecture
- Hidden Size: 576 (unchanged from source)
- Layers: 120 (4× from 30)
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536 (unchanged from source)
- Vocabulary: 131,072 tokens
- RoPE Theta: 100,000
- Parameter Growth: 181.7M → 500.3M (2.75× increase)
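The parameter growth above can be sanity-checked directly from the listed dimensions. The sketch below is a back-of-the-envelope count that assumes tied input/output embeddings and standard Llama-style RMSNorm weights; it is not taken from the project's own tooling.

# Rough parameter count from the architecture listed above
# (assumes tied embeddings and Llama-style RMSNorm, so treat it as an estimate).
hidden, intermediate, vocab = 576, 1536, 131_072
n_heads, n_kv_heads = 9, 3
head_dim = hidden // n_heads  # 64

attn = 2 * hidden * hidden + 2 * hidden * (n_kv_heads * head_dim)  # q/o + k/v projections
mlp = 3 * hidden * intermediate                                    # gate/up/down projections
norms = 2 * hidden                                                 # two RMSNorms per block
per_layer = attn + mlp + norms

def total_params(layers: int) -> int:
    # transformer blocks + tied embedding + final norm
    return layers * per_layer + vocab * hidden + hidden

print(f"30 layers:  {total_params(30) / 1e6:.1f}M")   # ~181.7M
print(f"120 layers: {total_params(120) / 1e6:.1f}M")  # ~500.3M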
G_stack Methodology
Cyclic Layer Duplication
Based on "Stacking Your Transformers" (NeurIPS 2024, arXiv:2405.15319):
Stacking Pattern:
Source (30 layers):  [0, 1, 2, ..., 28, 29]
Target (120 layers): [0-29, 0-29, 0-29, 0-29]
                     └─ cyclic repetition, 4 times
Preserved Components:
- Embedding layer (wte): Copied once
- Final layer norm (ln_f): Copied once
- LM head: Copied once
Duplicated Components:
- All 30 transformer blocks repeated 4 times
- Each block contains: self-attention (q/k/v/o_proj), MLP (gate/up/down_proj), layer norms
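A minimal sketch of this cyclic duplication is shown below, assuming the standard Hugging Face Llama module layout (model.embed_tokens, model.norm, model.layers, lm_head). It illustrates the pattern described above; it is not the exact script used to build this checkpoint.

import copy
import torch
from transformers import AutoModelForCausalLM

# Illustrative G_stack cyclic duplication for a Llama-style model.
src = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-006-170m-checkpoint-63000", torch_dtype=torch.bfloat16
)

growth = 4
cfg = copy.deepcopy(src.config)
cfg.num_hidden_layers = src.config.num_hidden_layers * growth  # 30 -> 120
dst = AutoModelForCausalLM.from_config(cfg)

# Components copied once: embedding, final norm, LM head.
dst.model.embed_tokens.load_state_dict(src.model.embed_tokens.state_dict())
dst.model.norm.load_state_dict(src.model.norm.state_dict())
dst.lm_head.load_state_dict(src.lm_head.state_dict())

# Transformer blocks repeated cyclically: target layer i receives source layer i % 30.
n_src = src.config.num_hidden_layers
for i, layer in enumerate(dst.model.layers):
    layer.load_state_dict(src.model.layers[i % n_src].state_dict())

dst = dst.to(torch.bfloat16)
dst.save_pretrained("kl3m-007-500m-step0")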
Expected Training Efficiency
Per G_stack paper findings:
- Token efficiency: ~54% fewer tokens to reach target loss vs training 120L from scratch
- Computational savings: ~46% reduction in FLOPs
- Convergence: Duplicated layers naturally diverge during training
Initialization Properties
Spectral Health (Inherited from Source)
All 120 layers start with identical conditioning (4 copies of source checkpoint):
Attention layers (480 total = 4 projections × 120 layers):
- Max condition: 2501 (inherited from Phase A optimization)
- Median condition: 2168
- All duplicates start with proven stable values
MLP layers (360 total):
- Excellent conditioning (median ~5-8)
- Inherited from well-conditioned source
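The conditioning figures above use the standard definition of the condition number (largest singular value divided by smallest). A sketch of how such statistics can be recomputed from the checkpoint follows; the exact tooling used by the project may differ.

import torch
from transformers import AutoModelForCausalLM

# Recompute per-projection condition numbers (max / min singular value).
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step0", torch_dtype=torch.float32
)

attn_conds, mlp_conds = [], []
for layer in model.model.layers:
    for proj in (layer.self_attn.q_proj, layer.self_attn.k_proj,
                 layer.self_attn.v_proj, layer.self_attn.o_proj):
        s = torch.linalg.svdvals(proj.weight.detach())
        attn_conds.append((s.max() / s.min()).item())
    for proj in (layer.mlp.gate_proj, layer.mlp.up_proj, layer.mlp.down_proj):
        s = torch.linalg.svdvals(proj.weight.detach())
        mlp_conds.append((s.max() / s.min()).item())

attn_conds.sort(); mlp_conds.sort()
print(f"attention: n={len(attn_conds)} max={attn_conds[-1]:.0f} "
      f"median={attn_conds[len(attn_conds) // 2]:.0f}")
print(f"mlp:       n={len(mlp_conds)} median={mlp_conds[len(mlp_conds) // 2]:.1f}")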
Expected behavior:
- Layers 0-29: First copy
- Layers 30-59: Second copy (identical to 0-29 initially)
- Layers 60-89: Third copy (identical to 0-29 initially)
- Layers 90-119: Fourth copy (identical to 0-29 initially)
During training, these 4 copies will diverge and specialize, creating a true 120-layer deep network.
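Because the four copies are byte-identical at step 0, the cyclic structure can be verified directly from the released weights. A quick spot-check, again assuming the standard Llama module layout, might look like this:

import torch
from transformers import AutoModelForCausalLM

# At step 0, layer i and layers i+30, i+60, i+90 should hold identical weights.
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-step0", torch_dtype=torch.bfloat16
)

layers = model.model.layers
for i in (0, 7, 15, 29):            # spot-check a few source positions
    reference = layers[i].state_dict()
    for offset in (30, 60, 90):     # second, third, and fourth copies
        copy_sd = layers[i + offset].state_dict()
        assert all(torch.equal(reference[k], copy_sd[k]) for k in reference), \
            f"layer {i} differs from layer {i + offset}"
print("All checked copies are identical at initialization.")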
Training Plan
Phase 2 Continued (500M Architecture)
Depth-scaled learning rates:
- Muon LR: 3.29e-5 (depth-scaled: base × √(16/120))
- Aux LR: 9e-5
- Per-layer LR multipliers: Same as 170M (0.7 for q/o_proj, 0.9 for k/v_proj)
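The Muon learning rate above follows from simple depth scaling; the arithmetic is sketched below. The 9e-5 base value and the reference depth of 16 are inferred from the 3.29e-5 figure and should be treated as illustrative rather than authoritative.

import math

# Depth-scaled Muon LR: base * sqrt(reference_depth / new_depth).
base_lr = 9e-5            # assumed base; reproduces 3.29e-5 when scaled below
ref_depth, new_depth = 16, 120
muon_lr = base_lr * math.sqrt(ref_depth / new_depth)
print(f"muon_lr = {muon_lr:.3e}")   # ~3.29e-5

# Per-layer multipliers carried over from the 170M recipe.
multipliers = {"q_proj": 0.7, "o_proj": 0.7, "k_proj": 0.9, "v_proj": 0.9}
per_layer_lr = {name: muon_lr * m for name, m in multipliers.items()}
print(per_layer_lr)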
Layer-selective spectral clamping (adjusted for depth):
- Attention: Every 30 steps (more frequent due to 4× more layers)
- MLP: Every 160 steps (unchanged)
- LM head: Every 80 steps (unchanged)
- Max conditions: Same targets (2500/3000/2000)
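A hedged sketch of what the spectral clamping step could look like is given below: cap a weight matrix's condition number by flooring its smallest singular values, on the cadences listed above. The mapping of the 2500/3000/2000 targets to attention/MLP/LM head follows the list order; this is an illustration, not the project's actual training hook.

import torch

@torch.no_grad()
def clamp_condition(weight: torch.Tensor, max_cond: float) -> None:
    # Cap sigma_max / sigma_min at max_cond by raising the smallest singular values.
    u, s, vh = torch.linalg.svd(weight.float(), full_matrices=False)
    floor = s.max() / max_cond
    if s.min() >= floor:
        return                                # already within the target
    weight.copy_((u * s.clamp(min=floor)) @ vh)

def apply_spectral_clamps(model, step: int) -> None:
    if step % 30 == 0:                        # attention projections
        for layer in model.model.layers:
            for proj in (layer.self_attn.q_proj, layer.self_attn.k_proj,
                         layer.self_attn.v_proj, layer.self_attn.o_proj):
                clamp_condition(proj.weight, 2500)
    if step % 160 == 0:                       # MLP projections
        for layer in model.model.layers:
            for proj in (layer.mlp.gate_proj, layer.mlp.up_proj, layer.mlp.down_proj):
                clamp_condition(proj.weight, 3000)
    if step % 80 == 0:                        # LM head
        clamp_condition(model.lm_head.weight, 2000)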
Batch configuration:
- Micro batch: 4 (reduced from 6 for memory)
- Grad accum: 3 (increased from 2)
- Effective batch: 12 (maintained)
Expected trajectory:
- Steps 0-5K: Warmup, layers begin to diverge
- Steps 5K-50K: Rapid improvement as copies specialize
- Steps 50K-150K: Match/exceed 170M@63K quality
- Steps 150K+: Push beyond 170M capabilities
Stacking Metadata
{
"stacking_method": "cyclic_duplication",
"paper": "Stacking Your Transformers (NeurIPS 2024, arXiv:2405.15319)",
"source_checkpoint": "checkpoints/muon_170m_phase2/step-00063000",
"growth_factor": 4,
"source_layers": 30,
"target_layers": 120,
"stacking_pattern": "cyclic"
}
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load the 500M stacked initialization
model = AutoModelForCausalLM.from_pretrained(
"alea-institute/kl3m-007-500m-step0",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-step0")
# Generate (will produce low-quality output until trained)
inputs = tokenizer("This Agreement is entered into", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))
Note: This is an untrained initialization. For production use, see the trained checkpoints at alea-institute/kl3m-007-500m-checkpoint-*.
Why G_stack?
Advantages over Training from Scratch
- Proven warm start: All 120 layers begin with learned representations
- Faster convergence: 54% fewer tokens expected
- Stable training: Avoids cold-start instabilities
- Spectral inheritance: Good conditioning from Phase A source
- Simple implementation: Just cyclic duplication, no complex initialization
Alternative Approaches Not Used
- Function-preserving init: More complex, similar results per paper
- Random initialization: Much slower convergence
- Progressive stacking: More complex training schedule
- Width expansion first: Different scaling dimension
Model Comparison
| Model | Layers | Params | Training Status | Use Case |
|---|---|---|---|---|
| kl3m-006-170m-checkpoint-63000 | 30 | 181.7M | Trained (63K steps) | Production 170M |
| kl3m-007-500m-step0 | 120 | 500.3M | Step 0 (init only) | Research/base |
| kl3m-007-500m-checkpoint-* | 120 | 500.3M | In training | Production 500M (future) |
Training Philosophy
G_stack enables efficient depth scaling:
- Start with proven 170M model (63K steps, 15.83B tokens)
- Stack to 4× depth (120 layers)
- Continue training with 54% efficiency gain
- Achieve 500M quality in ~100-150K steps (vs 250K from scratch)
This approach leverages transfer learning in the depth dimension rather than traditional width scaling or fine-tuning.
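The step estimate is consistent with the efficiency figure cited earlier; a rough check, treating the 250K from-scratch budget as the baseline, is:

# ~54% fewer tokens implies roughly 46% of the from-scratch step budget.
scratch_steps = 250_000
gstack_steps = scratch_steps * (1 - 0.54)
print(f"~{gstack_steps:,.0f} steps")  # ~115,000, within the quoted 100-150K range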
Next Steps
This initialization checkpoint will be trained with:
- Target steps: 250,000 (conservative estimate)
- Expected quality match: 170M@300K by step ~150K
- Dataset: Same multi-domain legal corpus
- Optimizer: Muon with Phase A improvements
Follow training progress in the kl3m-007-500m-checkpoint-* series.
Model Card Authors
Alea Institute
Citation
For G_stack technical details:
@inproceedings{gstack2024,
title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
booktitle={NeurIPS},
year={2024},
note={arXiv:2405.15319}
}
@misc{kl3m2025,
title={KL3M: Knowledge-Guided Language Model Training with G_stack Depth Expansion},
author={Alea Institute},
year={2025},
url={https://arxiv.org/abs/2504.07854},
note={500M model via 4x cyclic duplication from 170M Phase 2+A}
}
License
Apache 2.0