DETAILS = """
### Motivation

Existing tools like the [Hugging Face Model Memory Estimator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage), [DeepSpeed Calculator](https://huggingface.co/spaces/andstor/deepspeed-model-memory-usage), and [DeepSpeed Native Utility](https://deepspeed.readthedocs.io/en/latest/memory.html) are valuable but don't support the full range of modern training configurations.

This tool adds:

- Arbitrary model configurations beyond preset architectures
- FSDP and 5D parallelism support
- Interactive memory breakdowns by category to inform configuration decisions

### References

Helpful resources used while building this:

- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
- [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198)
- [Transformer Math - Michael Wornow](https://michaelwornow.net/2024/01/18/counting-params-in-transformer)
- [Transformer Math 101](https://blog.eleuther.ai/transformer-math/)
"""
INSTRUCTIONS = """
This calculator estimates a coarse upper bound on the memory used per GPU during training (excluding intermediates).
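
As a rough illustration of how such a per-GPU upper bound is typically composed (a simplified sketch with made-up numbers, not this calculator's exact accounting):

```python
# Illustrative only: a crude per-GPU estimate combining model states and activations.
# The numbers and the sharding model here are simplifying assumptions.
def rough_per_gpu_gib(n_params, state_bytes_per_param=16, activations_gib=20.0,
                      dp_shard=1, tp=1, pp=1):
    # Weights, gradients, and optimizer states are split across FSDP/ZeRO shards and TP/PP ranks.
    model_states = n_params * state_bytes_per_param / (dp_shard * tp * pp) / 2**30
    # Tensor (+ sequence) parallelism also divides activation memory.
    return model_states + activations_gib / tp

print(f"{rough_per_gpu_gib(7e9, dp_shard=8, tp=2):.1f} GiB")  # ≈ 16.5 GiB for these toy inputs
```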

## How to Use

1. Use the presets, or adjust the parallelism, model, and training panels to match your run.
2. Press **Calculate** to refresh the memory breakdown chart.
3. Review the details and references below for context on the estimates.
"""
LIMITATIONS = """
### Key Assumptions

- Standard transformer architecture with homogeneous layers
- Adam optimizer
- Mixed precision keeps a full-precision master copy of the weights (see the byte-accounting sketch after this list)
- Tensor parallelism includes sequence parallelism
- Pipeline parallelism maintains consistent activation memory due to the schedule
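
To make the Adam and mixed-precision assumptions concrete, here is a hypothetical byte-per-parameter accounting (illustrative values; the calculator's internal formula may differ):

```python
# Illustrative bytes per parameter for model states under mixed-precision Adam.
BF16 = 2  # bf16 working weights and gradients
FP32 = 4  # fp32 master weights and Adam moments

bytes_per_param = (
    BF16        # working weights
    + BF16      # gradients
    + FP32      # fp32 master copy of the weights
    + 2 * FP32  # Adam first and second moments
)               # = 16 bytes per parameter before sharding

# FSDP / ZeRO-3 shards these states across the data-parallel group:
dp_shard = 8
print(f"{7e9 * bytes_per_param / dp_shard / 2**30:.1f} GiB of model states per GPU for a 7B model")  # ≈ 13.0 GiB
```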

### Not Currently Supported

- Non-standard architectures (alternating dense/sparse layers, custom attention)
- Multi-modal models with vision layers
- Non-homogeneous parameter dtypes (e.g. BF16 & MXFP4 in GPT-OSS); mixed precision itself is supported
- Kernel/framework overhead and intermediate memory

For advanced configurations, results should be validated against profiling.
"""
ACCURACY = """
I validated this calculator against the projected memory usage in The Ultra-Scale Playbook to within 10%. Some overage is expected, since the calculator makes pessimistic assumptions and looks for peak memory. Note that you could still OOM from intermediates!

I welcome detailed memory usage reports, along with configurations and framework details, to tune this further!
"""