DETAILS = """
### Motivation

Existing tools like the [Hugging Face Model Memory Estimator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage), [DeepSpeed Calculator](https://huggingface.co/spaces/andstor/deepspeed-model-memory-usage), and [DeepSpeed Native Utility](https://deepspeed.readthedocs.io/en/latest/memory.html) are valuable but don't support the full range of modern training configurations.

This tool adds:

- Arbitrary model configurations beyond preset architectures
- FSDP and 5D parallelism support
- Interactive memory breakdowns by category to inform configuration decisions

### References

Helpful resources used while building this:

- [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
- [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198)
- [Transformer Math - Michael Wornow](https://michaelwornow.net/2024/01/18/counting-params-in-transformer)
- [Transformer Math 101](https://blog.eleuther.ai/transformer-math/)
"""
INSTRUCTIONS = """
This calculator estimates a coarse upper bound on the memory used per GPU during training (excluding intermediates).
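
As a rough illustration of how such a per-GPU upper bound is typically composed (a simplified sketch with made-up numbers, not this calculator's exact accounting):

```python
# Illustrative only: a crude per-GPU estimate combining model states and activations.
# The numbers and the sharding model here are simplifying assumptions.
def rough_per_gpu_gib(n_params, state_bytes_per_param=16, activations_gib=20.0,
                      dp_shard=1, tp=1, pp=1):
    # Weights, gradients, and optimizer states are split across FSDP/ZeRO shards and TP/PP ranks.
    model_states = n_params * state_bytes_per_param / (dp_shard * tp * pp) / 2**30
    # Tensor (+ sequence) parallelism also divides activation memory.
    return model_states + activations_gib / tp

print(f"{rough_per_gpu_gib(7e9, dp_shard=8, tp=2):.1f} GiB")  # ≈ 16.5 GiB for these toy inputs
```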

## How to Use

1. Use the presets, or adjust the parallelism, model, and training panels to match your run.
2. Press **Calculate** to refresh the memory breakdown chart.
3. Review the details and references below for context on the estimates.
"""
LIMITATIONS = """
### Key Assumptions

- Standard transformer architecture with homogeneous layers
- Adam optimizer
- Mixed precision keeps a full-precision master copy of the weights (see the byte-accounting sketch after this list)
- Tensor parallelism includes sequence parallelism
- Pipeline parallelism maintains consistent activation memory due to the schedule
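
To make the Adam and mixed-precision assumptions concrete, here is a hypothetical byte-per-parameter accounting (illustrative values; the calculator's internal formula may differ):

```python
# Illustrative bytes per parameter for model states under mixed-precision Adam.
BF16 = 2  # bf16 working weights and gradients
FP32 = 4  # fp32 master weights and Adam moments

bytes_per_param = (
    BF16        # working weights
    + BF16      # gradients
    + FP32      # fp32 master copy of the weights
    + 2 * FP32  # Adam first and second moments
)               # = 16 bytes per parameter before sharding

# FSDP / ZeRO-3 shards these states across the data-parallel group:
dp_shard = 8
print(f"{7e9 * bytes_per_param / dp_shard / 2**30:.1f} GiB of model states per GPU for a 7B model")  # ≈ 13.0 GiB
```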

### Not Currently Supported

- Non-standard architectures (alternating dense/sparse layers, custom attention)
- Multi-modal models with vision layers
- Non-homogeneous parameter dtypes (e.g. BF16 & MXFP4 in GPT-OSS); mixed precision itself is supported
- Kernel/framework overhead and intermediate memory

For advanced configurations, results should be validated against profiling.
"""
ACCURACY = """
I validated this calculator against the projected memory usage in The Ultra-Scale Playbook to within 10%. Some overage is expected, since the calculator makes pessimistic assumptions and looks for peak memory. Note that you could still OOM from intermediates!

I welcome detailed memory usage reports, along with configurations and framework details, to tune this further!
"""