---
license: apache-2.0
pipeline_tag: image-to-video
---

# HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
## 🔥 Latest News
* A Best-Practice Guide for HuMo will be released soon. Stay tuned.
* Sep 16, 2025: 🔥🔥 We release the [1.7B weights](https://huggingface.co/bytedance-research/HuMo/tree/main/HuMo-1.7B), which can generate a 480P video in 8 minutes on a 32 GB GPU. The visual quality is lower than that of the 17B model, but the audio-visual sync remains nearly unaffected.
* Sep 13, 2025: 🔥🔥 The 17B model is merged into [ComfyUI-Wan](https://github.com/kijai/ComfyUI-WanVideoWrapper). Thanks to [kijai](https://github.com/kijai) for the update!
* Sep 10, 2025: 🔥🔥 We release the [17B weights](https://huggingface.co/bytedance-research/HuMo/tree/main/HuMo-17B) and inference codes.
* Sep 9, 2025: We release the [project page](https://phantom-video.github.io/HuMo/) and [Technique-Report](https://arxiv.org/abs/2509.08519/) of **HuMo**.
## ✨ Key Features
HuMo is a unified, human-centric video generation framework designed to produce high-quality, fine-grained, and controllable human videos from multimodal inputs—including text, images, and audio. It supports strong text prompt following, consistent subject preservation, and synchronized audio-driven motion.
> - **VideoGen from Text-Image** - Customize character appearance, clothing, makeup, props, and scenes using text prompts combined with reference images.
> - **VideoGen from Text-Audio** - Generate audio-synchronized videos solely from text and audio inputs, removing the need for image references and enabling greater creative freedom.
> - **VideoGen from Text-Image-Audio** - Achieve the highest level of customization and control by combining text, image, and audio guidance.
## 📑 Todo List
- [x] Release Paper
- [x] Checkpoint of HuMo-17B
- [x] Checkpoint of HuMo-1.7B
- [x] Inference Codes
- [ ] Text-Image Input
- [x] Text-Audio Input
- [x] Text-Image-Audio Input
- [x] Multi-GPU Inference
- [ ] Best-Practice Guide for HuMo
- [ ] Prompts to Generate Demo of ***Faceless Thrones***
- [ ] Training Data
## ⚡️ Quickstart
### Installation
``` sh
conda create -n humo python=3.11
conda activate humo
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install flash_attn==2.6.3
pip install -r requirements.txt
conda install -c conda-forge ffmpeg
```
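As a quick, optional sanity check (a minimal sketch, assuming the `humo` environment is active and a CUDA-capable GPU is visible), you can verify that the core dependencies import correctly:
``` sh
# Optional sanity check: confirm PyTorch sees the GPU and FlashAttention imports.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "import flash_attn; print('flash_attn imported OK')"
```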
### Model Preparation
| Models | Download Link | Notes |
|--------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------|
| HuMo-17B | 🤗 [Huggingface](https://huggingface.co/bytedance-research/HuMo/tree/main/HuMo-17B) | Supports 480P & 720P |
| HuMo-1.7B | 🤗 [Huggingface](https://huggingface.co/bytedance-research/HuMo/tree/main/HuMo-1.7B) | Lightweight; runs on a 32 GB GPU |
| Wan-2.1 | 🤗 [Huggingface](https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B) | VAE & text encoder |
| Whisper-large-v3 | 🤗 [Huggingface](https://huggingface.co/openai/whisper-large-v3) | Audio encoder |
| Audio separator | 🤗 [Huggingface](https://huggingface.co/huangjackson/Kim_Vocal_2) | Removes background noise (optional) |
Download models using huggingface-cli:
``` sh
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./weights/Wan2.1-T2V-1.3B
huggingface-cli download bytedance-research/HuMo --local-dir ./weights/HuMo
huggingface-cli download openai/whisper-large-v3 --local-dir ./weights/whisper-large-v3
huggingface-cli download huangjackson/Kim_Vocal_2 --local-dir ./weights/audio_separator
```
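After the downloads complete, the local layout should mirror the `--local-dir` paths above (shown here for orientation; the exact files inside each folder depend on the corresponding repository):
``` sh
weights/
├── Wan2.1-T2V-1.3B/     # VAE & text encoder
├── HuMo/                # HuMo-17B and HuMo-1.7B checkpoints
├── whisper-large-v3/    # audio encoder
└── audio_separator/     # Kim_Vocal_2 vocal separator (optional)
```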
### Run Multimodal-Condition-to-Video Generation
Our model is compatible with both 480P and 720P resolutions; 720P inference achieves noticeably better visual quality.
> Some tips
> - Please prepare your text, reference images, and audio as described in [test_case.json](./examples/test_case.json).
> - We support multi-GPU inference using FSDP + Sequence Parallel (see the launch sketch below).
> - The model is trained on 97-frame videos at 25 FPS (about 3.9 seconds). Generating videos longer than 97 frames may degrade performance. We will provide a new checkpoint for longer generation.
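A minimal multi-GPU launch sketch is shown below. The script name and arguments are placeholders (assumptions), not the repository's actual entry point; consult the released inference code for the real command. `torchrun` and its `--nproc_per_node` flag are standard PyTorch.
``` sh
# Hypothetical multi-GPU launch (FSDP + Sequence Parallel).
# NOTE: "generate.py" and its flags are placeholders; use the actual entry
# point and arguments from the HuMo inference code.
torchrun --nproc_per_node=8 generate.py --config humo/configs/inference/generate.yaml
```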
#### Configure HuMo
HuMo’s behavior and output can be customized by modifying the [generate.yaml](humo/configs/inference/generate.yaml) configuration file.
The following parameters control generation length, video resolution, and how text, image, and audio inputs are balanced:
```yaml
generation:
frames: