
WaveGen: Wave Generative Method

[Paper] | [Project Page] | [Huggingface] | [Document] | [Org Materials] | [WSMs] | [ControlWave]

This generation method is mainly used to create the core space of the WSM. The input is text/wsg; the output is a set of mathematical functions (differentiable, movable, and deformable waves, rather than probability clouds) that are used to construct the core space for simulating the operation of the world.
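To make "a core space built from functions" concrete, below is a purely illustrative Python sketch of one such time-varying primitive. The class name, fields, and the choice of a superquadric with a polynomial center trajectory are assumptions for illustration only, not WaveGen's actual parameterization (superquadric fitting does appear later in the data-preprocessing step).

# Illustrative sketch only: a "wave" is modeled here as a differentiable,
# movable, deformable primitive (a superquadric) whose pose varies with time.
# The names and fields are hypothetical, not WaveGen's real parameterization.
from dataclasses import dataclass
import numpy as np

@dataclass
class SuperquadricWave:
    scale: np.ndarray        # (3,) half-axes a1, a2, a3
    eps: np.ndarray          # (2,) shape exponents eps1 (z), eps2 (xy)
    center_poly: np.ndarray  # (K, 3) polynomial coefficients of the center over time

    def center(self, t: float) -> np.ndarray:
        # Center trajectory as a polynomial in t: sum_k c_k * t**k
        powers = np.array([t ** k for k in range(len(self.center_poly))])
        return powers @ self.center_poly

    def inside_outside(self, points: np.ndarray, t: float) -> np.ndarray:
        # Standard superquadric inside-outside function; values < 1 mean "inside".
        p = (points - self.center(t)) / self.scale
        e1, e2 = self.eps
        xy = (np.abs(p[:, 0]) ** (2 / e2) + np.abs(p[:, 1]) ** (2 / e2)) ** (e2 / e1)
        return xy + np.abs(p[:, 2]) ** (2 / e1)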

First, you should read the project page to understand that WaveGen is currently relatively weak and operates at the level of multiple objects with simple dynamic shapes. If you are pursuing a broad range of tasks, please check WSMs (the Wave2Pixel decoder), which is the main focus of our current work. If you want to continue developing WaveGen, please read on.

Abstract

Wave Generative Method is a conceptually new generation mechanism¹. It aims to use mathematical functions to losslessly record dynamic 3D shapes, fundamentally reducing the amount of generated strings/values while maintaining the same expressive ability. It is dedicated to achieving new features that were not well accomplished by previous methods²: consistency, native 3D+t, variable resolution, unlimited duration, predicting the future, variable FPS, physics, causality, predicting the past, pixel-level control of the world distribution, synchronizing the real world with the model's core space, training the world itself, etc.

This architecture is essentially designed for training the core space of the World Snapshot Model (WSM), and it usually requires a WSMs decoder (by default, a sub-module of WaveGen). For users, the ControlWave UI is recommended for control.

Motivation: As stated in the Unified Law for Visual Tasks, WSM can almost uniformly and controllably generate any type of visual output. However, we hope to equip the traditional generation method with a core space (functional implicit reconstruction). Therefore, we developed the WaveGen generation method to obtain these advanced functions/features and address this shortcoming. It is not yet perfect, but for general cases the traditional generation methods are sufficient.

1. It is mainly based on RF and parametric expression of mathematical primitives.

2. These features (except for predicting the future, consistency, unlimited duration, native 3D+t) are almost unique to the Wave Model.

Inference

For users:

The ControlWave UI is recommended for control.

For developers:

First, prepare the code.

# You need to install the submodules.
git clone --recursive git@github.com:World-Snapshot/WaveGen.git

# If you have cloned WaveGen but do not have WSMs, or if WSMs needs to be updated.
cd WaveGen
git submodule update --remote --merge

After cd WaveGen, you can see that each folder represents a WaveGen model, which facilitates development and model switching and makes it convenient for the ControlWave UI to use. If you plan to use a particular WaveGen model, you need to install the corresponding environment.

Then, we can prepare the environment.

conda create -n WaveGen python=3.11 -y
conda activate WaveGen
pip install -r Augustus_v1/env/requirements_WaveGen.txt

Training

Estimated general WaveGen & WSM training plan (as of 2025.9.5):

Text2Wave:

  1. First, freeze the decoder and train the main model so that it can generate an appropriate wave space based on the camera position (this requires the common RealCam-Vid dataset and the Articulation-XL2.0 dataset; both may require us to re-extract the relevant camera information). The goal is mainly to train a wave-space generation model with 3D knowledge: the former provides scenes, extremely long texts, and high-quality dynamic perspectives, while the latter uses normalized point clouds to learn the shapes of objects. A minimal decoder-freezing sketch is given after the note below.

Note: For the initial version, I used MOVi-A instead of the datasets above, merely as a proof of concept.
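For the first stage, a minimal PyTorch sketch of "freeze the decoder, train the main model" could look like the following; the submodule name decoder (for the WSMs Wave2Pixel decoder) and the helper name are assumptions about how the model object is organized, not the repository's actual API.

# Minimal sketch of stage-1 training: freeze the Wave2Pixel decoder and
# optimize only the wave-space generator. Module names are assumptions.
import torch

def build_stage1_optimizer(model: torch.nn.Module, lr: float = 1e-4):
    # Freeze every decoder parameter so gradients are neither computed nor applied.
    for p in model.decoder.parameters():
        p.requires_grad_(False)
    model.decoder.eval()

    # Optimize only the remaining trainable parameters (the wave generator).
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)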

  1. Prepare Data:

Step 1: Download MOVi-A Dataset

python EMS-superquadric_fitting_inference/download_movi_a.py

This will download the MOVi-A dataset from Google Cloud Storage to data/movi_a_128x128/. The download includes:

  • Train split: ~9,700 samples
  • Validation split: ~250 samples
  • Each sample contains 24 frames with RGB, depth, segmentation, and metadata

Note: The download can be interrupted and resumed by running the script again.
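As a quick sanity check after downloading, you can count the per-split samples. The snippet below assumes each split is a subdirectory of data/movi_a_128x128 containing one folder per sample, which may differ from the actual layout produced by the download script.

# Quick sanity check of the downloaded dataset. The directory layout
# (one subdirectory per sample under each split) is an assumption.
from pathlib import Path

root = Path("data/movi_a_128x128")
for split in ("train", "validation"):
    split_dir = root / split
    samples = [d for d in split_dir.iterdir() if d.is_dir()] if split_dir.exists() else []
    print(f"{split}: {len(samples)} samples found")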

Step 2: Preprocess Dataset (Generate Superquadric Caches)

After downloading, preprocess the dataset to fit superquadrics to point clouds and generate cache files:

# Preprocess training set (process first 100 samples for quick testing)
python data/preprocess_dataset.py \
    --data_root data/movi_a_128x128 \
    --split train \
    --max_samples 100 \
    --num_workers 8

# Preprocess validation set
python data/preprocess_dataset.py \
    --data_root data/movi_a_128x128 \
    --split validation \
    --max_samples 10 \
    --num_workers 8

Parameters:

  • --max_samples: Number of samples to process (-1 for all samples)
  • --num_workers: Number of parallel processes (adjust based on your CPU cores)

This step generates Full_Sample_Data_for_Learning_Target.npz files for each sample, containing all training data (superquadric parameters, camera info, physics properties). Processing time: ~5-10 seconds per sample.
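To see what a cache actually contains, you can open it with NumPy and list the stored arrays; the sample path below is illustrative, and the key names are simply whatever the preprocessing script stored.

# Inspect one preprocessed cache file. The sample folder in the path is
# illustrative; point it at any generated cache.
import numpy as np

cache = np.load(
    "data/movi_a_128x128/train/sample_000/Full_Sample_Data_for_Learning_Target.npz",
    allow_pickle=True,
)
for key in cache.files:
    print(key, cache[key].shape, cache[key].dtype)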

Step 2.5 (Optional): Merge .npy Files to Reduce File Count

For better storage efficiency and easier file transfer, you can merge individual .npy files into compressed .npz archives:

# Merge all .npy files in train and validation splits
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128

# Preview what will be merged (dry-run mode)
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --dry-run

# Merge only specific split
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --split train

# Revert back to .npy files if needed
python data/merge_npy_to_npz.py --data_root data/movi_a_128x128 --revert

Benefits:

  • Reduces file count by ~98% (e.g., 303 files → 5 .npz files per sample)
  • Saves ~70% storage space with compression
  • Faster file transfers and backups
  • RGB images (.png) are preserved as-is

Note: The training code automatically handles both .npy and .npz formats, so this step is completely optional.
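For intuition, the merge amounts to collecting many small .npy arrays and writing them into a compressed archive. The sketch below is a simplified, hypothetical version (one archive per sample folder); the real merge_npy_to_npz.py groups files into several .npz archives and also supports reverting.

# Simplified illustration of the .npy -> .npz merge: gather all .npy files
# in a sample folder and write one compressed archive.
from pathlib import Path
import numpy as np

def merge_sample(sample_dir: Path) -> None:
    arrays = {f.stem: np.load(f, allow_pickle=True) for f in sorted(sample_dir.glob("*.npy"))}
    if arrays:
        np.savez_compressed(sample_dir / "merged.npz", **arrays)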

Step 3: Start Training

cd WaveGen_Augustus_v1
bash launch_text2wave_training.sh

The training script will:

  1. Automatically check and preprocess any uncached samples
  2. Train the Text2Wave model using T5 encoder-decoder
  3. Save checkpoints and generation results to core_space/

Training Configuration:

Edit WaveGen_Augustus_v1/configs/default.yaml to adjust (a small loader sketch follows this list):

  • data.max_sequences: Number of training samples (default: 100)
  • training.batch_size: Batch size (default: 24)
  • training.max_steps: Total training steps (default: 50000)
  • Loss weights and other hyperparameters
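If you prefer to inspect or tweak these values programmatically, something like the following works, assuming the dotted names above correspond to nested data: and training: sections in the YAML and that PyYAML is installed; both are assumptions about the config layout.

# Load the training config and adjust the documented knobs. The nesting of
# data.* and training.* keys is assumed from the dotted names above.
import yaml

cfg_path = "WaveGen_Augustus_v1/configs/default.yaml"
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

cfg["data"]["max_sequences"] = 100     # number of training samples
cfg["training"]["batch_size"] = 24     # per-step batch size
cfg["training"]["max_steps"] = 50000   # total training steps

with open(cfg_path, "w") as f:
    yaml.safe_dump(cfg, f)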

Resume Training:

bash launch_text2wave_training.sh 1000  # Resume from step 1000

WSMs (Wave2Pixel Decoder):

See the training section of WSMs.
