MIMIC: Motion Imitation from Massive Internet Clips
A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.
Model Details
- Architecture: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
- Parameters: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
- Action space: 22-DoF joint angles at 10Hz
- Action horizon: 16 steps (1.6s)
- Training data: movement-strict-164 (164,390 clips, ~1.9M samples), produced by re-tracking and VLM-judging the raw output of our two-stage Kinetics-700 processing pipeline.
- Best validation loss: 0.1097
- Training compute: 4 x RTX Pro Blackwell, 5.9 days (566 GPU-hours)
- Checkpoint step: 29,060
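The headline figures in the list above can be cross-checked with a few lines of arithmetic. This is only a sanity-check sketch: the variable names are illustrative, and the inputs are copied from the list rather than read from the checkpoint.

```python
# Component sizes in millions of parameters, copied from the list above.
components_m = {
    "truncated LLM": 2200,
    "vision encoder": 415,
    "DiT action head": 1280,
    "LoRA adapters": 132,
}

# Total parameter count in billions (the card rounds this to ~4.0B).
total_b = sum(components_m.values()) / 1000

# Action horizon in seconds: 16 steps at 10 Hz.
horizon_s = 16 / 10

# Training compute in GPU-hours: 4 GPUs for 5.9 days.
gpu_hours = 4 * 5.9 * 24

print(f"total parameters: {total_b:.3f}B")
print(f"action horizon:   {horizon_s:.1f}s")
print(f"training compute: {gpu_hours:.0f} GPU-hours")
```

The components sum to 4.027B, which matches the rounded ~4.0B total.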
Held-out evaluation
500 clips sampled from the held-out validation split. Joint-angle RMSE in degrees, future-only ($t{=}1..15$, 1.5s of prediction), with rolling re-init every $S$ timesteps using ground-truth state.
| Re-init interval $S$ (steps) | MIMIC (this model) | Static baseline | Linear baseline |
|---|---|---|---|
| 3 | 23.8 | 22.1 | 39.4 |
| 8 | 32.7 | 32.3 | 95.0 |
| 16 | 39.8 | 41.0 | 188.4 |
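The two baseline columns can be approximated with a short NumPy sketch. Everything here is an assumption made for illustration, not the evaluation code behind the table: "static" is taken to mean holding the most recent ground-truth pose, "linear" to mean extrapolating the last observed per-step velocity, and re-initialisation is taken to happen at steps 0, $S$, 2$S$, and so on. The function names, the `prev_frame` argument, and the angle wrapping are likewise hypothetical.

```python
import numpy as np


def anchor_index(t: int, S: int) -> int:
    """Most recent re-init step strictly before t (assumed at 0, S, 2S, ...)."""
    return ((t - 1) // S) * S


def baseline_rollout(gt: np.ndarray, prev_frame: np.ndarray, S: int, kind: str) -> np.ndarray:
    """Predict steps 1..T-1 of a (T, J) joint-angle clip, in degrees.

    gt         : ground-truth angles, row 0 is the observed starting pose
    prev_frame : pose one step before row 0 (only used by the linear baseline)
    S          : re-init interval in steps
    kind       : "static" (hold anchor pose) or "linear" (extrapolate velocity)
    """
    T = gt.shape[0]
    pred = np.empty_like(gt[1:])
    for t in range(1, T):
        a = anchor_index(t, S)
        if kind == "static":
            pred[t - 1] = gt[a]
        elif kind == "linear":
            before = prev_frame if a == 0 else gt[a - 1]
            pred[t - 1] = gt[a] + (t - a) * (gt[a] - before)
        else:
            raise ValueError(f"unknown baseline: {kind}")
    return pred


def joint_rmse_deg(pred: np.ndarray, target: np.ndarray) -> float:
    """RMSE over all future steps and joints, with differences wrapped to [-180, 180)."""
    diff = (pred - target + 180.0) % 360.0 - 180.0
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy clip: 16 steps, 22 joints, every joint moving at 1 degree per step.
gt = np.arange(16, dtype=float)[:, None] * np.ones((1, 22))
prev_frame = -np.ones(22)

for S in (3, 8, 16):
    static = joint_rmse_deg(baseline_rollout(gt, prev_frame, S, "static"), gt[1:])
    linear = joint_rmse_deg(baseline_rollout(gt, prev_frame, S, "linear"), gt[1:])
    print(f"S={S:2d}  static={static:.3f}  linear={linear:.3f}")
```

On this constant-velocity toy clip the linear baseline is exact and the static baseline degrades as $S$ grows. Real clips behave differently, which is why the linear column in the table grows so quickly with $S$.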
On the high-motion top-quartile subset at $S{=}16$, the model reaches 57.5° versus 60.7° for the static baseline, a 5.3% reduction in error. Compared to a 325K-clip ablation model trained on the unfiltered intermediate corpus, MIMIC lowers all-clip RMSE from 43.7° to 39.8° (-9%) and high-motion RMSE from 63.9° to 57.5° (-10%). It also widens the model-versus-static gap from 0.1° to 1.2° on all clips (12x larger) and from 1.7° to 3.2° on the high-motion subset.
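The percentages and gaps quoted above follow directly from the reported RMSE values. A minimal check, using only numbers from this section (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline: float, model: float) -> float:
    """Percentage by which `model` error is lower than `baseline` error."""
    return 100.0 * (baseline - model) / baseline


# High-motion subset at S=16: model vs static baseline.
high_motion_vs_static = relative_reduction(60.7, 57.5)

# MIMIC vs the 325K-clip ablation model.
all_clip_vs_ablation = relative_reduction(43.7, 39.8)
high_motion_vs_ablation = relative_reduction(63.9, 57.5)

# Model-versus-static gaps for MIMIC, in degrees.
gap_all = 41.0 - 39.8    # all clips, S=16 row of the table
gap_high = 60.7 - 57.5   # high-motion subset

# Ratio against the ablation model's 0.1 degree all-clip gap.
gap_ratio = gap_all / 0.1

print(f"{high_motion_vs_static:.1f}%  {all_clip_vs_ablation:.1f}%  {high_motion_vs_ablation:.1f}%")
print(f"gaps: {gap_all:.1f} deg (all), {gap_high:.1f} deg (high-motion), ratio {gap_ratio:.0f}x")
```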
Performance is bimodal across activity type. Cyclical or repeated motions (pull-ups, squats, jumping rope) are predicted much more accurately than long compositional sequences (cooking, multi-phase sports actions, dance routines). We read this as a data-coverage gap rather than a model-capacity ceiling: ingesting datasets with denser coverage of multi-step activities would likely close it.
A second model trained on movement-287 (286,890 clips, includes lower-motion classes) is also available and reaches the same long-horizon RMSE with sharper short-horizon predictions (step-1 median 2.5° vs 2.9°).
Usage
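import torch
import yaml
from training.vla_model import VLAModel, VLAConfig
# Build the model from the config file.
with open("config.yaml") as f:
    config = yaml.safe_load(f)
model = VLAModel(VLAConfig(**config["model_config"]))
# weights_only=False unpickles arbitrary objects, so only load checkpoints you trust.
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
state_dict = ckpt.get("model_state_dict", ckpt)
# Strip any "module." prefix, such as one left by a DataParallel/DDP wrapper.
state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}
# strict=False tolerates missing or unexpected keys instead of raising.
model.load_state_dict(state_dict, strict=False)
model.eval().cuda()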
See the GitHub repo for full inference and training code.
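Because the loading snippet above passes `strict=False`, key mismatches do not raise an error, so it can be worth inspecting the value that `load_state_dict` returns. The sketch below uses a tiny stand-in module (`TinyStandIn` and its layer names are hypothetical, not the real `VLAModel`) to show how the `missing_keys` and `unexpected_keys` fields surface such mismatches.

```python
import torch
from torch import nn


class TinyStandIn(nn.Module):
    """Hypothetical stand-in for the real model, used only for illustration."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)
        self.action_head = nn.Linear(4, 22)


model = TinyStandIn()

# Simulate a checkpoint saved from a wrapped model: keys carry a "module."
# prefix, one parameter is missing, and one stale key is left over.
raw = {f"module.{k}": v for k, v in TinyStandIn().state_dict().items()}
raw.pop("module.action_head.bias")
raw["module.old_head.weight"] = torch.zeros(1)

state_dict = {k.removeprefix("module."): v for k, v in raw.items()}

# load_state_dict returns a named tuple listing the mismatches it skipped.
result = model.load_state_dict(state_dict, strict=False)
missing = list(result.missing_keys)
unexpected = list(result.unexpected_keys)

print("missing keys:   ", missing)
print("unexpected keys:", unexpected)
```

If both lists are empty for the real checkpoint, `strict=False` made no difference. Otherwise the listed parameters keep their initial values or were ignored.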
Training
Flow matching loss on movement-strict-164. The vision encoder is frozen throughout; the Qwen3-VL-4B backbone uses LoRA (rank 128). The DiT action head is trained from scratch. Before training, every clip is re-tracked by a pipeline that combines multi-frame YOLO, a Qwen oracle, and a sticky IoU tracker, and is then judged by a 235B VLM. Only clips that pass both the deterministic continuity checks and the VLM verdict on tracking consistency and motion-label match are kept.
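For readers unfamiliar with flow matching, here is a minimal NumPy sketch of one common formulation: a straight-line path between Gaussian noise and the target action chunk, with the network regressing the constant velocity along that path. The card does not specify which variant, time sampling, or conditioning MIMIC uses, so the path definition, the uniform time sampling, and the names `flow_matching_loss` and `velocity_fn` are all assumptions.

```python
import numpy as np


def flow_matching_loss(velocity_fn, actions: np.ndarray, rng: np.random.Generator) -> float:
    """Mean-squared velocity error on a linear noise-to-action path.

    velocity_fn : callable (x_t, t) -> predicted velocity, same shape as x_t
                  (stands in for the DiT action head; conditioning omitted)
    actions     : target action chunks, shape (batch, horizon, dof)
    """
    noise = rng.standard_normal(actions.shape)
    # One interpolation time per sample, broadcast over horizon and joints.
    t = rng.uniform(size=(actions.shape[0], 1, 1))
    x_t = (1.0 - t) * noise + t * actions   # point on the straight path
    target_velocity = actions - noise       # d(x_t)/dt along that path
    predicted = velocity_fn(x_t, t)
    return float(np.mean((predicted - target_velocity) ** 2))


# Toy batch matching the card's shapes: 16-step horizon, 22 DoF.
batch = np.random.default_rng(0).standard_normal((8, 16, 22))

# A predictor that always outputs zero velocity gives a clearly non-zero loss.
zero_loss = flow_matching_loss(lambda x_t, t: np.zeros_like(x_t), batch, np.random.default_rng(1))

# An oracle that knows the targets recovers the true velocity, so its loss is ~0.
oracle_loss = flow_matching_loss(lambda x_t, t: (batch - x_t) / (1.0 - t), batch, np.random.default_rng(1))

print(f"zero predictor: {zero_loss:.4f}   oracle: {oracle_loss:.2e}")
```

At inference time, a model trained this way would typically integrate the predicted velocity from noise at $t{=}0$ to an action chunk at $t{=}1$. The number of integration steps is not stated in this card.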
Citation
Paper forthcoming.
License
Apache 2.0