MIMIC: Motion Imitation from Massive Internet Clips

A 4.0B-parameter vision-language-action model for full-body humanoid control, trained entirely from internet-scale human video.

Model Details

Architecture: Qwen3-VL-4B (early exit at layer 18) + 24L/1536D DiT action head
Parameters: ~4.0B total (2.2B truncated LLM + 415M vision encoder + 1.28B DiT + 132M LoRA)
Action space: 22-DoF joint angles at 10Hz
Action horizon: 16 steps (1.6s)
Training data: movement-strict-164 (164,390 clips, ~1.9M samples), produced by re-tracking and VLM-judging the raw output of our two-stage Kinetics-700 processing pipeline.
Best validation loss: 0.1097
Training compute: 4 x RTX Pro Blackwell, ~5.9 days (~566 GPU-hours)
Checkpoint step: 29,060
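
As a quick consistency check on the figures above, the following sketch recomputes the parameter total, the action-chunk size, the horizon length, and the GPU-hour count. The constant names are illustrative only and are not part of the released code.

```python
# Figures copied from the Model Details list; names are hypothetical.
PARAMS_B = {
    "truncated_llm": 2.2,
    "vision_encoder": 0.415,
    "dit_action_head": 1.28,
    "lora": 0.132,
}
ACTION_DOF = 22       # joint angles per step
CONTROL_HZ = 10       # action rate
HORIZON_STEPS = 16    # steps per predicted chunk
NUM_GPUS = 4
TRAIN_DAYS = 5.9

total_params_b = sum(PARAMS_B.values())
horizon_seconds = HORIZON_STEPS / CONTROL_HZ
values_per_chunk = HORIZON_STEPS * ACTION_DOF
gpu_hours = NUM_GPUS * TRAIN_DAYS * 24

print(f"total parameters: {total_params_b:.2f}B")
print(f"horizon: {horizon_seconds:.1f}s, {values_per_chunk} joint values per chunk")
print(f"training compute: {gpu_hours:.0f} GPU-hours")
```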

Held-out evaluation

Evaluated on 500 clips sampled from the held-out validation split. The metric is joint-angle RMSE in degrees over future frames only ($t{=}1..15$, 1.5s of prediction); predictions are re-initialized from the ground-truth state every $S$ timesteps.

Re-init interval $S$ (steps)	MIMIC (this model)	Static baseline	Linear baseline
3	23.8	22.1	39.4
8	32.7	32.3	95.0
16	39.8	41.0	188.4
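
For readers reproducing the table, here is a minimal sketch of how the two baselines and the rolling re-init could be computed. It assumes the static baseline holds the most recent ground-truth pose and the linear baseline extrapolates the velocity between the last two ground-truth frames (zero velocity when no earlier frame exists). The card does not specify these details, the function names are hypothetical, and angle wrap-around is ignored.

```python
import numpy as np

def rolling_baseline(gt, S, mode):
    """Predict frames 1..T-1 of `gt` (T, D), re-initializing from gt every S steps.

    Frame 0 is treated as the observed state. `mode` is "static" or "linear".
    """
    T = gt.shape[0]
    pred = np.empty_like(gt[1:])
    for t in range(1, T):
        anchor = ((t - 1) // S) * S   # most recent ground-truth re-init frame
        k = t - anchor                # steps elapsed since re-init (1..S)
        if mode == "static":
            pred[t - 1] = gt[anchor]
        elif mode == "linear":
            # Assumed: velocity from the two latest ground-truth frames.
            prev = gt[anchor - 1] if anchor > 0 else gt[anchor]
            pred[t - 1] = gt[anchor] + k * (gt[anchor] - prev)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return pred

def rmse_deg(pred, target):
    """Root-mean-square error over all frames and joints."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

# Synthetic random-walk trajectory standing in for a real clip (16 frames, 22 DoF).
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(scale=2.0, size=(16, 22)), axis=0)
for S in (3, 8, 16):
    static = rmse_deg(rolling_baseline(gt, S, "static"), gt[1:])
    linear = rmse_deg(rolling_baseline(gt, S, "linear"), gt[1:])
    print(f"S={S:2d}  static={static:.1f}  linear={linear:.1f}")
```

A model's predictions would be scored with the same `rmse_deg` call against `gt[1:]`, so all three columns share one metric.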

On the high-motion top-quartile subset at $S{=}16$, MIMIC reaches 57.5° versus 60.7° for the static baseline, a 5.3% reduction in error. Compared to a 325K-clip ablation model trained on the unfiltered intermediate corpus, MIMIC lowers all-clip RMSE from 43.7° to 39.8° (-9%) and high-motion RMSE from 63.9° to 57.5° (-10%). It also widens the model-versus-static gap from 0.1° to 1.2° on all clips (12x larger) and from 1.7° to 3.2° on the high-motion subset.
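
The percentages quoted above follow directly from the RMSE values. A short sketch of the arithmetic, with a hypothetical helper name:

```python
def relative_reduction(baseline, value):
    """Fractional error reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline

# RMSE values in degrees, copied from the text and table above.
high_motion_vs_static = relative_reduction(60.7, 57.5)
all_clip_vs_ablation = relative_reduction(43.7, 39.8)
high_motion_vs_ablation = relative_reduction(63.9, 57.5)

# Model-versus-static gap on all clips at S=16, and its growth over the ablation's 0.1 degree gap.
gap_all_clips = 41.0 - 39.8
gap_ratio = gap_all_clips / 0.1

print(f"high-motion vs static:   {high_motion_vs_static:.1%}")
print(f"all-clip vs ablation:    {all_clip_vs_ablation:.1%}")
print(f"high-motion vs ablation: {high_motion_vs_ablation:.1%}")
print(f"gap ratio:               {gap_ratio:.0f}x")
```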

Performance is bimodal across activity types. Cyclical or repeated motions (pull-ups, squats, jumping rope) are predicted much more accurately than long compositional sequences (cooking, multi-phase sports actions, dance routines). We read this as a data-coverage gap rather than a model-capacity ceiling: ingesting datasets with denser coverage of multi-step activities would likely close it.

A second model, trained on movement-287 (286,890 clips, including lower-motion classes), is also available. It reaches the same long-horizon RMSE with sharper short-horizon predictions (step-1 median error 2.5° vs 2.9°).

Usage

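import torch
import yaml
from training.vla_model import VLAModel, VLAConfig

# Build the model from config.yaml.
with open("config.yaml") as f:
    config = yaml.safe_load(f)
model = VLAModel(VLAConfig(**config["model_config"]))

# weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
ckpt = torch.load("checkpoint.pth", map_location="cpu", weights_only=False)
state_dict = ckpt.get("model_state_dict", ckpt)
# Strip the "module." prefix left by DataParallel/DistributedDataParallel wrappers.
state_dict = {k.removeprefix("module."): v for k, v in state_dict.items()}
# strict=False tolerates key mismatches, so report them rather than ignore them silently.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
model.eval().cuda()  # requires a CUDA-capable GPU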

See the GitHub repo for full inference and training code.

Training

Training minimizes a flow matching loss on movement-strict-164. The vision encoder is frozen throughout; the Qwen3-VL-4B backbone uses LoRA (rank 128). The DiT action head is trained from scratch. Before judgment by a 235B VLM, the training set is re-tracked through a pipeline that combines multi-frame YOLO detection, a Qwen oracle, and a sticky IoU tracker. It is then filtered to clips that pass both the deterministic continuity checks and the VLM verdict on tracking consistency and motion-label match.
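
To make the objective concrete, here is a minimal NumPy sketch of one common flow matching formulation. It assumes a straight-line path from Gaussian noise (at t=0) to the target action chunk (at t=1), uniformly sampled t, and a simple Euler sampler at inference. The card only states that a flow matching loss is used, so the path, time sampling, sampler, and all names below are assumptions rather than the released implementation; `velocity_fn` stands in for the DiT action head.

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions, cond, rng):
    """MSE between predicted and target velocity on a straight noise-to-data path.

    actions: (B, H, D) target action chunks, e.g. H=16 steps and D=22 joints.
    cond:    conditioning passed through to velocity_fn unchanged.
    """
    batch = actions.shape[0]
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(batch, 1, 1))        # one time per sample, in [0, 1)
    x_t = (1.0 - t) * noise + t * actions      # point on the straight path
    target_v = actions - noise                 # constant velocity along that path
    pred_v = velocity_fn(x_t, t, cond)
    return float(np.mean((pred_v - target_v) ** 2))

def sample_actions(velocity_fn, cond, shape, rng, num_steps=10):
    """Integrate the learned velocity field from noise to actions with Euler steps."""
    x = rng.standard_normal(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full((shape[0], 1, 1), i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x

# Toy check with an untrained stand-in that always predicts zero velocity.
rng = np.random.default_rng(0)
actions = rng.standard_normal((8, 16, 22))
zero_velocity = lambda x, t, cond: np.zeros_like(x)
print("loss with zero predictor:", flow_matching_loss(zero_velocity, actions, None, rng))
```

In this formulation the network never regresses actions directly; it learns a velocity field, and a full action chunk is produced by integrating that field over a handful of steps.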

Citation

Paper forthcoming.

License

Apache 2.0
