MiniVLA Angle Selector

A Vision-Language-Action (VLA) model for drone angle prediction. Given a forward-facing drone camera image and a navigation prompt (e.g., "Navigate to the red cube"), it predicts a flight direction as one of 36 discrete angles (0-350 degrees in 10-degree increments).

Architecture

  • Vision Backbone: DINOv2 + SigLIP fused @ 224px
  • LLM Backbone: Qwen2.5 0.5B
  • Projector: FusedGeLU MLP (no-align, single-stage)
  • Total Parameters: ~1.26B (vision + projector + LLM)
  • Inference VRAM: ~2.5 GB (bf16)
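
If you want to sanity-check the parameter count, a small helper can be run against the assembled model from the Standalone section below (a sketch; count_params_billions is not part of the repo):

import torch

def count_params_billions(model: torch.nn.Module) -> float:
    # Sum over all parameters; the assembled VLM should report roughly 1.26
    return sum(p.numel() for p in model.parameters()) / 1e9

Calling count_params_billions(vlm) on the model built below should print a value near 1.26.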

Training

  1. VLM Pretraining: Single-stage training on LLaVA 665k dataset (projector + LLM jointly, no separate alignment stage)
  2. Angle Fine-tuning: LoRA (r=16, alpha=32) + unfrozen embeddings on 21k drone navigation samples
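
For reference, the LoRA stage can be approximated with Hugging Face peft as below; only r=16 and alpha=32 come from the recipe above, while the base model id, target modules, and the modules kept fully trainable are assumptions:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# r and lora_alpha match the recipe above; target_modules and
# modules_to_save (the unfrozen embeddings) are assumptions
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
llm = get_peft_model(llm, lora_config)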

Performance

Metric                      Value
Val Accuracy (exact match)  80.0%
Val Angular Error           3.2 degrees
Angle Bins                  36 (10-degree steps)
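
The angular-error metric is presumably circular distance on the 0-360 degree circle, so that 350 and 0 degrees differ by 10 degrees rather than 350; a minimal sketch of that reading:

import numpy as np

def angular_error(pred_deg: np.ndarray, true_deg: np.ndarray) -> np.ndarray:
    # Circular distance: wrap the absolute difference into [0, 180]
    diff = np.abs(pred_deg - true_deg) % 360
    return np.minimum(diff, 360 - diff)

print(angular_error(np.array([350.0]), np.array([0.0])))  # [10.]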

Usage

With Prismatic (openvla-mini)

from prismatic import load

vlm = load("path/to/minivla-angle-selector")

Standalone

from prismatic.models.materialize import (
    get_vision_backbone_and_transform,
    get_llm_backbone_and_tokenizer,
    get_vlm,
)
import torch

# Build model
vision_backbone, _ = get_vision_backbone_and_transform(
    "dinosiglip-vit-so-224px", "resize-naive", image_sequence_len=1
)
llm_backbone, tokenizer = get_llm_backbone_and_tokenizer(
    "qwen25-0_5b-pure", llm_max_length=2048, inference_mode=True,
)
vlm = get_vlm(
    model_id="minivla-angle-selector",
    arch_specifier="no-align+fused-gelu-mlp",
    vision_backbone=vision_backbone,
    llm_backbone=llm_backbone,
)

# Load weights (prismatic checkpoints store per-module state dicts under the "model" key)
ckpt = torch.load("checkpoints/latest-checkpoint.pt", map_location="cpu")["model"]
vlm.projector.load_state_dict(ckpt["projector"])
vlm.llm_backbone.load_state_dict(ckpt["llm_backbone"])
vlm.vision_backbone.load_state_dict(ckpt["vision_backbone"])
vlm.to("cuda", dtype=torch.bfloat16)
vlm.eval()

# Predict angle from image
from PIL import Image
image = Image.open("drone_view.png").convert("RGB")
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn("human", "Navigate the drone to the red cube")
input_prompt = prompt_builder.get_prompt()

tok = vlm.llm_backbone.tokenizer
input_ids = tok(input_prompt, return_tensors="pt").input_ids.to("cuda")
# The fused DINOv2 + SigLIP transform returns a dict with one tensor per backbone
pixel_values = vlm.vision_backbone.get_image_transform()(image)
pixel_values = {k: v[None, ...].to("cuda", dtype=torch.bfloat16) for k, v in pixel_values.items()}

with torch.no_grad():
    output = vlm(input_ids=input_ids, pixel_values=pixel_values, return_dict=True)
    # Logits at the final sequence position score the next token, i.e. the angle token
    action_logits = output.logits[0, -1, :]
    token_id = action_logits.argmax().item()

# Convert token id to angle: angle tokens occupy the last 36 ids of the vocabulary
vocab_size = len(tok)
angle_code = (vocab_size - 1 - token_id) % 36
angle_degrees = angle_code * 10
print(f"Predicted angle: {angle_degrees} degrees")

Action Space

Code  Angle    Direction
0     0 deg    +X (right)
9     90 deg   +Y (forward)
18    180 deg  -X (left)
27    270 deg  -Y (backward)

Codes run 0-35 (angle = code × 10 degrees); only the four cardinal directions are shown above.

Token mapping: token_id = vocab_size - 1 - angle_code
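
A round-trip sketch of this mapping (the angle tokens occupy the last 36 vocabulary ids; the concrete vocab size below is a placeholder, use len(tokenizer) in practice):

def angle_to_token(angle_code: int, vocab_size: int) -> int:
    # Code 0 maps to the last vocabulary id, code 35 to the 36th from last
    return vocab_size - 1 - angle_code

def token_to_angle(token_id: int, vocab_size: int) -> int:
    return (vocab_size - 1 - token_id) % 36

VOCAB_SIZE = 151_936  # placeholder; use len(tokenizer) in practice
for code in range(36):
    assert token_to_angle(angle_to_token(code, VOCAB_SIZE), VOCAB_SIZE) == code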
