# MiniVLA Angle Selector

A Vision-Language-Action (VLA) model for drone angle prediction. Given a forward-facing drone camera image and a navigation prompt (e.g., "Navigate to the red cube"), the model predicts a flight direction as one of 36 discrete angles (0-350 degrees, in 10-degree increments).
## Architecture
- Vision Backbone: DINOv2 + SigLIP fused @ 224px
- LLM Backbone: Qwen2.5 0.5B
- Projector: FusedGeLU MLP (no-align, single-stage)
- Total Parameters: ~1.26B (vision + projector + LLM)
- Inference VRAM: ~2.5 GB (bf16)
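
As a rough sanity check on the VRAM figure: 1.26B parameters at 2 bytes each (bf16) account for about 2.35 GB of weights alone, with the remainder of the ~2.5 GB coming from activations and buffers. A back-of-the-envelope sketch:

```python
# Rough bf16 weight-memory estimate (parameters only; excludes
# activations, KV cache, and framework overhead)
params = 1.26e9          # total parameters (vision + projector + LLM)
bytes_per_param = 2      # bf16 = 16 bits
weight_gb = params * bytes_per_param / 1024**3
print(f"~{weight_gb:.2f} GB")  # -> ~2.35 GB
```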
## Training
- VLM Pretraining: Single-stage training on LLaVA 665k dataset (projector + LLM jointly, no separate alignment stage)
- Angle Fine-tuning: LoRA (r=16, alpha=32) + unfrozen embeddings on 21k drone navigation samples
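
The fine-tuning hyperparameters above correspond to a PEFT-style LoRA configuration. A minimal sketch using Hugging Face `peft`; note the `target_modules` are an illustrative assumption for Qwen2.5 attention projections, not taken from the actual training recipe:

```python
from peft import LoraConfig

# r and lora_alpha come from the recipe above; target_modules are an
# illustrative guess, not confirmed against the training code.
lora_config = LoraConfig(
    r=16,            # low-rank dimension
    lora_alpha=32,   # scaling factor (effective scale = alpha / r = 2.0)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
```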
## Performance
| Metric | Value |
|---|---|
| Val Accuracy (exact match) | 80.0% |
| Val Angular Error | 3.2 degrees |
| Angle Bins | 36 (10-degree steps) |
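
The angular-error metric treats the action space as circular, so the error between two angles is the shortest wraparound distance. A minimal sketch of how such a metric can be computed (an assumption about the evaluation code, which is not shown here):

```python
def angular_error(pred_deg, true_deg):
    """Shortest distance between two angles on a 360-degree circle."""
    diff = abs(pred_deg - true_deg) % 360
    return min(diff, 360 - diff)

# Wraparound: 350 vs 0 degrees is a 10-degree error, not 350
print(angular_error(350, 0))   # -> 10
print(angular_error(90, 120))  # -> 30
```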
## Usage
### With Prismatic (openvla-mini)
```python
from prismatic import load

vlm = load("path/to/minivla-angle-selector")
```
### Standalone
```python
import torch
from PIL import Image
from prismatic.models.materialize import (
    get_vision_backbone_and_transform,
    get_llm_backbone_and_tokenizer,
    get_vlm,
)

# Build model
vision_backbone, _ = get_vision_backbone_and_transform(
    "dinosiglip-vit-so-224px", "resize-naive", image_sequence_len=1
)
llm_backbone, tokenizer = get_llm_backbone_and_tokenizer(
    "qwen25-0_5b-pure", llm_max_length=2048, inference_mode=True,
)
vlm = get_vlm(
    model_id="minivla-angle-selector",
    arch_specifier="no-align+fused-gelu-mlp",
    vision_backbone=vision_backbone,
    llm_backbone=llm_backbone,
)

# Load weights
ckpt = torch.load("checkpoints/latest-checkpoint.pt", map_location="cpu")["model"]
vlm.projector.load_state_dict(ckpt["projector"])
vlm.llm_backbone.load_state_dict(ckpt["llm_backbone"])
vlm.vision_backbone.load_state_dict(ckpt["vision_backbone"])
vlm.to("cuda", dtype=torch.bfloat16)
vlm.eval()

# Build the prompt
image = Image.open("drone_view.png")
prompt_builder = vlm.get_prompt_builder()
prompt_builder.add_turn("human", "Navigate the drone to the red cube")
input_prompt = prompt_builder.get_prompt()

tok = vlm.llm_backbone.tokenizer
input_ids = tok(input_prompt, return_tensors="pt").input_ids.to("cuda")

# Preprocess the image. Fused DINOv2+SigLIP transforms may return a dict
# of tensors (one per sub-backbone), so batch each entry accordingly.
pixel_values = vlm.vision_backbone.get_image_transform()(image)
if isinstance(pixel_values, dict):
    pixel_values = {
        k: v[None, ...].to("cuda", dtype=torch.bfloat16)
        for k, v in pixel_values.items()
    }
else:
    pixel_values = pixel_values[None, ...].to("cuda", dtype=torch.bfloat16)

with torch.no_grad():
    output = vlm.forward(input_ids=input_ids, pixel_values=pixel_values, return_dict=True)

# Take the logits of the final text token (image patch embeddings
# precede the text tokens in the sequence)
num_patches = vlm.vision_backbone.num_patches
action_logit = output.logits[0, num_patches:, :][-1, :]
token_id = action_logit.argmax().item()

# Convert token to angle
vocab_size = len(tok)
angle_code = (vocab_size - 1 - token_id) % 36
angle_degrees = angle_code * 10
print(f"Predicted angle: {angle_degrees} degrees")
```
## Action Space
| Code | Angle | Direction |
|---|---|---|
| 0 | 0 deg | +X (right) |
| 9 | 90 deg | +Y (forward) |
| 18 | 180 deg | -X (left) |
| 27 | 270 deg | -Y (backward) |
Token mapping: `token_id = vocab_size - 1 - angle_code`, i.e., the 36 angle codes occupy the last 36 token ids of the vocabulary. (The table above shows the four cardinal directions; all 36 codes follow the same pattern.)
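
The mapping is trivially invertible, so a predicted token id can be decoded back to an angle. A small helper pair; the vocab size below is a placeholder for illustration (use `len(tokenizer)` with the real model):

```python
def angle_code_to_token(angle_code, vocab_size):
    """Angle codes occupy the last 36 token ids of the vocabulary."""
    return vocab_size - 1 - angle_code

def token_to_angle_degrees(token_id, vocab_size):
    """Invert the token mapping and convert the angle code to degrees."""
    angle_code = (vocab_size - 1 - token_id) % 36
    return angle_code * 10

VOCAB_SIZE = 151_936  # placeholder value, not the model's actual vocab size

token = angle_code_to_token(9, VOCAB_SIZE)  # code 9 = 90 deg (+Y, forward)
print(token_to_angle_degrees(token, VOCAB_SIZE))  # -> 90
```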