VINE: Video Understanding with Natural Language

VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.

🚀 One-Command Setup

wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh

That's it! This single script installs everything you need:

  • ✅ Python environment with all dependencies
  • ✅ SAM2 and GroundingDINO packages
  • ✅ All model checkpoints (~800 MB)
  • ✅ VINE model from HuggingFace (~1.8 GB)

Total time: 10-15 minutes | Total size: ~2.6 GB

See QUICKSTART.md for detailed instructions.

Quick Example

from transformers import AutoModel
from vine_hf import VinePipeline
from pathlib import Path

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)

print(results['summary'])
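
Beyond the printed summary, the returned dictionary can be inspected directly; a minimal sketch, assuming the keys documented in the Output Format section below:

summary = results['summary']
print(f"Objects detected: {summary['num_objects_detected']}")
for category, prob in summary['top_categories']:
    print(f"  {category}: {prob:.2f}")

# Per-object category rankings: object_id -> [(probability, category), ...]
for object_id, ranked in results['categorical_predictions'].items():
    best_prob, best_category = ranked[0]
    print(f"object {object_id}: {best_category} ({best_prob:.2f})")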

Features

  • Categorical Classification: Classify objects in videos (e.g., "human", "dog", "frisbee")
  • Unary Predicates: Detect actions on single objects (e.g., "running", "jumping", "sitting")
  • Binary Relations: Detect relationships between object pairs (e.g., "behind", "chasing")
  • Multi-Modal: Combines CLIP vision-language features with text-prompted detection and segmentation (GroundingDINO + SAM2)
  • Visualizations: Optional annotated video outputs

Architecture

VINE uses a modular architecture:

HuggingFace Hub (video-fm/vine)
├── VINE model weights (~1.8 GB)
│   ├── Categorical CLIP (object classification)
│   ├── Unary CLIP (single-object actions)
│   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
├── Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)

This separation allows:

  • ✅ Lightweight model distribution
  • ✅ User control over checkpoint versions
  • ✅ Flexible deployment options
  • ✅ Standard HuggingFace practices
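
For example, the VINE weights can be fetched on their own with the standard huggingface_hub API, independent of the setup script; a minimal sketch:

from huggingface_hub import snapshot_download

# Downloads only the VINE weights and architecture files (~1.8 GB);
# SAM2 and GroundingDINO checkpoints remain under the user's control.
local_dir = snapshot_download(repo_id="video-fm/vine")
print(f"Model files cached at: {local_dir}")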

What the Setup Script Does

# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
#    - video-sam2 (SAM2 package)
#    - GroundingDINO (object detection)
#    - LASER (video utilities)
#    - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
#    - sam2_hiera_tiny.pt (~149 MB)
#    - groundingdino_swint_ogc.pth (~662 MB)
#    - Config files
# 6. Tests the installation
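
To repeat the final installation test by hand, a minimal check (the import names below match the packages listed above; adjust if yours differ):

import torch
print("CUDA available:", torch.cuda.is_available())

# Quick import check for the packages installed by the setup script
for name in ("laser", "sam2", "groundingdino", "vine_hf"):
    try:
        __import__(name)
        print(f"{name}: OK")
    except ImportError as exc:
        print(f"{name}: MISSING ({exc})")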

Manual Installation

If you prefer manual installation or need to customize:

1. Create Environment

conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126

2. Install Dependencies

pip install transformers huggingface-hub safetensors opencv-python pillow

3. Clone and Install Packages

git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

cd GroundingDINO && python setup.py build_ext --inplace && cd ..  # builds GroundingDINO's C++/CUDA extension in place

4. Download Checkpoints

mkdir checkpoints && cd checkpoints

# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml

# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
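
Before running the pipeline, it is worth confirming the four files landed where the Quick Example expects them; a small check (sizes are approximate):

from pathlib import Path

# Files downloaded above, with approximate sizes in MB (0 = small config file).
expected = {
    "sam2_hiera_tiny.pt": 149,
    "sam2_hiera_t.yaml": 0,
    "groundingdino_swint_ogc.pth": 662,
    "GroundingDINO_SwinT_OGC.py": 0,
}
for name, approx_mb in expected.items():
    path = Path("checkpoints") / name
    if path.exists():
        print(f"{name}: {path.stat().st_size / 1e6:.0f} MB (expected ~{approx_mb} MB)")
    else:
        print(f"MISSING: {path}")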

Output Format

{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
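
The tuple keys make it straightforward to walk predictions frame by frame; a minimal sketch, assuming the structure shown above:

# Top-ranked action per (frame, object)
for (frame_id, object_id), ranked in results['unary_predictions'].items():
    prob, action = ranked[0]
    print(f"frame {frame_id}, object {object_id}: {action} ({prob:.2f})")

# Top-ranked relation per (frame, object pair)
for (frame_id, (obj1_id, obj2_id)), ranked in results['binary_predictions'].items():
    prob, relation = ranked[0]
    print(f"frame {frame_id}: {obj1_id} {relation} {obj2_id} ({prob:.2f})")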

Advanced Usage

Custom Segmentation

# Use your own masks and bounding boxes
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)

SAM2 Only (No GroundingDINO)

config = VineConfig(
    segmentation_method="sam2",  # Uses SAM2 automatic mask generation
    ...
)

Enable Visualizations

results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # Creates annotated video
    return_top_k=5
)

# Access annotated video
video_path = results['visualizations']['vine']['all']['video_path']

Configuration

from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",   # or "sam2"
    box_threshold=0.35,                          # Detection threshold
    text_threshold=0.25,                         # Text matching threshold
    target_fps=5,                                # Video sampling rate
    visualize=True,                              # Enable visualizations
    visualization_dir="outputs/",                # Output directory
    device="cuda:0"                              # Device
)
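
As a rough illustration of what target_fps controls, the number of frames VINE will sample can be estimated from the source video's metadata; a sketch using OpenCV (installed as a dependency), not part of the VINE API:

import cv2

cap = cv2.VideoCapture("video.mp4")
src_fps = cap.get(cv2.CAP_PROP_FPS)
frame_count = cap.get(cv2.CAP_PROP_FRAME_COUNT)
cap.release()

duration_s = frame_count / src_fps
print(f"~{int(duration_s * 5)} frames sampled at target_fps=5 "
      f"({duration_s:.1f}s source at {src_fps:.0f} fps)")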

System Requirements

  • OS: Linux (Ubuntu 20.04+)
  • Python: 3.10+
  • CUDA: 11.8+ (for GPU)
  • GPU: 8GB+ VRAM (T4, V100, A100)
  • RAM: 16GB+
  • Disk: ~5GB free
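
A quick pre-flight check against these requirements (GPU and free disk space only):

import shutil
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1e9:.1f} GB (need 8+ GB)")
else:
    print("No CUDA GPU detected")

free_gb = shutil.disk_usage(".").free / 1e9
print(f"Free disk: {free_gb:.1f} GB (need ~5 GB)")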

Troubleshooting

CUDA Not Available

import torch
print(torch.cuda.is_available())  # Should be True

Import Errors

conda activate vine_demo
pip list | grep -E "laser|sam2|groundingdino"

Checkpoint Not Found

ls -lh checkpoints/
# Should show: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth

See QUICKSTART.md for detailed troubleshooting.

Example Applications

Sports Analysis

results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)

Surveillance

results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)

Animal Behavior

results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
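
The same pipeline object can be reused across clips, so any of these applications can be batched over a folder of videos; a minimal sketch (directory and keywords are placeholders):

from pathlib import Path

summaries = {}
for video_path in sorted(Path("clips").glob("*.mp4")):
    results = pipeline(
        str(video_path),
        categorical_keywords=['person', 'dog', 'ball'],
        unary_keywords=['running', 'jumping'],
        binary_keywords=['chasing', 'next to'],
        return_top_k=5
    )
    summaries[video_path.name] = results['summary']

for name, summary in summaries.items():
    print(name, summary['top_categories'][:3])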

Deployment

Gradio Demo

import gradio as gr

def analyze_video(video, categories, actions, relations):
    # Split comma-separated keyword strings and strip stray whitespace
    results = pipeline(
        video,
        categorical_keywords=[k.strip() for k in categories.split(',')],
        unary_keywords=[k.strip() for k in actions.split(',')],
        binary_keywords=[k.strip() for k in relations.split(',')]
    )
    return results['summary']

gr.Interface(analyze_video, ...).launch()

FastAPI Server

from fastapi import FastAPI

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)

@app.post("/analyze")
async def analyze(video_path: str, keywords: dict):
    return pipeline(video_path, **keywords)
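
To serve and query this endpoint, the usual FastAPI workflow applies; a sketch assuming the code above lives in app.py (video_path is a query parameter, keywords the JSON body):

# Start the server:
#   uvicorn app:app --host 0.0.0.0 --port 8000

# Query it, e.g. with the requests library:
import requests

response = requests.post(
    "http://localhost:8000/analyze",
    params={"video_path": "video.mp4"},
    json={
        "categorical_keywords": ["person", "dog"],
        "unary_keywords": ["running"],
        "binary_keywords": ["chasing"]
    }
)
print(response.json())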

Files in This Repository

  • setup_vine_complete.sh - One-command setup script
  • QUICKSTART.md - Quick start guide
  • README.md - This file (complete documentation)
  • vine_config.py - VineConfig class
  • vine_model.py - VineModel class
  • vine_pipeline.py - VinePipeline class
  • flattening.py - Segment processing utilities
  • vis_utils.py - Visualization utilities

Citation

@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}

License

This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

Made with ❤️ by the LASER team
