# VINE: Video Understanding with Natural Language
VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
## One-Command Setup

```bash
wget https://huggingface.co/video-fm/vine/resolve/main/setup_vine_complete.sh
bash setup_vine_complete.sh
```
That's it! This single script installs everything you need:
- Python environment with all dependencies
- SAM2 and GroundingDINO packages
- All model checkpoints (~800 MB)
- VINE model from HuggingFace (~1.8 GB)
Total time: 10-15 minutes | Total size: ~2.6 GB
See QUICKSTART.md for detailed instructions.
## Quick Example
```python
from pathlib import Path

from transformers import AutoModel
from vine_hf import VinePipeline

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline (checkpoints downloaded by setup script)
checkpoint_dir = Path("checkpoints")
pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(checkpoint_dir / "sam2_hiera_t.yaml"),
    sam_checkpoint_path=str(checkpoint_dir / "sam2_hiera_tiny.pt"),
    gd_config_path=str(checkpoint_dir / "GroundingDINO_SwinT_OGC.py"),
    gd_checkpoint_path=str(checkpoint_dir / "groundingdino_swint_ogc.pth"),
    device="cuda",
    trust_remote_code=True
)

# Process video
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'next to'],
    return_top_k=5
)
print(results['summary'])
```
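The returned `summary` aggregates the per-object results. A quick way to inspect it (a sketch that assumes the key layout documented under Output Format below):

```python
# Print the top detected categories from the summary
# (assumes "top_categories" holds (category, probability) pairs).
for category, prob in results['summary']['top_categories']:
    print(f"{category}: {prob:.2f}")
```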
## Features
- **Categorical Classification**: Classify objects in videos (e.g., "human", "dog", "frisbee")
- **Unary Predicates**: Detect actions on single objects (e.g., "running", "jumping", "sitting")
- **Binary Relations**: Detect relationships between object pairs (e.g., "behind", "chasing")
- **Multi-Modal**: Combines vision (CLIP) with text-prompted segmentation (GroundingDINO + SAM2)
- **Visualizations**: Optional annotated video outputs
## Architecture
VINE uses a modular architecture:
```
HuggingFace Hub (video-fm/vine)
├── VINE model weights (~1.8 GB)
│   ├── Categorical CLIP (object classification)
│   ├── Unary CLIP (single-object actions)
│   └── Binary CLIP (object relationships)
└── Architecture files

User Environment (via setup script)
├── Dependencies: laser, sam2, groundingdino
└── Checkpoints: SAM2 (~149 MB), GroundingDINO (~662 MB)
```
This separation allows:
- Lightweight model distribution
- User control over checkpoint versions
- Flexible deployment options
- Standard HuggingFace practices
## What the Setup Script Does
```bash
# 1. Creates conda environment (vine_demo)
# 2. Installs PyTorch with CUDA
# 3. Clones repositories:
#    - video-sam2 (SAM2 package)
#    - GroundingDINO (object detection)
#    - LASER (video utilities)
#    - vine_hf (VINE interface)
# 4. Installs packages in editable mode
# 5. Downloads model checkpoints:
#    - sam2_hiera_tiny.pt (~149 MB)
#    - groundingdino_swint_ogc.pth (~662 MB)
#    - Config files
# 6. Tests the installation
```
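If step 6 fails, you can re-run a similar sanity check by hand (a sketch; the top-level module names are assumed to match the packages listed under Architecture above):

```python
# Minimal import sanity check (module names are assumptions based on
# the dependencies the script installs: laser, sam2, groundingdino, vine_hf).
import torch
import laser
import sam2
import groundingdino
import vine_hf

print("CUDA available:", torch.cuda.is_available())
```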
## Manual Installation
If you prefer manual installation or need to customize:
### 1. Create Environment

```bash
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
```
### 2. Install Dependencies

```bash
pip install transformers huggingface-hub safetensors opencv-python pillow
```
### 3. Clone and Install Packages

```bash
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

cd GroundingDINO && python setup.py build_ext --inplace && cd ..
```
### 4. Download Checkpoints

```bash
mkdir checkpoints && cd checkpoints

# SAM2
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml -O sam2_hiera_t.yaml

# GroundingDINO
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```
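To confirm the downloads landed where `VinePipeline` expects them, you can check the four files referenced in the Quick Example (a sketch using only paths from this page):

```python
# Verify the checkpoint and config files VinePipeline is pointed at.
from pathlib import Path

checkpoint_dir = Path("checkpoints")
for name in [
    "sam2_hiera_tiny.pt",
    "sam2_hiera_t.yaml",
    "groundingdino_swint_ogc.pth",
    "GroundingDINO_SwinT_OGC.py",
]:
    path = checkpoint_dir / name
    size = f"{path.stat().st_size / 1e6:.0f} MB" if path.exists() else "MISSING"
    print(f"{name}: {size}")
```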
## Output Format
```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    }
}
```
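As a sketch of consuming this structure with the `results` dict returned by the pipeline (it assumes each prediction list is sorted best-first, consistent with the `return_top_k` semantics):

```python
# Best category per detected object (assumes lists are sorted best-first).
for obj_id, preds in results['categorical_predictions'].items():
    prob, category = preds[0]
    print(f"object {obj_id}: {category} ({prob:.2f})")

# Per-frame unary actions.
for (frame_id, obj_id), preds in results['unary_predictions'].items():
    prob, action = preds[0]
    print(f"frame {frame_id}, object {obj_id}: {action} ({prob:.2f})")
```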
## Advanced Usage
### Custom Segmentation
```python
# Supply your own masks and bounding boxes instead of running the
# built-in GroundingDINO + SAM2 segmentation
results = model.predict(
    video_frames=frames,
    masks=your_masks,
    bboxes=your_bboxes,
    categorical_keywords=['person', 'dog'],
    unary_keywords=['running'],
    binary_keywords=['chasing']
)
```
### SAM2 Only (No GroundingDINO)
```python
config = VineConfig(
    segmentation_method="sam2",  # uses SAM2 automatic mask generation
    ...
)
```

See the Configuration section below for the full parameter list.
### Enable Visualizations
```python
results = pipeline(
    'video.mp4',
    categorical_keywords=['person', 'dog'],
    include_visualizations=True,  # creates annotated video
    return_top_k=5
)

# Access the annotated video
video_path = results['visualizations']['vine']['all']['video_path']
```
## Configuration
```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,            # detection threshold
    text_threshold=0.25,           # text matching threshold
    target_fps=5,                  # video sampling rate
    visualize=True,                # enable visualizations
    visualization_dir="outputs/",  # output directory
    device="cuda:0"                # device
)
```
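For illustration, here are two configurations at opposite ends of the speed/precision trade-off (a sketch; only parameters documented above are used, and the specific threshold and FPS values are assumptions, not tuned recommendations):

```python
# Faster pass: automatic SAM2 masks, sparser frame sampling, no video output.
fast_config = VineConfig(
    segmentation_method="sam2",
    target_fps=2,
    visualize=False,
)

# Stricter pass: text-prompted detection with higher thresholds and visualizations.
precise_config = VineConfig(
    segmentation_method="grounding_dino_sam2",
    box_threshold=0.45,
    text_threshold=0.30,
    visualize=True,
    visualization_dir="outputs/",
)
```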
## System Requirements
- **OS**: Linux (Ubuntu 20.04+)
- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU)
- **GPU**: 8 GB+ VRAM (e.g., T4, V100, A100)
- **RAM**: 16 GB+
- **Disk**: ~5 GB free
## Troubleshooting
### CUDA Not Available

```python
import torch
print(torch.cuda.is_available())  # should print True
```
### Import Errors

```bash
conda activate vine_demo
pip list | grep -E "laser|sam2|groundingdino"
```
### Checkpoint Not Found

```bash
ls -lh checkpoints/
# Should list: sam2_hiera_tiny.pt, groundingdino_swint_ogc.pth
```
See QUICKSTART.md for detailed troubleshooting.
## Example Applications
### Sports Analysis

```python
results = pipeline(
    'soccer_game.mp4',
    categorical_keywords=['player', 'ball', 'referee'],
    unary_keywords=['running', 'kicking', 'jumping'],
    binary_keywords=['passing', 'tackling', 'defending']
)
```
### Surveillance

```python
results = pipeline(
    'security_feed.mp4',
    categorical_keywords=['person', 'vehicle', 'bag'],
    unary_keywords=['walking', 'running', 'standing'],
    binary_keywords=['approaching', 'following', 'carrying']
)
```
### Animal Behavior

```python
results = pipeline(
    'wildlife.mp4',
    categorical_keywords=['lion', 'zebra', 'elephant'],
    unary_keywords=['eating', 'walking', 'resting'],
    binary_keywords=['hunting', 'fleeing', 'protecting']
)
```
## Deployment
### Gradio Demo

```python
import gradio as gr

def analyze_video(video, categories, actions, relations):
    results = pipeline(
        video,
        categorical_keywords=categories.split(','),
        unary_keywords=actions.split(','),
        binary_keywords=relations.split(',')
    )
    return results['summary']

gr.Interface(analyze_video, ...).launch()
```
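The elided `gr.Interface(...)` arguments can be filled in along these lines (a sketch; the component choices are assumptions on my part, not part of the VINE API):

```python
# Hypothetical completion of the interface wiring above.
demo = gr.Interface(
    fn=analyze_video,
    inputs=[
        gr.Video(label="Input video"),
        gr.Textbox(label="Categories (comma-separated)"),
        gr.Textbox(label="Actions (comma-separated)"),
        gr.Textbox(label="Relations (comma-separated)"),
    ],
    outputs=gr.JSON(label="Summary"),
)
demo.launch()
```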
### FastAPI Server

```python
from fastapi import FastAPI
from transformers import AutoModel
from vine_hf import VinePipeline

app = FastAPI()

model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)

@app.post("/analyze")
async def analyze(video_path: str, keywords: dict):
    return pipeline(video_path, **keywords)
```
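For a stricter request schema than a raw `dict`, the keyword arguments can be spelled out with a Pydantic model (a sketch; the endpoint name and field defaults are assumptions, while the field names mirror the pipeline arguments documented above):

```python
from pydantic import BaseModel

class AnalyzeRequest(BaseModel):
    video_path: str
    categorical_keywords: list[str]
    unary_keywords: list[str] = []
    binary_keywords: list[str] = []

@app.post("/analyze_typed")  # hypothetical second endpoint
async def analyze_typed(req: AnalyzeRequest):
    return pipeline(
        req.video_path,
        categorical_keywords=req.categorical_keywords,
        unary_keywords=req.unary_keywords,
        binary_keywords=req.binary_keywords,
    )
```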
## Files in This Repository
- `setup_vine_complete.sh` - One-command setup script
- `QUICKSTART.md` - Quick start guide
- `README.md` - This file (complete documentation)
- `vine_config.py` - VineConfig class
- `vine_model.py` - VineModel class
- `vine_pipeline.py` - VinePipeline class
- `flattening.py` - Segment processing utilities
- `vis_utils.py` - Visualization utilities
## Citation
```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```
## License
This model is released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.
## Links
- Model: https://huggingface.co/video-fm/vine
- Quick Start: QUICKSTART.md
- Setup Script: setup_vine_complete.sh
- LASER GitHub: https://github.com/kevinxuez/LASER
- Issues: https://github.com/kevinxuez/LASER/issues
## Support
- Questions: HuggingFace Discussions
- Bugs: GitHub Issues
Made with ❤️ by the LASER team