
Qwen2.5-VL floor plan SFT adapter (stage 1)

Hub: mudasir13cs/qwen25-vl-3b-floorplan-sft

Improved using Qwen. This adapter comes from LoRA supervised fine-tuning (SFT) of Qwen2.5-VL-3B-Instruct (LICENSE) on CubiCasa5K (CC BY-NC 4.0).

Intended for non-commercial research use, consistent with the Qwen research license and the CubiCasa5K non-commercial terms.

Stage 2 (optional): GRPO fine-tune — mudasir13cs/qwen25-vl-3b-floorplan-grpo. In a local training checkout, see floorplan-vlm-grpo/README.md for the same style of usage doc.

Paper & upstream training material

If you are working inside a clone of the training repo, the training script may exist locally as train_floorplan_vlm.py.

Quick install

Inference / loading only:

pip install torch torchvision transformers peft accelerate pillow

Full SFT training (matches script docstring; includes data prep deps):

pip install torch torchvision transformers trl peft datasets accelerate shapely Pillow lxml numpy tqdm huggingface_hub
# optional GPU attention: pip install flash-attn
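Before loading the adapter, it can help to confirm that the inference packages are importable. The sketch below uses only the standard library; the helper name missing_packages and the mapping are illustrative and not part of the training script. Note that the pip package pillow is imported as PIL.

```python
from importlib.util import find_spec

# pip package name -> importable module name (they differ for Pillow).
INFERENCE_DEPS = {
    "torch": "torch",
    "torchvision": "torchvision",
    "transformers": "transformers",
    "peft": "peft",
    "accelerate": "accelerate",
    "pillow": "PIL",
}


def missing_packages(deps):
    """Return the pip names whose top-level module cannot be found."""
    return [pip_name for pip_name, module in deps.items() if find_spec(module) is None]


missing = missing_packages(INFERENCE_DEPS)
print("missing:", ", ".join(missing) if missing else "none")
```

This only checks that the modules can be located; it does not verify versions or GPU support.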

Loading the adapter

Use Hub IDs (recommended):

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

BASE = "Qwen/Qwen2.5-VL-3B-Instruct"
ADAPTER = "mudasir13cs/qwen25-vl-3b-floorplan-sft"

processor = AutoProcessor.from_pretrained(BASE)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

If you cloned the adapter weights locally, set ADAPTER = "./floorplan-vlm-sft" (or an absolute path) instead of the Hub repo ID.
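One way to switch between the local folder and the Hub automatically is sketched below. The helper resolve_adapter is hypothetical; it assumes the local clone contains adapter_config.json, the file PEFT writes alongside saved adapter weights.

```python
from pathlib import Path

HUB_ADAPTER = "mudasir13cs/qwen25-vl-3b-floorplan-sft"


def resolve_adapter(local_dir="./floorplan-vlm-sft", hub_id=HUB_ADAPTER):
    """Prefer a local adapter folder when it looks complete, else use the Hub ID."""
    # A saved PEFT adapter keeps its settings in adapter_config.json.
    if (Path(local_dir) / "adapter_config.json").is_file():
        return str(local_dir)
    return hub_id
```

The result can be assigned to ADAPTER before calling PeftModel.from_pretrained.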

Using the model (inference)

This stage defines the recommended prompts: the system message embeds the output JSON schema. Keep SYSTEM_PROMPT and USER_PROMPT identical to the constants in train_floorplan_vlm.py; changing the wording may degrade the output.

User text: “Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms.”

End-to-end pattern (matches the inference test at the end of train_floorplan_vlm.py):

import json, re, torch
from PIL import Image

SYSTEM_PROMPT = (
    "You are a floor plan vectorization expert. Extract wall, door, window geometry "
    "from floor plan images into structured JSON.\n\n"
    "Output ONLY valid JSON with this schema:\n"
    '{"walls":[{"id":"wall_N","start":[x,y],"end":[x,y],"thickness":T,"curvature":0,'
    '"openings":[{"type":"door"|"window","center":D,"width":W}]}],'
    '"rooms":[{"label":"room_type","walls":["wall_N",...]}]}\n\n'
    "Coordinates normalized so longer image edge = 1024."
)
USER_PROMPT = "Vectorize this floor plan into structured JSON with all walls, doors, windows, and rooms."

image = Image.open("plan.png").convert("RGB")

messages = [
    {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": USER_PROMPT}]},
]
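# Render the chat template to text, then let the processor pair it with the image.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
# BatchFeature.to() moves the tensors and keeps attribute access (inputs.input_ids) working;
# converting to a plain dict here would break the slice below.
inputs = inputs.to(model.device)

# Greedy decoding keeps the JSON output deterministic.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
raw = processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
# Match from the first "{" to the last "}"; plan stays None if nothing matches.
m = re.search(r"\{[\s\S]*\}", raw)
plan = json.loads(m.group()) if m else None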
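The regular expression in the pattern above spans from the first "{" to the last "}", so json.loads can raise if the model emits trailing text containing a brace or if generation is cut off at max_new_tokens. A more tolerant alternative, sketched here with a hypothetical helper name, uses json.JSONDecoder.raw_decode to return the first object that decodes cleanly and None otherwise.

```python
import json


def extract_first_json_object(raw):
    """Return the first decodable JSON object in `raw`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Try the next opening brace.
        start = raw.find("{", start + 1)
    return None
```

With this helper, plan = extract_first_json_object(raw) replaces the last two lines of the pattern; a truncated generation yields None instead of an exception.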

Output shape: top-level walls (with optional openings) and rooms. Example JSON is documented under Output JSON Schema in the Manitocross training README.
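Because the model output is free-form text, a light structural check before downstream use can catch malformed plans. The sketch below checks only what the schema in SYSTEM_PROMPT states (walls with id, start, end, and optional openings; rooms referencing wall ids). The function names are illustrative, and the checks are an assumption about what is worth validating, not part of the training script.

```python
def _is_point(value):
    """True for a two-element list of numbers such as [x, y]."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in value)
    )


def check_plan(plan):
    """Return a list of problems found in a decoded plan; empty means none found."""
    if not isinstance(plan, dict):
        return ["top level is not a JSON object"]
    problems = []
    walls, rooms = plan.get("walls"), plan.get("rooms")
    if not isinstance(walls, list):
        problems.append("'walls' is missing or not a list")
        walls = []
    if not isinstance(rooms, list):
        problems.append("'rooms' is missing or not a list")
        rooms = []

    wall_ids = set()
    for i, wall in enumerate(walls):
        if not isinstance(wall, dict):
            problems.append(f"walls[{i}] is not an object")
            continue
        if isinstance(wall.get("id"), str):
            wall_ids.add(wall["id"])
        else:
            problems.append(f"walls[{i}] has no string id")
        for key in ("start", "end"):
            if not _is_point(wall.get(key)):
                problems.append(f"walls[{i}].{key} is not an [x, y] pair")
        openings = wall.get("openings")
        if openings is None:
            openings = []  # openings are optional
        if not isinstance(openings, list):
            problems.append(f"walls[{i}].openings is not a list")
            openings = []
        for j, opening in enumerate(openings):
            if not isinstance(opening, dict) or opening.get("type") not in ("door", "window"):
                problems.append(f"walls[{i}].openings[{j}] has an unknown type")

    # Rooms must only reference wall ids that exist in the plan.
    for i, room in enumerate(rooms):
        if not isinstance(room, dict):
            problems.append(f"rooms[{i}] is not an object")
            continue
        refs = room.get("walls")
        for ref in refs if isinstance(refs, list) else []:
            if not isinstance(ref, str) or ref not in wall_ids:
                problems.append(f"rooms[{i}] references unknown wall {ref!r}")
    return problems
```

An empty return value means the plan passed these checks; it does not guarantee that the geometry is correct.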

Reproducing stage 1

  1. Run huggingface-cli login if you plan to push to the Hub (see PUSH_TO_HUB in the script).
  2. Run train_floorplan_vlm.py. The first run downloads CubiCasa5K from Zenodo into ./cubicasa_data (~5 GB). Tune NUM_EPOCHS, MAX_SAMPLES, LEARNING_RATE, HUB_MODEL_ID, etc. in the configuration block at the top of that file.

Training details

  • Base model: Qwen/Qwen2.5-VL-3B-Instruct
  • Dataset: CubiCasa5K · Zenodo
  • Method: LoRA SFT (SFTTrainer / TRL; see script)
  • Supervision targets: SVG-derived structured JSON (walls, rooms, openings)
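The system prompt states that coordinates are normalized so the longer image edge equals 1024. A minimal sketch of that convention and its inverse follows, assuming a single uniform scale factor with no padding or offset; the preprocessing in train_floorplan_vlm.py may differ, and the function names are illustrative. Only points are converted here, since the card does not state the units of thickness or of the opening fields.

```python
def normalization_scale(width, height, target=1024):
    """Factor that maps pixel coordinates so the longer image edge equals `target`."""
    longer = max(width, height)
    if longer <= 0:
        raise ValueError("image dimensions must be positive")
    return target / longer


def to_normalized(point, width, height, target=1024):
    """Map an [x, y] pixel point into the normalized coordinate space."""
    s = normalization_scale(width, height, target)
    return [point[0] * s, point[1] * s]


def to_pixels(point, width, height, target=1024):
    """Map a normalized [x, y] point back to pixel coordinates of the source image."""
    s = normalization_scale(width, height, target)
    return [point[0] / s, point[1] / s]
```

For a 2048 x 1024 image the factor is 0.5, so to_pixels applied to a predicted wall endpoint doubles both coordinates.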

Add your own run metadata: epochs, LR, seed, GPU, date, partial MAX_SAMPLES if applicable.
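One way to keep that metadata next to the adapter is a small JSON record. The sketch below is illustrative: the RunMetadata class and its field names are assumptions that mirror the items listed above, and none of the values shown in the usage note are the settings used for this adapter.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunMetadata:
    epochs: int
    learning_rate: float
    seed: int
    gpu: str
    date: str  # ISO 8601 date string
    max_samples: Optional[int] = None  # None means the full dataset was used


def to_json(meta):
    """Serialize the run metadata as indented JSON."""
    return json.dumps(asdict(meta), indent=2)
```

For example, to_json(RunMetadata(epochs=1, learning_rate=1e-4, seed=0, gpu="<your GPU>", date="<YYYY-MM-DD>")) yields a record with placeholder values that can be saved alongside the adapter files.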

Citation

@article{floorplanvlm2026,
  title={FloorplanVLM: A Vision-Language Model for Floorplan Vectorization},
  journal={arXiv preprint arXiv:2602.06507},
  year={2026}
}

Acknowledgments

Author / contact

Mudasir — multimodal AI, VLM fine-tuning, retrieval/RAG research, and engineering; MS AI Convergence, 숭실대학교 — Soongsil University, Seoul. More credentials, publications, and projects: mudasir13cs.github.io
