🧾 Model Card: nevernever69/dit-doclaynet-segmentation

🧠 Model Overview

This model is a fine-tuned version of microsoft/dit-base for document layout semantic segmentation on the DocLayNet dataset (small subset: nevernever69/small-DocLayNet-v1.1). It segments scanned document images into 11 layout categories such as title, paragraph, table, and footer.

📚 Intended Uses

  • Segment document images into structured layout elements
  • Assist in downstream tasks like document OCR, archiving, and automatic annotation
  • Support researchers and developers working in document AI or digital humanities

🏷️ Labels (11 Classes)

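| ID | Label | Color |
|----|-------------|------------|
| 0 | Background | Black |
| 1 | Title | Red |
| 2 | Paragraph | Green |
| 3 | Figure | Blue |
| 4 | Table | Yellow |
| 5 | List | Magenta |
| 6 | Header | Cyan |
| 7 | Footer | Dark Red |
| 8 | Page Number | Dark Green |
| 9 | Footnote | Dark Blue |
| 10 | Caption | Olive |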
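
The label table can be turned into a lookup for rendering predictions. The following is a minimal sketch, not the author's visualization code: the label names come from the table, but the RGB values are assumptions (the card names the colors without giving numeric values), and `ID2LABEL`, `PALETTE`, and `colorize` are hypothetical names chosen for illustration.

```python
import numpy as np

# Label names copied from the table, indexed by class ID.
ID2LABEL = [
    "Background", "Title", "Paragraph", "Figure", "Table", "List",
    "Header", "Footer", "Page Number", "Footnote", "Caption",
]

# Assumed RGB values: the card gives color names only.
PALETTE = np.array([
    (0, 0, 0),        # 0 Background: black
    (255, 0, 0),      # 1 Title: red
    (0, 255, 0),      # 2 Paragraph: green
    (0, 0, 255),      # 3 Figure: blue
    (255, 255, 0),    # 4 Table: yellow
    (255, 0, 255),    # 5 List: magenta
    (0, 255, 255),    # 6 Header: cyan
    (128, 0, 0),      # 7 Footer: dark red
    (0, 128, 0),      # 8 Page Number: dark green
    (0, 0, 128),      # 9 Footnote: dark blue
    (128, 128, 0),    # 10 Caption: olive
], dtype=np.uint8)


def colorize(mask: np.ndarray) -> np.ndarray:
    """Map an (H, W) array of class IDs to an (H, W, 3) uint8 RGB image."""
    return PALETTE[mask]


# Tiny hand-made mask: Background, Title, Table, Caption.
demo = np.array([[0, 1], [4, 10]])
rgb = colorize(demo)
print(rgb.shape)  # (2, 2, 3)
```

The resulting array can be passed to `PIL.Image.fromarray` to save or display the colored mask.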

🧪 Training Details

  • Base model: microsoft/dit-base
  • Dataset: nevernever69/small-DocLayNet-v1.1
  • Input size: 1025×1025 (masks resized to 56×56 during training)
  • Batch size: 8
  • Epochs: 2
  • Learning rate: 5e-5
  • Loss function: Cross-entropy
  • Hardware: Trained with mixed precision (fp16) on GPU
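
To make the mask resizing and the loss concrete, here is a NumPy sketch of downsampling a 1025×1025 class-ID mask to 56×56 and scoring logits against it with per-pixel cross-entropy. It rests on assumptions: the card does not say which interpolation was used for the masks (nearest-neighbor is assumed here because it keeps every value a valid class ID), the actual training presumably used PyTorch rather than NumPy, and the function names are hypothetical.

```python
import numpy as np

NUM_CLASSES = 11
OUT_SIZE = 56  # mask resolution stated in the card


def resize_mask_nearest(mask: np.ndarray, size: int = OUT_SIZE) -> np.ndarray:
    """Downsample an (H, W) class-ID mask by nearest-neighbor sampling.

    Assumption: nearest-neighbor is used so that class IDs are never
    blended into meaningless in-between values.
    """
    h, w = mask.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return mask[rows[:, None], cols[None, :]]


def pixel_cross_entropy(logits: np.ndarray, target: np.ndarray) -> float:
    """Mean cross-entropy between (C, H, W) logits and an (H, W) target."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    # Pick the log-probability of the true class at every pixel.
    picked = np.take_along_axis(log_probs, target[None, :, :], axis=0)
    return float(-picked.mean())


# Random stand-in data, not taken from DocLayNet.
rng = np.random.default_rng(0)
full_mask = rng.integers(0, NUM_CLASSES, size=(1025, 1025))
small_mask = resize_mask_nearest(full_mask)

# All-zero logits are a uniform prediction, so the loss equals ln(11).
uniform_logits = np.zeros((NUM_CLASSES, OUT_SIZE, OUT_SIZE))
loss = pixel_cross_entropy(uniform_logits, small_mask)
print(small_mask.shape, loss)
```

A loss near ln(11) therefore marks the level of an uninformed model, which gives a reference point when reading training curves.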

📊 Evaluation

This card reports qualitative results only. On a validation subset, the model separates distinct layout elements with clear boundaries, and overlay visualizations show it segmenting both dense and sparse regions in historical and modern documents.
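
For readers who want a number to go with the visual inspection, per-class intersection-over-union (IoU) is a common metric for semantic segmentation. The sketch below is one way to compute it and is not a procedure described by the author; `per_class_iou` is a hypothetical helper, and the arrays are toy values, not model output.

```python
import numpy as np

NUM_CLASSES = 11


def per_class_iou(pred: np.ndarray, target: np.ndarray,
                  num_classes: int = NUM_CLASSES) -> np.ndarray:
    """Return IoU per class; classes absent from both arrays are NaN."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union > 0:
            ious[c] = np.logical_and(pred_c, target_c).sum() / union
    return ious


# Toy 2x3 masks with hypothetical values.
pred = np.array([[0, 0, 1], [2, 2, 1]])
target = np.array([[0, 1, 1], [2, 2, 2]])

ious = per_class_iou(pred, target)
mean_iou = float(np.nanmean(ious))  # averages only the classes that occur
print(ious[:3], mean_iou)
```

Classes that appear in neither the prediction nor the ground truth are left as NaN and skipped by `np.nanmean`, so they do not distort the average.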

🚀 How to Use

from transformers import AutoImageProcessor, BeitForSemanticSegmentation
from PIL import Image
import torch

# Use a GPU when one is available; otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and image processor
model = BeitForSemanticSegmentation.from_pretrained("nevernever69/dit-doclaynet-segmentation")
image_processor = AutoImageProcessor.from_pretrained("nevernever69/dit-doclaynet-segmentation")
model.to(device).eval()

# Load and preprocess image
image = Image.open("your-image.png").convert("RGB")
inputs = image_processor(images=image, return_tensors="pt").to(device)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # shape: (1, num_classes, h, w)
    # PIL reports size as (width, height); interpolate expects (height, width)
    upsampled = torch.nn.functional.interpolate(logits, size=image.size[::-1], mode="bilinear", align_corners=False)
    # Per-pixel class IDs as a NumPy array of shape (height, width)
    mask = upsampled.argmax(dim=1).squeeze(0).cpu().numpy()
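
Once `mask` is available, a quick sanity check is to list which layout classes were predicted and how much of the page each one covers. This is a small sketch under assumptions: `class_coverage` is a hypothetical helper, the label order is assumed to match the label table in this card, and the mask below is a hand-made stand-in so the example runs without the model.

```python
import numpy as np

# Label names in class-ID order, as listed in the label table.
ID2LABEL = [
    "Background", "Title", "Paragraph", "Figure", "Table", "List",
    "Header", "Footer", "Page Number", "Footnote", "Caption",
]


def class_coverage(mask: np.ndarray, id2label=ID2LABEL) -> dict:
    """Return {label: fraction of pixels} for every class present in mask."""
    counts = np.bincount(mask.ravel(), minlength=len(id2label))
    return {id2label[i]: float(n) / mask.size
            for i, n in enumerate(counts) if n > 0}


# Stand-in for the `mask` produced by the inference snippet.
mask = np.zeros((4, 4), dtype=np.int64)
mask[0, :] = 1    # one row of Title
mask[1:3, :] = 2  # two rows of Paragraph

coverage = class_coverage(mask)
print(coverage)  # {'Background': 0.25, 'Title': 0.25, 'Paragraph': 0.5}
```

A page whose coverage is almost entirely Background may indicate a preprocessing problem, for example an image that was not converted to RGB.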

πŸ§‘β€πŸŽ“ Author

Created by Never @nevernever69.
Feel free to open issues or discuss improvements on the Hugging Face hub.

πŸ“ Citation

If you use this model in your work, please consider citing:

@misc{never2025doclaynetseg,
  author = {Never},
  title = {Document Layout Segmentation using DiT-base fine-tuned on DocLayNet},
  year = {2025},
  howpublished = {\url{https://huggingface.co/nevernever69/dit-doclaynet-segmentation}}
}