Model Card: SSI-BERT-v1

Surgical Site Infection Detection from Clinical Notes

Model Details

Model Architecture

Base Model: BERT (Bidirectional Encoder Representations from Transformers)
Model Name: bert-base-uncased
HuggingFace ID: google-bert/bert-base-uncased
Task: Binary Classification (SSI Detection)
Fine-tuned: Yes
Model Size: ~440 MB (110M parameters stored as FP32)
Parameters: 110M

Training Configuration

Framework: PyTorch 2.7.0+
GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
Precision: BF16 (Mixed precision)
Optimizer: AdamW
Learning Rate: 2e-5
Batch Size: 32 (with 2 gradient accumulation steps)
Epochs: 3
Training Time: ~5-6 hours
Training Date: 2025-01-15

Tokenizer

Type: WordPiece
Vocabulary Size: 30,522
Max Sequence Length: 512 tokens
Special Tokens: [CLS], [SEP], [PAD], [UNK]

Intended Use

Primary Use Case

Epidemiological surveillance of surgical site infections (SSI) from clinical notes in healthcare systems. This model is designed for monitoring and trend detection, not clinical decision support.

Use Context

Post-operative clinical notes (0-30 days post-surgery)
Batch processing of clinical documentation
Surveillance alert generation
Procedure-specific SSI rate tracking

Appropriate Uses

✓ Identifying potential SSI cases for further review
✓ Tracking SSI incidence trends across departments/procedures
✓ Flagging high-risk cases for clinician review
✓ Epidemiological research and surveillance

Inappropriate Uses

✗ Standalone clinical diagnosis
✗ Real-time patient triage decisions
✗ Treatment recommendations
✗ Automated patient management without human review

Performance Metrics

Validation Results (Synthetic Data)

Accuracy: 0.8900
Precision: 0.8500
Recall (Sensitivity): 0.8800
Specificity: ~0.8500
F1 Score: 0.8650
AUC-ROC: 0.9200
Dataset Size: 200,000 test samples

Performance Notes

Metrics calculated on held-out test set (synthetic data)
Real-world performance expected to vary (±5-10%)
Model optimized for recall (catching SSI cases)
12% false negative rate acceptable for surveillance use
15% false positive rate manageable with clinician review
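
The F1 score and the two error rates quoted above follow directly from the reported precision, recall, and specificity. A minimal sketch of that arithmetic, using the reported values as inputs (the function names are illustrative and not part of the project's code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def false_negative_rate(recall: float) -> float:
    """Share of true SSI cases the model misses (1 - sensitivity)."""
    return 1.0 - recall


def false_positive_rate(specificity: float) -> float:
    """Share of non-SSI notes flagged as SSI (1 - specificity)."""
    return 1.0 - specificity


# Values reported in the validation results above
precision, recall, specificity = 0.85, 0.88, 0.85

f1 = f1_score(precision, recall)
fnr = false_negative_rate(recall)
fpr = false_positive_rate(specificity)

print(f"F1 = {f1:.4f}, FNR = {fnr:.2f}, FPR = {fpr:.2f}")
```

With these inputs the F1 comes out at roughly 0.865, and the miss and false-alarm rates at 12% and 15%, matching the figures in the notes.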

Threshold Analysis

Default Threshold: 0.50
Surveillance Threshold: 0.45 (optimized for sensitivity)
Conservative Threshold: 0.60 (high precision)
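
As an illustration, the three operating points can be applied to a model probability as follows. This is a sketch only: the mode names and the `classify` helper are hypothetical, and treating a score equal to the threshold as positive is an assumption, since the card does not say whether the comparison is inclusive.

```python
# Operating points listed above; the dictionary keys are illustrative names.
THRESHOLDS = {
    "default": 0.50,
    "surveillance": 0.45,  # favors sensitivity
    "conservative": 0.60,  # favors precision
}


def classify(ssi_probability: float, mode: str = "default") -> int:
    """Return 1 (SSI) or 0 (no SSI) for the chosen operating point.

    Assumption: a score exactly equal to the threshold counts as positive.
    """
    if mode not in THRESHOLDS:
        raise ValueError(f"unknown mode: {mode!r}")
    return int(ssi_probability >= THRESHOLDS[mode])


# A borderline note is flagged only under the surveillance setting.
score = 0.47
labels = {mode: classify(score, mode) for mode in THRESHOLDS}
print(labels)
```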

Training Data

Data Source

Synthetic clinical notes generated for validation purposes

Data Composition

Total Training Samples: 1,000,000
SSI Cases (Positive): 150,000 (15%)
Normal Cases (Negative): 850,000 (85%)
Train/Val/Test Split: 70% / 15% / 15%

Data Characteristics

Post-operative clinical notes (0-30 days post-surgery)
12 surgical procedure types represented
Clinical terminology and medical abbreviations
Vital signs and clinical findings
Note length: 300-2000 characters (avg 800)

Limitations

Synthetic Generation: Notes generated using templates
Not Trained on Real Clinical Data: Performance on real clinical notes may differ
English Only: No multi-language support
US Healthcare Context: Terminology based on US clinical practice

Limitations and Bias

Known Limitations

Domain Shift: Trained on synthetic data; real clinical text may have different patterns
Class Imbalance: Model trained with 15% SSI prevalence; real-world prevalence is lower (~5-10%), so precision at a given threshold is likely to be lower in practice
Language: English-only, US healthcare context
Temporal Bias: No temporal ordering in training (shuffled data)
Procedure Coverage: Limited to 12 procedure types
Post-operative Window: Optimized for 0-30 days post-op only

Potential Biases

Clinical Documentation Style: Model may perform differently across hospitals with different documentation practices
Terminology Variation: May struggle with rare/novel clinical abbreviations
Provider Bias: Performance may vary by note author/department

Generalization

Not validated on external datasets
Expected performance drop on out-of-distribution data
Requires validation on real clinical data before deployment

Ethical Considerations

Fairness

Model developed for epidemiological surveillance, not individual diagnosis
Not intended for resource allocation decisions
Should not be sole factor in clinical decisions

Transparency

Decision threshold can be adjusted for sensitivity/specificity trade-off
Model provides probability scores for human interpretation
Predictions should always be reviewed by clinicians

Safety

Model designed as surveillance tool, not clinical decision support
Includes explicit warnings against standalone clinical use
Requires human-in-the-loop for alert validation

Privacy

Model does not store patient data
De-identified text input only
No identifiable information in model outputs

Model Inputs and Outputs

Input

{
  "text": "Clinical note text here...",
  "threshold": 0.5
}

Output

{
  "ssi_probability": 0.8234,
  "label": 1,
  "prediction": "SSI",
  "threshold": 0.5,
  "timestamp": "2025-01-15T16:04:43"
}

Input Constraints

Text length: 50-10,000 characters
Language: English only
Format: Plain text clinical notes
Context: Post-operative (0-30 days)
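
A caller could enforce the length constraint before sending a note to the model. The sketch below is hypothetical (the `validate_note` helper is not part of the project) and assumes that surrounding whitespace is stripped before counting and that both bounds are inclusive. It does not check language or post-operative context.

```python
MIN_CHARS = 50      # lower bound from the input constraints
MAX_CHARS = 10_000  # upper bound from the input constraints


def validate_note(text: str) -> str:
    """Return the stripped note, or raise if it violates the length limits."""
    if not isinstance(text, str):
        raise TypeError("clinical note must be a plain-text string")
    cleaned = text.strip()
    if not MIN_CHARS <= len(cleaned) <= MAX_CHARS:
        raise ValueError(
            f"note length {len(cleaned)} is outside {MIN_CHARS}-{MAX_CHARS} characters"
        )
    return cleaned


note = "  POD 5: incision erythematous with purulent drainage, temp 38.4 C.  "
print(len(validate_note(note)))
```

Notes near the upper limit will usually exceed the 512-token maximum sequence length, so the tokenizer truncates them.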

Output Interpretation

ssi_probability (0-1): Model-estimated probability of SSI (softmax output for the positive class)
label (0 or 1): Binary classification
prediction: Human-readable class label
Scores <0.4: Likely negative
Scores 0.4-0.6: Uncertain (requires review)
Scores >0.6: Likely positive
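
The score bands above can be expressed as a small helper, for example to route uncertain notes to a review queue. The function name and return strings are illustrative, not part of the model's API.

```python
def review_band(ssi_probability: float) -> str:
    """Map a probability to the interpretation bands listed above."""
    if not 0.0 <= ssi_probability <= 1.0:
        raise ValueError("ssi_probability must be between 0 and 1")
    if ssi_probability < 0.4:
        return "likely negative"
    if ssi_probability > 0.6:
        return "likely positive"
    # 0.4 and 0.6 themselves fall in the uncertain band
    return "uncertain (requires review)"


for score in (0.12, 0.50, 0.8234):
    print(score, review_band(score))
```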

Training and Evaluation

Training Parameters

Model: bert-base-uncased
Optimizer: adamw_torch
Learning Rate: 2e-5
Batch Size: 32
Gradient Accumulation: 2
Gradient Checkpointing: True
Mixed Precision: BF16
Warmup Steps: 100
Max Grad Norm: 1.0
Weight Decay: 0.01
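
For readers using the Hugging Face `Trainer`, the parameters above map onto `TrainingArguments` roughly as shown below. This is a sketch under assumptions rather than the project's actual training script: the output directory is a placeholder, and 32 is assumed to be the per-device batch size, which gives an effective batch of 64 with two accumulation steps. If 32 is instead the effective size, the per-device value would be 16.

```python
# Keyword arguments for transformers.TrainingArguments, taken from the
# parameter list above plus the epoch count from the training configuration.
TRAINING_KWARGS = {
    "output_dir": "output/models/ssi-bert-sketch",  # hypothetical path
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,  # assumption: 32 is per device
    "gradient_accumulation_steps": 2,
    "gradient_checkpointing": True,
    "bf16": True,  # requires hardware with BF16 support
    "warmup_steps": 100,
    "max_grad_norm": 1.0,
    "weight_decay": 0.01,
    "optim": "adamw_torch",
}


def build_training_arguments():
    """Create TrainingArguments; imported lazily so the dict is usable alone."""
    from transformers import TrainingArguments

    return TrainingArguments(**TRAINING_KWARGS)


effective_batch_size = (
    TRAINING_KWARGS["per_device_train_batch_size"]
    * TRAINING_KWARGS["gradient_accumulation_steps"]
)
print("Effective batch size:", effective_batch_size)
```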

Evaluation Methodology

Stratified train/val/test split (70/15/15)
Class-weighted metrics due to imbalance
Threshold optimization on validation set
Held-out test set evaluation
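
The card does not state which criterion was used for threshold optimization. One common approach for a recall-oriented surveillance model is to pick the highest threshold that still meets a target recall on the validation set. The sketch below illustrates that idea. The function, the candidate grid, and the toy data are all assumptions.

```python
def pick_threshold(probs, labels, target_recall, candidates=None):
    """Return the highest candidate threshold whose validation recall meets
    target_recall, or None if no candidate qualifies."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05 ... 0.95
    positives = sum(labels)
    if positives == 0:
        raise ValueError("validation set contains no positive cases")

    best = None
    for threshold in sorted(candidates):
        true_positives = sum(
            1 for p, y in zip(probs, labels) if y == 1 and p >= threshold
        )
        # Recall can only fall as the threshold rises, so the last
        # qualifying candidate in ascending order is the highest one.
        if true_positives / positives >= target_recall:
            best = threshold
    return best


# Toy validation scores and labels, for illustration only
val_probs = [0.90, 0.70, 0.52, 0.30, 0.20, 0.10]
val_labels = [1, 1, 1, 0, 1, 0]
chosen = pick_threshold(val_probs, val_labels, target_recall=0.75)
print("Chosen threshold:", chosen)
```

The chosen threshold should then be frozen before scoring the held-out test set, so that the test metrics are not tuned on the data they report.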

Hardware

GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
CPU: Multi-core processor
RAM: 16GB+
Storage: 1GB for model + dependencies

Model Versioning

Version: 1.0.0

Release Date: 2025-01-15
Status: Beta/Prototype
Base Model: bert-base-uncased
Training Epochs: 3
Data: Synthetic (1M samples)
Validation: Synthetic test set

Future Versions

v1.1: Expected after real clinical data validation
v2.0: Planned with ClinicalBERT base model
v2.1: Multi-procedure optimization

How to Use

Installation

pip install transformers torch

Local Inference

from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Directory containing the fine-tuned checkpoint and tokenizer files
model_path = "output/models/ssi-bert-pipeline/initial/final"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()  # make sure dropout is disabled for inference

text = "Clinical note here..."
# Notes longer than 512 tokens are truncated
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    # Softmax over the two classes; index 1 is the SSI (positive) class
    probability = torch.softmax(outputs.logits, dim=1)[0, 1].item()

Via API

curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"text":"Clinical note text", "threshold":0.5}'
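
The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above. It assumes the service is running at the same local URL and returns the JSON object described under Output. The helper names are illustrative.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/predict"  # endpoint from the curl example


def build_request(text: str, threshold: float = 0.5, url: str = API_URL):
    """Build the POST request without sending it."""
    payload = json.dumps({"text": text, "threshold": threshold}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text: str, threshold: float = 0.5, timeout: float = 10.0) -> dict:
    """Send the note to the service and return the decoded JSON response."""
    request = build_request(text, threshold)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires the API server to be running):
# result = predict("Clinical note text", threshold=0.45)
# print(result["ssi_probability"], result["prediction"])
```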

Batch Processing

python cli.py monitor \
  --model-path output/models/ssi-bert-pipeline/initial/final \
  --data data/clinical_notes.csv \
  --period january_2024 \
  --save-predictions

Citation

If you use this model, please cite:

@misc{ssi_bert_v1_2025,
  title={SSI-BERT-v1: BERT-based Surgical Site Infection Detection},
  author={Daryn Sutton/Ch3DS},
  year={2025},
  month={January},
  note={Trained on synthetic clinical notes for epidemiological surveillance}
}

License

Model: Apache 2.0 (inherited from BERT-base-uncased)
Documentation: CC-BY-4.0

Changelog

Version 1.0.0 (2025-01-15)

Initial release
Trained on 1M synthetic clinical notes
Validated on 200k test samples
Performance: 89% accuracy, 92% AUC-ROC

Contact and Support

For questions or issues:

Documentation: See README.md
Issue Tracking: github.com/Ch3w3y/SSIBERT
Email: [email protected]

Disclaimer

This model is provided for research and surveillance purposes only. It is not intended for clinical diagnosis or treatment decisions. Always consult with qualified healthcare professionals for clinical decisions. The developers assume no liability for misuse or unintended consequences.

Model Card Last Updated: 2025-01-15
Model Version: 1.0.0
Status: Beta (Pre-production)
