Model Card: SSI-BERT-v1
Surgical Site Infection Detection from Clinical Notes
Model Details
Model Architecture
- Base Model: BERT (Bidirectional Encoder Representations from Transformers)
- Model Name: bert-base-uncased
- HuggingFace ID: google-bert/bert-base-uncased
- Task: Binary Classification (SSI Detection)
- Fine-tuned: Yes
- Model Size: 340 MB
- Parameters: 110M
Training Configuration
- Framework: PyTorch 2.7.0+
- GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
- Precision: BF16 (Mixed precision)
- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 32 (with gradient accumulation)
- Epochs: 3
- Training Time: ~5-6 hours
- Training Date: 2025-01-15
Tokenizer
- Type: WordPiece
- Vocabulary Size: 30,522
- Max Sequence Length: 512 tokens
- Special Tokens: [CLS], [SEP], [PAD], [UNK]
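As a quick check, the tokenizer properties listed above can be confirmed with the HuggingFace transformers library. A minimal sketch (this loads the stock base tokenizer, which classification fine-tuning typically leaves unchanged):
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("google-bert/bert-base-uncased")
print(tokenizer.vocab_size)          # 30522
print(tokenizer.model_max_length)    # 512
print(tokenizer.all_special_tokens)  # [CLS], [SEP], [PAD], [UNK] (plus [MASK] from pre-training)

# WordPiece splits out-of-vocabulary clinical terms into '##'-prefixed subwords
print(tokenizer.tokenize("periwound erythema"))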
Intended Use
Primary Use Case
Epidemiological surveillance of surgical site infections (SSI) from clinical notes in healthcare systems. This model is designed for monitoring and trend detection, not clinical decision support.
Use Context
- Post-operative clinical notes (0-30 days post-surgery)
- Batch processing of clinical documentation
- Surveillance alert generation
- Procedure-specific SSI rate tracking
Appropriate Uses
- ✅ Identifying potential SSI cases for further review
- ✅ Tracking SSI incidence trends across departments/procedures
- ✅ Flagging high-risk cases for clinician review
- ✅ Epidemiological research and surveillance
Inappropriate Uses
- ❌ Standalone clinical diagnosis
- ❌ Real-time patient triage decisions
- ❌ Treatment recommendations
- ❌ Automated patient management without human review
Performance Metrics
Validation Results (Synthetic Data)
- Accuracy: 0.8900
- Precision: 0.8500
- Recall (Sensitivity): 0.8800
- Specificity: ~0.8500
- F1 Score: 0.8650
- AUC-ROC: 0.9200
- Dataset Size: 200,000 test samples
Performance Notes
- Metrics calculated on held-out test set (synthetic data)
- Real-world performance expected to vary (±5-10%)
- Model optimized for recall (catching SSI cases)
- 12% false negative rate (1 − recall of 0.88) acceptable for surveillance use
- 15% false positive rate (consistent with ~0.85 specificity) manageable with clinician review
Threshold Analysis
- Default Threshold: 0.50
- Surveillance Threshold: 0.45 (optimized for sensitivity)
- Conservative Threshold: 0.60 (high precision)
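A minimal sketch of how these thresholds would be applied to the model's probability output (the classify helper is illustrative, not part of the released code):
SURVEILLANCE_THRESHOLD = 0.45  # optimized for sensitivity
DEFAULT_THRESHOLD = 0.50
CONSERVATIVE_THRESHOLD = 0.60  # high precision

def classify(ssi_probability: float, threshold: float = DEFAULT_THRESHOLD) -> int:
    """Return 1 (SSI) when the probability meets the chosen threshold."""
    return int(ssi_probability >= threshold)

print(classify(0.48))                          # 0 at the default threshold
print(classify(0.48, SURVEILLANCE_THRESHOLD))  # 1 in surveillance mode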
Training Data
Data Source
Synthetic clinical notes generated for validation purposes
Data Composition
- Total Training Samples: 1,000,000
- SSI Cases (Positive): 150,000 (15%)
- Normal Cases (Negative): 850,000 (85%)
- Train/Val/Test Split: 70% / 15% / 15%
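The stratified 70/15/15 split can be reproduced along these lines (a sketch assuming scikit-learn and a pandas DataFrame with text/label columns; all names are illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the 1M-note dataset (illustrative; real column names may differ)
notes_df = pd.DataFrame({
    "text": [f"note {i}" for i in range(1000)],
    "label": [1 if i % 100 < 15 else 0 for i in range(1000)],  # ~15% positives
})

# 70% train, then split the remaining 30% evenly into 15% val / 15% test,
# stratifying each time so every split keeps the 15% SSI prevalence
train_df, holdout_df = train_test_split(
    notes_df, test_size=0.30, stratify=notes_df["label"], random_state=42)
val_df, test_df = train_test_split(
    holdout_df, test_size=0.50, stratify=holdout_df["label"], random_state=42)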
Data Characteristics
- Post-operative clinical notes (0-30 days post-surgery)
- 12 surgical procedure types represented
- Clinical terminology and medical abbreviations
- Vital signs and clinical findings
- Note length: 300-2000 characters (avg 800)
Limitations
- Synthetic Generation: Notes generated using templates
- Not Trained on Real Clinical Data: Performance on real clinical notes may differ
- English Only: No multi-language support
- US Healthcare Context: Terminology based on US clinical practice
Limitations and Bias
Known Limitations
- Domain Shift: Trained on synthetic data; real clinical text may have different patterns
- Class Imbalance: Model trained with 15% SSI prevalence (real ~5-10%)
- Language: English-only, US healthcare context
- Temporal Bias: No temporal ordering in training (shuffled data)
- Procedure Coverage: Limited to 12 procedure types
- Post-operative Window: Optimized for 0-30 days post-op only
Potential Biases
- Clinical Documentation Style: Model may perform differently across hospitals with different documentation practices
- Terminology Variation: May struggle with rare/novel clinical abbreviations
- Provider Bias: Performance may vary by note author/department
Generalization
- Not validated on external datasets
- Expected performance drop on out-of-distribution data
- Requires validation on real clinical data before deployment
Ethical Considerations
Fairness
- Model developed for epidemiological surveillance, not individual diagnosis
- Not intended for resource allocation decisions
- Should not be sole factor in clinical decisions
Transparency
- Decision threshold can be adjusted for sensitivity/specificity trade-off
- Model provides probability scores for human interpretation
- Predictions should always be reviewed by clinicians
Safety
- Model designed as surveillance tool, not clinical decision support
- Includes explicit warnings against standalone clinical use
- Requires human-in-the-loop for alert validation
Privacy
- Model does not store patient data
- De-identified text input only
- No identifiable information in model outputs
Model Inputs and Outputs
Input
{
  "text": "Clinical note text here...",
  "threshold": 0.5
}
Output
{
  "ssi_probability": 0.8234,
  "label": 1,
  "prediction": "SSI",
  "threshold": 0.5,
  "timestamp": "2025-01-15T16:04:43"
}
Input Constraints
- Text length: 50-10,000 characters
- Language: English only
- Format: Plain text clinical notes
- Context: Post-operative (0-30 days)
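Callers may want to enforce these constraints before invoking the model; a minimal sketch (the validate_note helper is hypothetical, not part of the released code):
def validate_note(text: str) -> str:
    """Reject notes outside the supported 50-10,000 character range."""
    if not 50 <= len(text) <= 10_000:
        raise ValueError(f"note length {len(text)} outside supported range 50-10,000")
    return text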
Output Interpretation
- ssi_probability (0-1): Confidence score for SSI presence
- label (0 or 1): Binary classification
- prediction: Human-readable class label
- Scores <0.4: Likely negative
- Scores 0.4-0.6: Uncertain (requires review)
- Scores >0.6: Likely positive
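The score bands above map directly to a simple triage function; an illustrative sketch (interpret is not part of the released code):
def interpret(ssi_probability: float) -> str:
    """Map a model score to the review bands described above."""
    if ssi_probability < 0.4:
        return "likely negative"
    if ssi_probability <= 0.6:
        return "uncertain - requires review"
    return "likely positive"

print(interpret(0.8234))  # likely positive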
Training and Evaluation
Training Parameters
Model: bert-base-uncased
Optimizer: adamw_torch
Learning Rate: 2e-5
Batch Size: 32
Gradient Accumulation: 2
Gradient Checkpointing: True
Mixed Precision: BF16
Warmup Steps: 100
Max Grad Norm: 1.0
Weight Decay: 0.01
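These hyperparameters correspond one-to-one with HuggingFace TrainingArguments; a hedged sketch, assuming the transformers Trainer was used (the output_dir is illustrative):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/models/ssi-bert-pipeline/initial",  # illustrative path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    bf16=True,                  # mixed precision on supported GPUs
    optim="adamw_torch",
    warmup_steps=100,
    max_grad_norm=1.0,
    weight_decay=0.01,
)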
Evaluation Methodology
- Stratified train/val/test split (70/15/15)
- Class-weighted metrics due to imbalance
- Threshold optimization on validation set
- Held-out test set evaluation
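The reported metrics follow from standard scikit-learn calls; a minimal sketch with placeholder arrays standing in for the held-out test labels and model probabilities:
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Placeholder arrays standing in for the test set
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.7, 0.9, 0.6, 0.4])
y_pred = (y_prob >= 0.5).astype(int)  # default threshold

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_prob))  # uses raw probabilities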
Hardware
- GPU: NVIDIA RTX 5070 Ti (16GB VRAM)
- CPU: Multi-core processor
- RAM: 16GB+
- Storage: 1GB for model + dependencies
Model Versioning
Version: 1.0.0
- Release Date: 2025-01-15
- Status: Beta/Prototype
- Base Model: bert-base-uncased
- Training Epochs: 3
- Data: Synthetic (1M samples)
- Validation: Synthetic test set
Future Versions
- v1.1: Expected after real clinical data validation
- v2.0: Planned with ClinicalBERT base model
- v2.1: Multi-procedure optimization
How to Use
Installation
pip install transformers torch
Local Inference
from transformers import BertTokenizer, BertForSequenceClassification
import torch

# Load the fine-tuned checkpoint from the local pipeline output
model_path = "output/models/ssi-bert-pipeline/initial/final"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "Clinical note here..."
inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

# Softmax over the two logits; index 1 is the positive (SSI) class
probability = torch.softmax(outputs.logits, dim=1)[0, 1].item()
print(f"SSI probability: {probability:.4f}")
Via API
curl -X POST "http://localhost:8000/predict" \
-H "Content-Type: application/json" \
-d '{"text":"Clinical note text", "threshold":0.5}'
Batch Processing
python cli.py monitor \
--model-path output/models/ssi-bert-pipeline/initial/final \
--data data/clinical_notes.csv \
--period january_2024 \
--save-predictions
Citation
If you use this model, please cite:
@misc{ssi_bert_v1_2025,
  title={SSI-BERT-v1: BERT-based Surgical Site Infection Detection},
  author={Daryn Sutton/Ch3DS},
  year={2025},
  month={January},
  note={Trained on synthetic clinical notes for epidemiological surveillance}
}
License
- Model: Apache 2.0 (inherited from BERT-base-uncased)
- Documentation: CC-BY-4.0
Changelog
Version 1.0.0 (2025-01-15)
- Initial release
- Trained on 1M synthetic clinical notes
- Validated on 200k test samples
- Performance: 89% accuracy, 92% AUC-ROC
Contact and Support
For questions or issues:
- Documentation: See README.md
- Issue Tracking: github.com/Ch3w3y/SSIBERT
- Email: [email protected]
Disclaimer
This model is provided for research and surveillance purposes only. It is not intended for clinical diagnosis or treatment decisions. Always consult with qualified healthcare professionals for clinical decisions. The developers assume no liability for misuse or unintended consequences.
Model Card Last Updated: 2025-01-15
Model Version: 1.0.0
Status: Beta (Pre-production)