# Model Quantization Guide

## Overview

This guide covers the quantization functionality integrated into the SmolLM3 fine-tuning pipeline. The system supports creating quantized versions of trained models using `torchao` and automatically uploading them to Hugging Face Hub in a unified repository structure.

## Repository Structure

With the updated pipeline, all models (main and quantized) are stored in a single repository:

```
your-username/model-name/
├── README.md (unified model card)
├── config.json
├── pytorch_model.bin
├── tokenizer.json
├── tokenizer_config.json
├── int8/ (quantized model for GPU)
│   ├── README.md
│   ├── config.json
│   └── pytorch_model.bin
└── int4/ (quantized model for CPU)
    ├── README.md
    ├── config.json
    └── pytorch_model.bin
```

## Quantization Types

### int8 Weight-Only Quantization (GPU Optimized)

- **Memory Reduction**: ~50% compared to the original model
- **Speed**: Faster inference with minimal accuracy loss
- **Hardware**: GPU optimized for high-performance inference
- **Use Case**: Production deployments with GPU resources

### int4 Weight-Only Quantization (CPU Optimized)

- **Memory Reduction**: ~75% compared to the original model
- **Speed**: Significantly faster inference with some accuracy trade-off
- **Hardware**: CPU optimized for deployment
- **Use Case**: Edge deployment, CPU-only environments

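Both variants correspond to standard `torchao` weight-only schemes that can also be applied directly at load time through `transformers`. The snippet below is a minimal sketch, not the pipeline's own `quantize_model.py` logic; it assumes a recent `transformers` release that ships `TorchAoConfig`, a working `torchao` installation, and uses `your-username/model-name` as a placeholder repository:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# int8 weight-only quantization applied while loading the base model.
# For the int4 variant, use "int4_weight_only" with a group_size such as 128.
quant_config = TorchAoConfig("int8_weight_only")

model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",   # placeholder repository name
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```

In the pipeline, the equivalent conversion is handled by `scripts/model_tonic/quantize_model.py`, which then pushes the result into the `int8/` or `int4/` subdirectory shown above.
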
## Integration with Pipeline

### Automatic Quantization

The quantization process is integrated into the main training pipeline:

1. **Training**: Model is trained using the standard pipeline
2. **Model Push**: Main model is pushed to Hugging Face Hub
3. **Quantization Options**: User is prompted to create quantized versions
4. **Quantized Models**: Quantized models are created and pushed to subdirectories
5. **Unified Documentation**: Single model card covers all versions

### Pipeline Integration

The quantization step is added to `launch.sh` after the main model push:

```bash
# Step 16.5: Quantization Options
print_step "Step 16.5: Model Quantization Options"
echo "=========================================="

print_info "Would you like to create quantized versions of your model?"
print_info "Quantization reduces model size and improves inference speed."

# Ask about quantization
get_input "Create quantized models? (y/n)" "y" "CREATE_QUANTIZED"

if [ "$CREATE_QUANTIZED" = "y" ] || [ "$CREATE_QUANTIZED" = "Y" ]; then
    print_info "Quantization options:"
    print_info "1. int8_weight_only (GPU optimized, ~50% memory reduction)"
    print_info "2. int4_weight_only (CPU optimized, ~75% memory reduction)"
    print_info "3. Both int8 and int4 versions"

    select_option "Select quantization type:" "int8_weight_only" "int4_weight_only" "both" "QUANT_TYPE"

    # Create quantized models in the same repository
    python scripts/model_tonic/quantize_model.py /output-checkpoint "$REPO_NAME" \
        --quant-type "$QUANT_TYPE" \
        --device "$DEVICE" \
        --token "$HF_TOKEN" \
        --trackio-url "$TRACKIO_URL" \
        --experiment-name "${EXPERIMENT_NAME}-${QUANT_TYPE}" \
        --dataset-repo "$TRACKIO_DATASET_REPO"
fi
```

## Standalone Quantization

### Using the Standalone Script

For models already uploaded to Hugging Face Hub:

```bash
python scripts/model_tonic/quantize_standalone.py \
    "your-username/model-name" \
    "your-username/model-name" \
    --quant-type "int8_weight_only" \
    --device "auto" \
    --token "your-hf-token"
```

### Command Line Options

```bash
python scripts/model_tonic/quantize_standalone.py model_path repo_name [options]

Options:
  --quant-type {int8_weight_only,int4_weight_only,int8_dynamic}
                        Quantization type (default: int8_weight_only)
  --device DEVICE       Device for quantization (auto, cpu, cuda)
  --group-size GROUP_SIZE
                        Group size for quantization (default: 128)
  --token TOKEN         Hugging Face token
  --private             Create private repository
  --trackio-url TRACKIO_URL
                        Trackio URL for monitoring
  --experiment-name EXPERIMENT_NAME
                        Experiment name for tracking
  --dataset-repo DATASET_REPO
                        HF Dataset repository
  --save-only           Save quantized model locally without pushing to HF
```

## Loading Quantized Models

The quantized variants live in subdirectories of the main repository, so they are loaded with the `subfolder` argument of `from_pretrained`; the tokenizer files are stored at the repository root.

### Loading Main Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the main model
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```

### Loading int8 Quantized Model (GPU)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model (GPU optimized) from the int8/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int8",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```

### Loading int4 Quantized Model (CPU)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int4 quantized model (CPU optimized) from the int4/ subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name",
    subfolder="int4",
    device_map="cpu",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")
```

## Usage Examples

### Text Generation with Quantized Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the int8 quantized model from its subdirectory
model = AutoModelForCausalLM.from_pretrained(
    "your-username/model-name", subfolder="int8"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/model-name")

# Generate text
text = "The future of artificial intelligence is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Conversation with Quantized Model

```python
def chat_with_quantized_model(prompt, max_new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

response = chat_with_quantized_model("Hello, how are you today?")
print(response)
```

## Configuration Options

### Quantization Parameters

- **group_size**: Group size for quantization (default: 128)
- **device**: Target device for quantization (auto, cpu, cuda)
- **quant_type**: Type of quantization to apply

### Hardware Requirements

- **Main Model**: GPU with 8GB+ VRAM recommended
- **int8 Model**: GPU with 4GB+ VRAM
- **int4 Model**: CPU deployment possible

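To check the actual savings on your own checkpoints, you can compare memory footprints directly. This is an illustrative sketch rather than part of the pipeline; it assumes the repository layout above, uses `your-username/model-name` as a placeholder, and relies on `get_memory_footprint()` from `transformers`:

```python
import torch
from transformers import AutoModelForCausalLM

repo = "your-username/model-name"  # placeholder repository name

# Load the full-precision model and its int8 variant; note that holding
# both in memory at once requires space for the two copies combined.
full = AutoModelForCausalLM.from_pretrained(
    repo, device_map="auto", torch_dtype=torch.bfloat16
)
int8 = AutoModelForCausalLM.from_pretrained(
    repo, subfolder="int8", device_map="auto", torch_dtype=torch.bfloat16
)

full_gb = full.get_memory_footprint() / 1024**3
int8_gb = int8.get_memory_footprint() / 1024**3
print(f"full: {full_gb:.2f} GiB, int8: {int8_gb:.2f} GiB "
      f"({100 * int8_gb / full_gb:.0f}% of the original)")
```

The table below gives the expected ballpark; actual figures depend on the model size and the group size used during quantization.
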
## Performance Comparison

| Model Type | Memory Usage | Speed | Accuracy | Use Case |
|------------|--------------|-------|----------|----------|
| Original | 100% | Baseline | Best | Development, Research |
| int8 | ~50% | Faster | Minimal loss | Production GPU |
| int4 | ~25% | Fastest | Some loss | Edge, CPU deployment |

## Best Practices

### When to Use Quantization

1. **int8 (GPU)**: When you need faster inference with minimal accuracy loss
2. **int4 (CPU)**: When deploying to CPU-only environments or edge devices
3. **Both**: When you need flexibility for different deployment scenarios

### Memory Optimization

- Use int8 for GPU deployments with memory constraints
- Use int4 for CPU deployments or very memory-constrained environments
- Consider the trade-off between speed and accuracy

### Deployment Considerations

- Test quantized models on your specific use case
- Monitor performance and accuracy in production
- Consider using the main model for development and quantized versions for deployment

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**: Reduce batch size or use int8 quantization
2. **Import Errors**: Install torchao: `pip install "torchao>=0.10.0"`
3. **Model Loading Errors**: Ensure the model path is correct and accessible

### Debugging

```bash
# Test quantization functionality
python tests/test_quantization.py

# Check torchao installation
python -c "import torchao; print('torchao available')"

# Verify model files
ls -la /path/to/model/
```

## Monitoring and Tracking

### Trackio Integration

Quantization events are logged to Trackio:

- `quantization_started`: When quantization begins
- `quantization_completed`: When quantization finishes
- `quantized_model_pushed`: When the model is uploaded to HF Hub
- `quantization_failed`: If quantization fails

### Metrics Tracked

- Quantization type and parameters
- Model size reduction
- Upload URLs for quantized models
- Processing time and success status

## Dependencies

### Required Packages

```bash
pip install "torchao>=0.10.0"
pip install "transformers>=4.35.0"
pip install "huggingface_hub>=0.16.0"
```

### Optional Dependencies

```bash
pip install "accelerate>=0.20.0"    # For device mapping
pip install "bitsandbytes>=0.41.0"  # For additional quantization
```

## References

- [torchao Documentation](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
- [Hugging Face Model Cards](https://huggingface.co/docs/hub/model-cards)
- [Transformers Quantization Guide](https://huggingface.co/docs/transformers/main/en/quantization)

## Support

For issues and questions:

1. Check the troubleshooting section above
2. Review the test file `tests/test_quantization.py`
3. Open an issue on the project repository
4. Check the Trackio monitoring for detailed logs