# PlasmidGPT (Addgene GPT-2 Compatible Version)

This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), adapted for easier integration with the `transformers` library and HuggingFace infrastructure.

## 🔬 About PlasmidGPT

PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share similar characteristics with engineered plasmids while maintaining low sequence identity to training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids.

**Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)  
**Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)  
**Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)

### Key Features

- **Novel Sequence Generation**: Generates novel plasmid sequences rather than replicating training data
- **Conditional Generation**: Supports generation based on user-specified starting sequences
- **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type
- **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters

## 🆚 Differences from Original

This version provides:
- ✅ Native HuggingFace `transformers` compatibility (no custom loading required)
- ✅ Standard model format (`model.safetensors` instead of `.pt`)
- ✅ Direct `AutoModel` and `AutoTokenizer` support
- ✅ Simplified installation and usage

## 📦 Installation

```bash
pip install torch transformers
```

## 🚀 Quick Start

### Basic Sequence Generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# trust_remote_code=True lets transformers run custom code shipped with the repo, if any
model = AutoModelForCausalLM.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
).to(device)
model.eval()  # inference mode (disables dropout)

tokenizer = AutoTokenizer.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)

# Prompt: the generated plasmid continues from this DNA prefix
start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

outputs = model.generate(
    input_ids,
    max_length=300,  # total length in tokens, prompt included
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,  # sample instead of greedy decoding
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated sequence: {generated_sequence}")
```

### Generate Multiple Sequences

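```python
# Reuses `model`, `tokenizer`, and `input_ids` from Basic Sequence Generation
outputs = model.generate(
    input_ids,
    max_length=500,  # total length in tokens, prompt included
    num_return_sequences=5,
    temperature=1.2,  # values above 1.0 flatten the distribution (more diversity)
    do_sample=True,
    top_k=50,  # sample only from the 50 most likely tokens
    top_p=0.95,  # then keep the smallest set covering 95% of the probability
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# Print the first 100 characters of each generated sequence
for i, output in enumerate(outputs):
    sequence = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Sequence {i+1}: {sequence[:100]}...")
```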
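
To make the effect of `top_k` and `top_p` more concrete, here is a simplified, standalone sketch of the filtering idea applied to a toy next-token distribution. It is an illustration only: `top_k_top_p_filter` is a hypothetical helper, the single-nucleotide tokens are made up for readability (the real vocabulary is much larger), and `transformers` applies the equivalent steps to logits rather than to a dictionary of probabilities.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches top_p, and renormalize."""
    # Step 1: top-k, keep only the k highest-probability tokens
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(token, p / total) for token, p in ranked]

    # Step 2: top-p (nucleus), keep tokens until cumulative mass >= top_p
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


# Toy distribution over four hypothetical tokens
toy_probs = {"A": 0.5, "C": 0.3, "G": 0.15, "T": 0.05}
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print(filtered)
```

In this toy case only `A` and `C` survive: `T` is dropped by the top-k step and `G` by the top-p step. Sampling then draws from the renormalized remainder.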

### Extract Embeddings

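```python
# Reuses `torch`, `model`, `tokenizer`, and `device` from Basic Sequence Generation
with torch.no_grad():
    # "ATGCGTACG..." is a placeholder; pass your full DNA sequence here
    input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
    # Request hidden states for this call instead of changing the model config
    outputs = model(input_ids, output_hidden_states=True)
    hidden_states = outputs.hidden_states[-1]  # last layer: (batch, tokens, hidden)
    # Mean-pool over the token dimension to get one vector per sequence
    embedding = hidden_states.mean(dim=1).cpu().numpy()

print(f"Embedding shape: {embedding.shape}")  # (batch_size, hidden_size)
```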
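
One common way to use pooled embeddings downstream is to compare sequences by cosine similarity. The sketch below uses only the standard library; `cosine_similarity` is a hypothetical helper, and the short toy vectors stand in for real embeddings, which would have one entry per hidden dimension.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, not real model output)
emb_a = [0.2, -0.1, 0.4]
emb_b = [0.4, -0.2, 0.8]   # same direction as emb_a
emb_c = [-0.1, -0.2, 0.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

With the NumPy array from the example above, a single embedding could be passed as `embedding[0].tolist()`.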

## 🎯 Use Cases

- **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications
- **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks
- **Feature Prediction**: Predict properties like lab of origin, species, or vector type
- **Conditional Generation**: Create sequences starting from specific promoters or genes

## 📊 Model Details

| Parameter | Value |
|-----------|-------|
| **Architecture** | GPT-2 (Decoder-only Transformer) |
| **Parameters** | 110 million |
| **Layers** | 12 |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 |
| **Context Length** | 2048 tokens |
| **Vocabulary Size** | 30,002 |
| **Training Data** | 153k Addgene plasmid sequences |
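
As a rough consistency check, the 110 million figure can be approximately reproduced from the table values. The sketch below assumes the standard GPT-2 layout (tied input and output embeddings, learned position embeddings, a 4x MLP expansion, and biases on all linear layers); these are assumptions about the architecture, not details confirmed by this model card, and `gpt2_param_count` is a hypothetical helper.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate the parameter count of a standard GPT-2 style model."""
    d = n_embd
    token_emb = vocab_size * d       # token embeddings (shared with the output head)
    pos_emb = n_positions * d        # learned position embeddings
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # up-projection + down-projection
    layer_norms = 2 * (2 * d)        # two LayerNorms per block (weight + bias)
    per_layer = attn + mlp + layer_norms
    final_norm = 2 * d               # final LayerNorm
    return token_emb + pos_emb + n_layer * per_layer + final_norm


total = gpt2_param_count(vocab_size=30002, n_positions=2048, n_embd=768, n_layer=12)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate comes to 109,670,400 parameters, which matches the stated 110 million after rounding.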

## 📚 Citation

If you use this model, please cite the original PlasmidGPT paper:

```bibtex
@article{shao2024plasmidgpt,
  title={PlasmidGPT: a generative framework for plasmid design and annotation},
  author={Shao, Bin and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.09.30.615762},
  url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
}
```

## 📄 License

This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details.

## 🙏 Credits

**Original Author:** Bin Shao (lingxusb)  
**Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)  
**Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)

This compatibility version was created to facilitate easier integration with modern ML workflows while preserving the generation and embedding capabilities of the original model.

## 🔗 Related Resources

- [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
- [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
- [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
- [Addgene Plasmid Repository](https://www.addgene.org/)

## ⚠️ Notes

- The model generates DNA sequences for research purposes
- Generated sequences should be validated before experimental use
- The model was trained on Addgene plasmids and performs best on similar sequence types
- For prediction tasks (lab, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for prediction model weights
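
As a first in-silico step before any further validation, a generated sequence can be screened with a few basic checks. The sketch below is a minimal example using only the standard library; `check_dna_sequence` and its `min_length` parameter are hypothetical, and the checks (alphabet, length, GC content) are illustrative choices rather than a procedure from the original work.

```python
def check_dna_sequence(seq, min_length=1):
    """Run basic sanity checks on a DNA string and return a small report."""
    seq = seq.strip().upper()
    # Any character outside A/C/G/T is flagged (including ambiguity codes like N)
    invalid_chars = sorted(set(seq) - set("ACGT"))
    # GC content as a fraction of the sequence length (0.0 for an empty string)
    gc_content = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    return {
        "length": len(seq),
        "invalid_chars": invalid_chars,
        "gc_content": gc_content,
        "ok": not invalid_chars and len(seq) >= min_length,
    }


report = check_dna_sequence("ATGGCTAGCGAATTCGGCGCGCCT", min_length=10)
print(report)
```

A report with `ok` set to `False` points to unexpected characters or a sequence shorter than `min_length`; passing these checks says nothing about biological function.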