# PlasmidGPT (Addgene GPT-2 Compatible Version)

This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), optimized for easier integration with the modern `transformers` library and HuggingFace infrastructure.

## 🔬 About PlasmidGPT

PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share similar characteristics with engineered plasmids while maintaining low sequence identity to training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids.

**Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)

**Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT)

**Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT)

### Key Features

- **Novel Sequence Generation**: Generates novel plasmid sequences rather than replicating training data
- **Conditional Generation**: Supports generation based on user-specified starting sequences
- **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type
- **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters

## 🆚 Differences from Original

This version provides:

- ✅ Native HuggingFace `transformers` compatibility (no custom loading required)
- ✅ Standard model format (`model.safetensors` instead of `.pt`)
- ✅ Direct `AutoModel` and `AutoTokenizer` support
- ✅ Simplified installation and usage

## 📦 Installation

```bash
pip install torch transformers
```

## 🚀 Quick Start

### Basic Sequence Generation

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Use a GPU when one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the model and tokenizer from the HuggingFace Hub
model = AutoModelForCausalLM.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
).to(device)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(
    "McClain/plasmidgpt-addgene-gpt2",
    trust_remote_code=True
)

# Encode the starting sequence used as the generation prompt
start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT'
input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device)

# Sample one continuation; max_length counts tokens, including the prompt
outputs = model.generate(
    input_ids,
    max_length=300,
    num_return_sequences=1,
    temperature=1.0,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated sequence: {generated_sequence}")
```

### Generate Multiple Sequences

```python
# Sample five sequences with top-k and nucleus (top-p) filtering
outputs = model.generate(
    input_ids,
    max_length=500,
    num_return_sequences=5,
    temperature=1.2,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# Print the first 100 characters of each generated sequence
for i, output in enumerate(outputs):
    sequence = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Sequence {i+1}: {sequence[:100]}...")
```

### Extract Embeddings

```python
# Ask the model to return the hidden states of every layer
model.config.output_hidden_states = True

with torch.no_grad():
    input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device)
    outputs = model(input_ids)
    # Last-layer hidden states, shape (batch, tokens, hidden_size)
    hidden_states = outputs.hidden_states[-1]
    # Mean-pool over the token dimension to get one vector per sequence
    embedding = hidden_states.mean(dim=1).cpu().numpy()
    print(f"Embedding shape: {embedding.shape}")
```

## 🎯 Use Cases

- **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications
- **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks
- **Feature Prediction**: Predict properties like lab of origin, species, or vector type
- **Conditional Generation**: Create sequences starting from specific promoters or genes

## 📊 Model Details

| Parameter | Value |
|-----------|-------|
| **Architecture** | GPT-2 (Decoder-only Transformer) |
| **Parameters** | 110 million |
| **Layers** | 12 |
| **Hidden Size** | 768 |
| **Attention Heads** | 12 |
| **Context Length** | 2048 tokens |
| **Vocabulary Size** | 30,002 |
| **Training Data** | 153k Addgene plasmid sequences |

## 📚 Citation

If you use this model, please cite the original PlasmidGPT paper:

```bibtex
@article{shao2024plasmidgpt,
  title={PlasmidGPT: a generative framework for plasmid design and annotation},
  author={Shao, Bin and others},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.09.30.615762},
  url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1}
}
```

## 📄 License

This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details.

## 🙏 Credits

**Original Author:** Bin Shao (lingxusb)

**Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT)

**Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)

This compatibility version was created to facilitate easier integration with modern ML workflows while preserving all capabilities of the original model.

## 🔗 Related Resources

- [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT)
- [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT)
- [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1)
- [Addgene Plasmid Repository](https://www.addgene.org/)

## ⚠️ Notes

- The model generates DNA sequences for research purposes
- Generated sequences should be validated before experimental use
- The model was trained on Addgene plasmids and performs best on similar sequence types
- For prediction tasks (lab, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for prediction model weights
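
Validating a generated sequence, as the notes recommend, can begin with a quick computational sanity check before any annotation or wet-lab work. The sketch below is a minimal, hypothetical example in plain Python: the helper name `basic_sequence_report` and the checks it runs (allowed alphabet, length, GC fraction) are assumptions chosen for illustration, not part of PlasmidGPT or the original repository, and they do not replace proper sequence annotation or experimental validation.

```python
def basic_sequence_report(sequence: str) -> dict:
    """Hypothetical helper: basic sanity statistics for a DNA string."""
    seq = sequence.strip().upper()
    # Characters outside A/C/G/T are reported as invalid (assumption: N is not allowed)
    invalid = sorted(set(seq) - set("ACGT"))
    gc_count = seq.count("G") + seq.count("C")
    # Guard against division by zero for empty input
    gc_fraction = gc_count / len(seq) if seq else 0.0
    return {
        "length": len(seq),
        "invalid_characters": invalid,
        "gc_fraction": gc_fraction,
        "is_valid_dna": bool(seq) and not invalid,
    }


# Example input: the start sequence from the Quick Start section
report = basic_sequence_report("ATGGCTAGCGAATTCGGCGCGCCT")
print(report)
```

In practice, the string passed in would be the decoded model output, such as `generated_sequence` from the Quick Start example.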