| # PlasmidGPT (Addgene GPT-2 Compatible Version) |
|
|
| This is a **compatibility-enhanced version** of [PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) by Bin Shao (lingxusb), repackaged so that it loads directly with the HuggingFace `transformers` library and Hub tooling. |
|
|
| ## About PlasmidGPT |
|
|
| PlasmidGPT is a generative language model pretrained on 153,000 engineered plasmid sequences from [Addgene](https://www.addgene.org/). It generates de novo plasmid sequences that share similar characteristics with engineered plasmids while maintaining low sequence identity to training data. The model can generate plasmids in a controlled manner based on input sequences or specific design constraints, and learns informative embeddings for both engineered and natural plasmids. |
|
|
| **Original work:** [PlasmidGPT: a generative framework for plasmid design and annotation](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1) |
| **Original repository:** [github.com/lingxusb/PlasmidGPT](https://github.com/lingxusb/PlasmidGPT) |
| **Original model:** [huggingface.co/lingxusb/PlasmidGPT](https://huggingface.co/lingxusb/PlasmidGPT) |
|
|
| ### Key Features |
|
|
| - **Novel Sequence Generation**: Generates de novo plasmid sequences with low sequence identity to the training data |
| - **Conditional Generation**: Supports generation based on user-specified starting sequences |
| - **Versatile Predictions**: Predicts sequence-related attributes including lab of origin, species, and vector type |
| - **Transformer Architecture**: Decoder-only transformer with 12 layers and 110 million parameters |
|
|
| ## Differences from Original |
|
|
| This version provides: |
| - ✅ Native HuggingFace `transformers` compatibility (no custom loading required) |
| - ✅ Standard model format (`model.safetensors` instead of `.pt`) |
| - ✅ Direct `AutoModel` and `AutoTokenizer` support |
| - ✅ Simplified installation and usage |
|
|
| ## Installation |
|
|
| ```bash |
| pip install torch transformers |
| ``` |
|
|
| ## Quick Start |
|
|
| ### Basic Sequence Generation |
|
|
| ```python |
| import torch |
| from transformers import AutoTokenizer, AutoModelForCausalLM |
| |
| # Use a GPU when one is available |
| device = 'cuda' if torch.cuda.is_available() else 'cpu' |
| |
| model = AutoModelForCausalLM.from_pretrained( |
|     "McClain/plasmidgpt-addgene-gpt2", |
|     trust_remote_code=True |
| ).to(device) |
| model.eval()  # inference mode: disables dropout |
| |
| tokenizer = AutoTokenizer.from_pretrained( |
|     "McClain/plasmidgpt-addgene-gpt2", |
|     trust_remote_code=True |
| ) |
| |
| # Prompt: the DNA prefix that generation continues from |
| start_sequence = 'ATGGCTAGCGAATTCGGCGCGCCT' |
| input_ids = tokenizer.encode(start_sequence, return_tensors='pt').to(device) |
| |
| outputs = model.generate( |
|     input_ids, |
|     max_length=300,  # total length in tokens, prompt included (not base pairs) |
|     num_return_sequences=1, |
|     temperature=1.0, |
|     do_sample=True,  # sample from the distribution instead of greedy decoding |
|     pad_token_id=tokenizer.pad_token_id, |
|     eos_token_id=tokenizer.eos_token_id |
| ) |
| |
| generated_sequence = tokenizer.decode(outputs[0], skip_special_tokens=True) |
| print(f"Generated sequence: {generated_sequence}") |
| ``` |
|
|
| ### Generate Multiple Sequences |
|
|
| ```python |
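| # Reuses `model`, `tokenizer`, and `input_ids` from the previous example |
| outputs = model.generate( |
|     input_ids, |
|     max_length=500, |
|     num_return_sequences=5,  # sample five continuations of the same prompt |
|     temperature=1.2,  # values above 1.0 flatten the distribution for more diversity |
|     do_sample=True, |
|     top_k=50,  # keep only the 50 most likely tokens at each step |
|     top_p=0.95,  # then keep the smallest set covering 95% of the probability |
|     pad_token_id=tokenizer.pad_token_id, |
|     eos_token_id=tokenizer.eos_token_id |
| ) |
| |
| for i, output in enumerate(outputs): |
|     sequence = tokenizer.decode(output, skip_special_tokens=True) |
|     print(f"Sequence {i+1}: {sequence[:100]}...") |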
| ``` |
|
|
| ### Extract Embeddings |
|
|
| ```python |
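| # Reuses `model`, `tokenizer`, and `device` from the Quick Start example |
| model.config.output_hidden_states = True |
| |
| with torch.no_grad(): |
|     # Replace the placeholder below with a full DNA sequence |
|     input_ids = tokenizer.encode("ATGCGTACG...", return_tensors='pt').to(device) |
|     outputs = model(input_ids) |
|     hidden_states = outputs.hidden_states[-1]  # last layer: (batch, tokens, hidden) |
|     embedding = hidden_states.mean(dim=1).cpu().numpy()  # mean-pool over tokens |
| |
| print(f"Embedding shape: {embedding.shape}") |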
| ``` |
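One common downstream use of such embeddings is comparing two plasmids by cosine similarity. The sketch below is illustrative only: `cosine_similarity` is a hypothetical helper that is not part of this repository, and the short toy vectors stand in for real 768-dimensional embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for real embeddings
emb_a = [0.5, 1.0, -0.5, 2.0]
emb_b = [1.0, 2.0, -1.0, 4.0]   # same direction as emb_a
emb_c = [-2.0, 1.0, 0.0, 0.0]   # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # close to 0: orthogonal
```

With the `embedding` array from the example above, which has a leading batch dimension of 1, a row can be passed as `embedding[0].tolist()`.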
|
|
| ## Use Cases |
|
|
| - **Plasmid Design**: Generate novel plasmid sequences for synthetic biology applications |
| - **Sequence Analysis**: Extract meaningful embeddings for downstream ML tasks |
| - **Feature Prediction**: Predict properties like lab of origin, species, or vector type |
| - **Conditional Generation**: Create sequences starting from specific promoters or genes |
|
|
| ## Model Details |
|
|
| | Parameter | Value | |
| |-----------|-------| |
| | **Architecture** | GPT-2 (Decoder-only Transformer) | |
| | **Parameters** | 110 million | |
| | **Layers** | 12 | |
| | **Hidden Size** | 768 | |
| | **Attention Heads** | 12 | |
| | **Context Length** | 2048 tokens | |
| | **Vocabulary Size** | 30,002 | |
| | **Training Data** | 153k Addgene plasmid sequences | |
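As a sanity check, the figures in the table can be combined into an approximate parameter count. The sketch below assumes the standard GPT-2 layout: learned position embeddings, a 4x-wide MLP, bias terms on every linear layer, and an output head that shares weights with the token embedding. Those assumptions are not stated in this model card, and `gpt2_parameter_count` is a hypothetical helper.

```python
def gpt2_parameter_count(vocab_size: int, context_length: int,
                         hidden_size: int, num_layers: int) -> int:
    """Parameter count for a standard GPT-2 style decoder (assumed layout)."""
    d = hidden_size
    token_embeddings = vocab_size * d
    position_embeddings = context_length * d
    # Attention: fused query/key/value projection plus output projection, with biases
    attention = (d * 3 * d + 3 * d) + (d * d + d)
    # MLP: expand to 4*d, then project back to d, with biases
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)
    # Two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * d)
    per_layer = attention + mlp + layer_norms
    final_layer_norm = 2 * d
    # The output head is assumed to reuse the token embedding matrix (no extra weights)
    return (token_embeddings + position_embeddings
            + num_layers * per_layer + final_layer_norm)


total = gpt2_parameter_count(
    vocab_size=30_002, context_length=2048, hidden_size=768, num_layers=12
)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
# prints: 109,670,400 parameters (~110M)
```

Under these assumptions the total rounds to the 110 million listed in the table. The number of attention heads does not change the count, since the heads split the same projection matrices.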
|
|
| ## Citation |
|
|
| If you use this model, please cite the original PlasmidGPT paper: |
|
|
| ```bibtex |
| @article{shao2024plasmidgpt, |
| title={PlasmidGPT: a generative framework for plasmid design and annotation}, |
| author={Shao, Bin and others}, |
| journal={bioRxiv}, |
| year={2024}, |
| doi={10.1101/2024.09.30.615762}, |
| url={https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1} |
| } |
| ``` |
|
|
| ## License |
|
|
| This model inherits the license from the original PlasmidGPT repository. Please refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for licensing details. |
|
|
| ## Credits |
|
|
| **Original Author:** Bin Shao (lingxusb) |
| **Original Work:** [PlasmidGPT GitHub Repository](https://github.com/lingxusb/PlasmidGPT) |
| **Paper:** [bioRxiv 2024.09.30.615762](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1) |
|
|
| This compatibility version was created to make the model easier to integrate into modern ML workflows while preserving the original model's sequence generation and embedding capabilities. |
|
|
| ## Related Resources |
|
|
| - [Original PlasmidGPT Repository](https://github.com/lingxusb/PlasmidGPT) |
| - [Original HuggingFace Model](https://huggingface.co/lingxusb/PlasmidGPT) |
| - [PlasmidGPT Paper (bioRxiv)](https://www.biorxiv.org/content/10.1101/2024.09.30.615762v1) |
| - [Addgene Plasmid Repository](https://www.addgene.org/) |
|
|
| ## Notes |
|
|
| - The model generates DNA sequences for research purposes |
| - Generated sequences should be validated before experimental use |
| - The model was trained on Addgene plasmids and performs best on similar sequence types |
| - For prediction tasks (lab, species, vector type), refer to the [original repository](https://github.com/lingxusb/PlasmidGPT) for prediction model weights |
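Before any deeper validation, a quick computational sanity check can catch obviously malformed output. The sketch below is a minimal example, not a substitute for experimental validation. `basic_sequence_report` is a hypothetical helper, and it assumes the decoded output is a plain string of bases, so anything other than A, C, G, or T (including `N` and other IUPAC ambiguity codes) is reported as invalid.

```python
VALID_BASES = set("ACGT")


def basic_sequence_report(sequence: str) -> dict:
    """Summarize length, unexpected characters, and GC content (hypothetical helper)."""
    seq = sequence.strip().upper()
    # Any character outside A/C/G/T is reported, including N and whitespace
    invalid = sorted(set(seq) - VALID_BASES)
    gc_count = sum(1 for base in seq if base in "GC")
    return {
        "length": len(seq),
        "invalid_characters": invalid,
        "gc_fraction": gc_count / len(seq) if seq else 0.0,
    }


# The start sequence from the Quick Start example, used here as sample input
report = basic_sequence_report("ATGGCTAGCGAATTCGGCGCGCCT")
print(report)
# prints: {'length': 24, 'invalid_characters': [], 'gc_fraction': 0.625}
```

A non-empty `invalid_characters` list or an extreme `gc_fraction` is a hint to inspect the sequence more closely before spending further effort on it.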
|
|