End-to-end example for AA sequence vectorization

by ptynecki - opened Jun 29, 2022

edited Jun 29, 2022

Hello,

Thank you for the wonderful research.
Could you share an end-to-end example of amino acid (AA) sequence vectorization using ProtGPT2?

It would cover loading and running the ProtGPT2 model and tokenizer, something like:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "nferruz/ProtGPT2"

model = AutoModelForCausalLM.from_pretrained(model_name).to('cuda:0')
tokenizer = AutoTokenizer.from_pretrained(model_name)

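protein_sequences = [
    "MINDLLDISRIISGKMTLDRAEVNLTAIARQVVEEQRQAAEAKSIQLLCSTPDTNHYVFG",
    (...)
]

# GPT-2 style tokenizers often define no padding token; reuse EOS so several sequences can be batched
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

input_ids = tokenizer(protein_sequences, return_tensors="pt", padding=True).to('cuda:0')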

outputs = model(**input_ids)
(...)

It would also be great to show how to turn the output into a fixed-length numeric vector for each protein.

Such an example would help other researchers compare the protein vector space with other embeddings such as ProtTrans or ESM.

Regards,
Piotr

nferruz

Owner Jun 29, 2022

Hi Piotr,

Thanks a lot for reaching out!

I haven't explored sequence embedding myself yet, since I trained ProtGPT2 with protein design in mind, but I'd love to see how it performs.
That said, going by the original GPT paper (the idea wasn't explored in the GPT-2 and GPT-3 papers), it should be possible to embed the sequences by taking the attention heads as a vector.

Following your code, you could do:

outputs = model(**input_ids, output_attentions=True)
This returns a dictionary-like output with the keys logits, past_key_values, and attentions (plus loss if you also pass labels).

outputs.attentions is a tuple with one tensor per layer, each of shape (batch_size, num_heads, sequence_length, sequence_length),
in your case: torch.Size([1, 20, 21, 21]).

It is 21 because your 60-amino-acid sequence is converted to a 21-token sequence.
This will be a problem if you want vectors of the same length after mutating a single amino acid, because the number of tokens could change.
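One common way to sidestep the changing token count is to average per-token vectors over the sequence axis, so every protein ends up with a vector of the model's hidden size regardless of how many tokens it has. This is an assumption borrowed from general practice with transformer embeddings, not something tested with ProtGPT2 in this thread. The sketch below uses random NumPy arrays as a stand-in for the per-token hidden states (in Hugging Face these could come from outputs.hidden_states[-1] after calling the model with output_hidden_states=True); the function name mean_pool and the small hidden size are hypothetical and only for illustration.

```python
import numpy as np


def mean_pool(hidden_states, attention_mask):
    """Average token vectors over the sequence axis, ignoring padded positions.

    hidden_states:  (batch_size, sequence_length, hidden_size)
    attention_mask: (batch_size, sequence_length), 1 for real tokens, 0 for padding
    returns:        (batch_size, hidden_size)
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Stand-in data (NOT real model output): two proteins tokenized to 21 and 22
# tokens, padded to a common length of 22.
rng = np.random.default_rng(0)
hidden_size = 8  # illustrative only; the real value is the model's hidden size
hidden_states = rng.normal(size=(2, 22, hidden_size))
attention_mask = np.zeros((2, 22), dtype=np.int64)
attention_mask[0, :21] = 1  # first protein: 21 real tokens, 1 padding token
attention_mask[1, :22] = 1  # second protein: 22 real tokens

embeddings = mean_pool(hidden_states, attention_mask)
print(embeddings.shape)  # (2, 8)
```

Both rows have the same length even though the token counts differ, which is what makes the vectors comparable across mutated sequences. Whether mean pooling, the last token's state, or something attention-based works best for ProtGPT2 would need to be checked empirically.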

I hope this helps for now. In the meantime, I'm going to read up on how to do this with autoregressive models in HuggingFace. I'll get back to you; sorry that I don't have hands-on experience to show!

Noelia
