fine-tune NeuCodec

by nmcuong - opened 8 days ago

8 days ago

Could you please guide me on how to fine-tune the NeuCodec model? If possible, I would like to fully fine-tune the model, from the encoder to the decoder. Thank you.

LenDigLearn

Digital Learning GmbH org about 23 hours ago

edited about 23 hours ago

Hi, sorry I was down with the flu for the past week.

In essence, NeuCodec is just X-Codec-2.0, so you can use the X-Codec-2.0 codebase with minimal adjustments (compare its lightning_module.py with NeuCodec's neucodec/model.py).

What's important is that some parts of the model are only used during training, namely:

fc_post_s
SemanticDecoder_module
model_spk
discriminator
spec_discriminator

Of these, Neuphonic does ship SemanticDecoder_module and fc_post_s in their pytorch_model.bin (see NeuCodec's from_pretrained method, which filters these modules out when loading the model). The remaining ones (model_spk, discriminator and spec_discriminator) will have to be initialized randomly.

Since I'm not that familiar with PyTorch Lightning and wanted to be able to prototype quickly, I adapted the X-Codec-2.0 code so that I can use regular PyTorch. I'm quite certain you could also just use the X-Codec-2.0 code as-is, though. For this, you would probably have to adjust lightning_module.py so that it doesn't construct a fresh model (see construct_model()) and instead only constructs the parts that are missing and loads the rest from pytorch_model.bin, like in the from_pretrained method mentioned above.
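To make the split between shipped and missing modules concrete, here is a minimal sketch in plain Python. It assumes that the checkpoint's state-dict keys start with the module names listed above followed by a dot (for example `fc_post_s.weight`); that key layout is an assumption and should be checked against the real file. The helper names are made up for illustration and are not part of NeuCodec or X-Codec-2.0.

```python
# Module names come from the list in this post. The "<module name>." key-prefix
# layout is an ASSUMPTION about the checkpoint, not a confirmed fact.
SHIPPED_TRAINING_ONLY = ("fc_post_s", "SemanticDecoder_module")
MISSING_TRAINING_ONLY = ("model_spk", "discriminator", "spec_discriminator")


def top_level_module(key):
    """Return the first dotted component of a state-dict key."""
    return key.split(".", 1)[0]


def split_checkpoint(state_dict):
    """Split a state dict into (inference weights, shipped training-only weights)."""
    inference, training_only = {}, {}
    for key, value in state_dict.items():
        if top_level_module(key) in SHIPPED_TRAINING_ONLY:
            training_only[key] = value
        else:
            inference[key] = value
    return inference, training_only


def modules_to_init_randomly(state_dict):
    """List training-only modules that have no weights in the checkpoint."""
    present = {top_level_module(key) for key in state_dict}
    return [name for name in MISSING_TRAINING_ONLY if name not in present]


if __name__ == "__main__":
    # Toy stand-in for a real checkpoint; "hypothetical_decoder" is a placeholder name.
    fake_checkpoint = {
        "fc_post_s.weight": 0,
        "SemanticDecoder_module.conv.bias": 0,
        "hypothetical_decoder.weight": 0,
    }
    inference, training_only = split_checkpoint(fake_checkpoint)
    print(sorted(training_only))
    print(modules_to_init_randomly(fake_checkpoint))
```

With a real checkpoint, the dictionary would typically come from `torch.load("pytorch_model.bin", map_location="cpu")`, and the two returned dictionaries could then be loaded into the matching modules.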

For training itself, I used the default hyperparameters given by X-Codec-2.0. However, you need to change one very important thing: since the decoder operates at 24000 Hz, the STFT_PARAMS for the spec_discriminator have to be adjusted:

# if you use lightning, you will probably adjust these values in the config/model/default.yaml file
STFT_PARAMS = {
  "fft_sizes": [117, 189, 309, 501, 813, 1314, 2127, 3444],
  "hop_sizes": [58, 94, 154, 250, 406, 657, 1064, 1722],
  "win_lengths": [117, 189, 309, 501, 813, 1314, 2127, 3444],
  "window": "hann_window"
}
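As a quick sanity check on these values, the sketch below converts each resolution into milliseconds at 24000 Hz and verifies two basic invariants (all lists have the same length, and no window is longer than its FFT size, which `torch.stft` requires). The helper names are illustrative only and not part of either codebase.

```python
SAMPLE_RATE = 24_000  # Hz, as stated in this post

STFT_PARAMS = {
    "fft_sizes": [117, 189, 309, 501, 813, 1314, 2127, 3444],
    "hop_sizes": [58, 94, 154, 250, 406, 657, 1064, 1722],
    "win_lengths": [117, 189, 309, 501, 813, 1314, 2127, 3444],
    "window": "hann_window",
}


def samples_to_ms(num_samples, sample_rate=SAMPLE_RATE):
    """Convert a length in samples to milliseconds."""
    return 1000.0 * num_samples / sample_rate


def describe_resolutions(params, sample_rate=SAMPLE_RATE):
    """Return one (fft_size, window_ms, hop_ms) tuple per STFT resolution."""
    ffts, hops, wins = params["fft_sizes"], params["hop_sizes"], params["win_lengths"]
    if not (len(ffts) == len(hops) == len(wins)):
        raise ValueError("fft_sizes, hop_sizes and win_lengths must have equal length")
    rows = []
    for fft, hop, win in zip(ffts, hops, wins):
        if win > fft:
            raise ValueError(f"win_length {win} exceeds fft_size {fft}")
        rows.append((fft, samples_to_ms(win, sample_rate), samples_to_ms(hop, sample_rate)))
    return rows


if __name__ == "__main__":
    for fft, win_ms, hop_ms in describe_resolutions(STFT_PARAMS):
        print(f"fft={fft:5d}  window={win_ms:8.3f} ms  hop={hop_ms:7.3f} ms")
```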

I used the German and English subsets of Emilia-YODAS, which I have in WebDataset format. For batching to work, I found it easiest to just crop all audio clips to a fixed length, e.g. 3 seconds (dropping the ones that are shorter if necessary).
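The crop-or-drop step can be sketched as follows. This is a minimal illustration that works on any sliceable sequence of samples (a list, a NumPy array, a 1-D tensor). Choosing a random crop offset is an assumption made here; the post only says that clips are cropped. The function names are hypothetical.

```python
import random


def crop_or_drop(samples, sample_rate, seconds=3.0, rng=random):
    """Return a fixed-length crop of `samples`, or None if the clip is too short."""
    target = int(round(seconds * sample_rate))
    if len(samples) < target:
        return None  # too short: the caller drops this clip
    # ASSUMPTION: random offset; cropping from the start would work as well.
    start = rng.randint(0, len(samples) - target)
    return samples[start:start + target]


def make_batch(clips, sample_rate, seconds=3.0, rng=random):
    """Crop every clip to the same length and drop the ones that are too short."""
    cropped = (crop_or_drop(clip, sample_rate, seconds, rng) for clip in clips)
    return [clip for clip in cropped if clip is not None]


if __name__ == "__main__":
    # Toy data at a made-up sample rate of 10 Hz, so 3 seconds equals 30 samples.
    clips = [list(range(10)), list(range(30)), list(range(50))]
    batch = make_batch(clips, sample_rate=10, seconds=3.0, rng=random.Random(0))
    print([len(clip) for clip in batch])
```

Because every surviving clip has the same length, the batch can be stacked directly without padding or a custom collate function.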

If you have any more questions, feel free to ask.

nmcuong

about 18 hours ago

Thank you for sharing this useful information. I will experiment with it in the near future, and I would appreciate your guidance if I encounter any difficulties.

Thank you very much.
