fine-tune NeuCodec
Could you please guide me on how to fine-tune the NeuCodec model? If possible, I would like to fully fine-tune the model, from the encoder to the decoder. Thank you.
Hi, sorry I was down with the flu for the past week.
In essence, NeuCodec is just X-Codec-2.0, so you can use the X-Codec-2.0 codebase with minimal adjustments (compare its lightning_module.py with NeuCodec's neucodec/model.py).
The important thing to know is that some parts of the model are only used during training, namely:
- fc_post_s
- SemanticDecoder_module
- model_spk
- discriminator
- spec_discriminator
However, Neuphonic actually ships the SemanticDecoder_module and fc_post_s in their pytorch_model.bin (look at NeuCodec's from_pretrained method, where these modules are filtered out when loading the model). The remaining three (model_spk, discriminator and spec_discriminator) have to be initialized randomly.
Since I'm not that familiar with PyTorch Lightning and wanted to be able to prototype quickly, I adapted the X-Codec-2.0 code so that I can use regular PyTorch. I'm quite certain you could also just use the X-Codec-2.0 code as-is, though. For this, you would probably have to adjust lightning_module.py so that it doesn't construct a fresh model (see construct_model()) and instead only constructs the parts that are missing and loads the rest from pytorch_model.bin, like in the from_pretrained method mentioned above.
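To make the "construct only what is missing" idea concrete, here is a minimal sketch that checks which training-only modules have no weights in a checkpoint. The module names are the ones listed above; the assumption that checkpoint keys start with the module name followed by a dot (e.g. "fc_post_s.weight") is mine and should be verified against the actual pytorch_model.bin. The helper name is hypothetical.

```python
from typing import Iterable, List, Sequence

# Training-only modules as listed in this thread.
TRAINING_ONLY_MODULES = (
    "fc_post_s",
    "SemanticDecoder_module",
    "model_spk",
    "discriminator",
    "spec_discriminator",
)


def missing_training_modules(
    checkpoint_keys: Iterable[str],
    training_only: Sequence[str] = TRAINING_ONLY_MODULES,
) -> List[str]:
    """Return the training-only modules that have no parameters in the checkpoint.

    Assumption: every key looks like "<top_level_module>.<rest>".
    """
    present = {key.split(".", 1)[0] for key in checkpoint_keys}
    return [name for name in training_only if name not in present]


# Toy key list standing in for the keys of a loaded state dict.
example_keys = [
    "fc_post_s.weight",
    "SemanticDecoder_module.conv.weight",
    "some_other_module.layer.weight",  # hypothetical key, for illustration only
]
to_initialize = missing_training_modules(example_keys)
print(to_initialize)
```

Whatever ends up in the returned list has to be constructed fresh. The rest can be loaded from the checkpoint, for example with PyTorch's load_state_dict(..., strict=False), which tolerates the keys that are not in the file.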
For training itself, I used the default hyperparameters given by X-Codec-2.0. However, you need to change one very important thing: since the decoder operates at 24000 Hz, the STFT_PARAMS for the spec_discriminator have to be adjusted:
# If you use Lightning, you will probably adjust these values in the config/model/default.yaml file.
STFT_PARAMS = {
    "fft_sizes": [117, 189, 309, 501, 813, 1314, 2127, 3444],
    "hop_sizes": [58, 94, 154, 250, 406, 657, 1064, 1722],
    "win_lengths": [117, 189, 309, 501, 813, 1314, 2127, 3444],
    "window": "hann_window",
}
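As a quick sanity check before plugging these values into a config, the following sketch verifies that the three lists line up and prints how long each analysis window is at 24000 Hz. The function names are hypothetical. The win <= fft constraint matches what torch.stft requires, while the hop <= win check is just my own assumption about what a sensible setting looks like.

```python
from typing import Dict, List

SAMPLE_RATE = 24_000  # decoder sample rate mentioned above

STFT_PARAMS = {
    "fft_sizes": [117, 189, 309, 501, 813, 1314, 2127, 3444],
    "hop_sizes": [58, 94, 154, 250, 406, 657, 1064, 1722],
    "win_lengths": [117, 189, 309, 501, 813, 1314, 2127, 3444],
    "window": "hann_window",
}


def check_stft_params(params: Dict) -> None:
    """Raise ValueError if the resolution lists are inconsistent."""
    ffts, hops, wins = params["fft_sizes"], params["hop_sizes"], params["win_lengths"]
    if not (len(ffts) == len(hops) == len(wins)):
        raise ValueError("fft_sizes, hop_sizes and win_lengths must have equal length")
    for fft, hop, win in zip(ffts, hops, wins):
        if win > fft:
            raise ValueError(f"win_length {win} exceeds fft_size {fft}")
        if not 0 < hop <= win:
            # Not a hard requirement, only a plausibility check.
            raise ValueError(f"hop_size {hop} is not in (0, {win}]")


def window_durations_ms(params: Dict, sample_rate: int) -> List[float]:
    """Duration of each analysis window in milliseconds."""
    return [1000.0 * win / sample_rate for win in params["win_lengths"]]


check_stft_params(STFT_PARAMS)
durations = window_durations_ms(STFT_PARAMS, SAMPLE_RATE)
print([round(d, 3) for d in durations])
```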
I used the German and English subsets of Emilia-YODAS, which I have in WebDataset format. For batching to work, I found it easiest to just crop all audio clips to a fixed length, e.g. 3 seconds (and drop the ones that are shorter if necessary).
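The crop-or-drop step could look roughly like the sketch below. It works on any 1-D sequence that supports len() and slicing (a list, a NumPy array, a 1-D tensor). Taking the crop at a random offset is my assumption, the post only says the clips are cropped; the function name is hypothetical as well.

```python
import random
from typing import Optional, Sequence


def crop_fixed_length(
    samples: Sequence,
    sample_rate: int,
    seconds: float = 3.0,
    rng: Optional[random.Random] = None,
):
    """Return a random crop of `seconds` length, or None if the clip is too short."""
    target = int(round(sample_rate * seconds))
    if len(samples) < target:
        return None  # caller drops this clip
    rng = rng or random.Random()
    start = rng.randint(0, len(samples) - target)  # inclusive on both ends
    return samples[start:start + target]


# Toy example: a 10 Hz "signal" so the numbers stay readable.
long_clip = list(range(100))   # 10 seconds at 10 Hz
short_clip = list(range(20))   # 2 seconds at 10 Hz

cropped = crop_fixed_length(long_clip, sample_rate=10, rng=random.Random(0))
dropped = crop_fixed_length(short_clip, sample_rate=10)
print(len(cropped), dropped)
```

Since every surviving clip has the same number of samples, the default collate function can stack them into a batch without padding.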
If you have any more questions, feel free to ask.
Thank you for sharing this useful information. I will experiment with it in the near future, and I would appreciate your guidance if I encounter any difficulties.
Thank you very much.