How to use answerdotai/ModernBERT-base with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("fill-mask", model="answerdotai/ModernBERT-base")

# Load model directly
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
```
Micro-batch size of 96
Hello,
Thanks for the nice work; it is a really impressive model.
I've implemented a training script using accelerate and H100 cards (94GB version). Everything is working well, even the batch-size warmup.
However, I am saturating my GPUs with a micro-batch size of 48 sequences. In the ModernBERT paper, I see that you set the micro-batch size to 96. What am I missing? I am using FA2 like you. Even though I am not packing unpadded sequences as you do in the original paper, there is no way I can fit 96*1024 tokens in a micro-batch...
Julien
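For context, the batch-size warmup mentioned above can be sketched roughly like this (a hypothetical linear schedule; the paper's actual schedule and numbers may differ):

```python
def micro_batch_size(step, start=8, target=96, warmup_steps=1000):
    """Hypothetical linear batch-size warmup from `start` to `target`.

    All numbers here are illustrative, not the paper's actual values.
    """
    if step >= warmup_steps:
        return target
    return start + (target - start) * step // warmup_steps

print(micro_batch_size(0), micro_batch_size(500), micro_batch_size(1000))  # 8 52 96
```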
Update!
I was able to increase my micro-batch size to 88 by using gradient checkpointing.
I'll get back to you once I am able to squeeze the last 8 onto the GPU.
Julien
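For anyone following along, enabling gradient checkpointing in Transformers is a one-liner (a sketch; the `use_reentrant=False` kwarg assumes a recent Transformers/PyTorch version):

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")
# Recompute activations during the backward pass instead of storing them,
# trading extra compute for a smaller activation-memory footprint.
model.gradient_checkpointing_enable(
    gradient_checkpointing_kwargs={"use_reentrant": False}
)
```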
Using ZeRO-2, I am able to fit 96 sequences in one GPU. I could combine it with gradient checkpointing to fit even more sequences.
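For reference, ZeRO stage 2 shards optimizer states and gradients across data-parallel ranks, which frees per-GPU memory for activations. A minimal DeepSpeed config along these lines might look like this (illustrative values only, to be passed to accelerate as a DeepSpeed config file):

```json
{
  "train_micro_batch_size_per_gpu": 96,
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```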
Closing the discussion.