# Qwen3-32B-R2EGYM-256-3epochs

This model is a reinforcement-learning fine-tune of Qwen/Qwen3-32B, trained with the SkyRL framework using fully asynchronous PPO on coding and reasoning tasks from the R2EGYM benchmark.

## Training Details

### Framework

- **Training framework:** SkyRL (fully asynchronous PPO)
- **Parallelism strategy:** FSDP2 with CPU offload
- **Agent:** Terminus-2 (terminal-based coding agent with thinking enabled)
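"Fully asynchronous" here means rollout generation is decoupled from policy updates: inference engines keep sampling trajectories while the trainer consumes them from a buffer. A minimal producer/consumer sketch of that pattern (all names hypothetical, not SkyRL's actual API):

```python
import queue
import threading

# Hypothetical sketch: a rollout worker generates trajectories continuously
# while the trainer consumes them, so generation never blocks on updates.
rollout_queue = queue.Queue(maxsize=8)

def rollout_worker(n_trajectories):
    # Stand-in for inference engines sampling from the current policy.
    for i in range(n_trajectories):
        trajectory = {"id": i, "reward": float(i % 2)}  # dummy rollout
        rollout_queue.put(trajectory)
    rollout_queue.put(None)  # sentinel: no more rollouts

def trainer(batch_size):
    batch, updates = [], 0
    while True:
        item = rollout_queue.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            updates += 1  # stand-in for one PPO update on the batch
            batch.clear()
    return updates

producer = threading.Thread(target=rollout_worker, args=(12,))
producer.start()
num_updates = trainer(batch_size=4)
producer.join()
print(num_updates)  # 3 updates from 12 trajectories
```

In the real system the "queue" spans machines (26 inference engines feeding the policy nodes), but the decoupling is the same idea.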

### Dataset

### Hyperparameters

| Parameter | Value |
|---|---|
| Epochs | 3 |
| Total steps | 12 (4 per epoch) |
| Learning rate | 1e-5 |
| Weight decay | 0.0 |
| Train batch size | 64 |
| Micro train batch size per GPU | 1 |
| Advantage estimator | `rloo_n` |
| KL loss | disabled |
| Samples per prompt | 8 |
| Max prompt length | 2,048 tokens |
| Max generation length | 30,720 tokens |
| RoPE scaling | YaRN (factor = 4.0, original_max_position_embeddings = 32,768) |
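`rloo_n` refers to the REINFORCE leave-one-out advantage estimator: with 8 samples per prompt, each trajectory's baseline is the mean reward of the other 7 samples for the same prompt, so no learned value function is needed. A minimal sketch of that computation (illustrative only, not SkyRL's actual code):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for one prompt's group of sampled trajectories.

    Each sample's baseline is the mean reward of the *other* samples in the
    group, which keeps the gradient estimator unbiased without a critic.
    """
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# 8 samples per prompt, as in the training config above; rewards are
# binary task success here for illustration.
advs = rloo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
print([round(a, 4) for a in advs])
```

Successful samples get a positive advantage and failed ones a negative advantage whose magnitudes depend on the group's success rate.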

### Infrastructure

| Component | Configuration |
|---|---|
| Policy nodes | 4 nodes × 4 GPUs |
| Reference model nodes | 4 nodes × 4 GPUs |
| Inference engines | 26 (tensor parallelism = 2) |
| Parallel generation workers | 96 |
| Concurrent sandbox trials | 96 |
| Total training nodes | 17 |

## Training Notes

- Training was resumed from a step-9 checkpoint.
- The model uses Terminus-2, a terminal-based coding agent that interacts with sandboxed Docker environments to solve programming tasks.
- Thinking mode was enabled during training (`--enable_thinking`).
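For context, a terminal-based coding agent of this kind runs a loop: the model proposes a shell command, the sandbox executes it, and the output is fed back as the next observation until the task is solved or a step budget is exhausted; the final outcome becomes the RL reward. A toy sketch with a fake sandbox (everything below is hypothetical, not Terminus-2's actual interface):

```python
class FakeSandbox:
    """Stand-in for a Docker sandbox: executes a command, returns its output."""

    def __init__(self):
        self.patched = False

    def run(self, command):
        if command == "pytest":
            return "1 passed" if self.patched else "1 failed"
        if command.startswith("apply_patch"):
            self.patched = True
            return "patch applied"
        return ""

def fake_policy(observation):
    """Stand-in for the model: patch on failure, otherwise re-run the tests."""
    if "failed" in observation:
        return "apply_patch fix.diff"
    return "pytest"

def agent_loop(sandbox, policy, max_steps=8):
    observation = sandbox.run("pytest")  # initial test run
    for _ in range(max_steps):
        action = policy(observation)     # model proposes a command
        observation = sandbox.run(action)  # sandbox executes it
        if "passed" in observation:
            return 1.0  # task solved: terminal reward for RL
    return 0.0  # step budget exhausted

reward = agent_loop(FakeSandbox(), fake_policy)
print(reward)  # 1.0
```

During training, 96 such sandbox trials ran concurrently, each feeding a trajectory and its terminal reward back to the trainer.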

## Usage

This model can be used as a drop-in replacement for Qwen3-32B with improved coding and reasoning capabilities.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "laion/Qwen3-32B-R2EGYM-256-3epochs",
    torch_dtype="auto",  # checkpoint is stored in BF16
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("laion/Qwen3-32B-R2EGYM-256-3epochs")
```
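Training used YaRN RoPE scaling (factor 4.0), so if you want the same context extension at inference time, the usual mechanism is a `rope_scaling` entry in the model's `config.json`. A fragment mirroring the training values above, assuming the Hugging Face `rope_scaling` schema that Qwen3 uses (verify the exact key names against your serving stack's documentation):

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```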