# Qwen3-32B-R2EGYM-256-3epochs
This model is a reinforcement learning fine-tuned version of Qwen/Qwen3-32B, trained using the SkyRL framework with fully asynchronous PPO on coding and reasoning tasks from the R2EGYM benchmark.
## Training Details

### Framework
- Training Framework: SkyRL (fully async PPO)
- Parallelism Strategy: FSDP2 with CPU offload
- Agent: Terminus-2 (terminal-based coding agent with thinking enabled)
### Dataset
- Dataset: penfever/r2egym_gpt5_codex_solved_tasks_256_subset
- Number of tasks: 256
- Evaluation set: OpenThoughts-TB-dev (70 tasks)
### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 3 |
| Total steps | 12 (4 steps/epoch) |
| Learning rate | 1e-5 |
| Weight decay | 0.0 |
| Train batch size | 64 |
| Micro train batch size per GPU | 1 |
| Advantage estimator | rloo_n |
| KL loss | disabled |
| Samples per prompt | 8 |
| Max prompt length | 2,048 |
| Max generate length | 30,720 |
| RoPE scaling | yarn (factor=4.0, original_max_position_embeddings=32,768) |
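The `rloo_n` advantage estimator pairs naturally with the 8 samples drawn per prompt: each sample's baseline is the mean reward of the *other* samples for the same prompt (leave-one-out). A minimal sketch of the idea follows; this is an illustration of the RLOO computation, not the SkyRL implementation, and the function name is hypothetical:

```python
def rloo_advantages(rewards):
    """Leave-one-out (RLOO) advantages for one prompt's group of rollouts.

    baseline_i = mean of the other samples' rewards = (total - r_i) / (n - 1)
    advantage_i = r_i - baseline_i
    """
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# e.g. 8 rollouts of one coding task, reward 1.0 when the patch passes tests
advs = rloo_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
```

Group advantages computed this way always sum to zero, so no separate value network or KL baseline is needed (consistent with KL loss being disabled above).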
### Infrastructure
| Component | Configuration |
|---|---|
| Policy nodes | 4 nodes x 4 GPUs |
| Reference model nodes | 4 nodes x 4 GPUs |
| Inference engines | 26 (tensor parallelism = 2) |
| Parallel generation workers | 96 |
| Concurrent sandbox trials | 96 |
| Total training nodes | 17 |
### Training Notes
- Training was resumed from a step-9 checkpoint
- The model uses Terminus-2, a terminal-based coding agent that interacts with sandboxed Docker environments to solve programming tasks
- Thinking mode was enabled during training (`--enable_thinking`)
## Usage
This model can be used as a drop-in replacement for Qwen3-32B with improved coding and reasoning capabilities.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# torch_dtype="auto" loads the checkpoint's native precision; device_map="auto"
# shards the 32B weights across available GPUs
model = AutoModelForCausalLM.from_pretrained("laion/Qwen3-32B-R2EGYM-256-3epochs", torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("laion/Qwen3-32B-R2EGYM-256-3epochs")
```
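For contexts longer than the base model's native window, the yarn RoPE scaling used in training (factor 4.0 over 32,768 original positions) would correspond to a `rope_scaling` entry in the model's `config.json` roughly like the following. This is a sketch of the standard transformers yarn schema; verify the exact keys against the shipped config:

```json
{
  "rope_scaling": {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```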