IvanHU committed
Commit ab79e7d · verified · 1 Parent(s): 5575579

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +8 -0
  2. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/8k-100.sh +75 -0
  3. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/config.json +56 -0
  4. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_1B.json +29 -0
  5. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_340M.json +26 -0
  6. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_1B.json +22 -0
  7. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_340M.json +22 -0
  8. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_1_340M.json +50 -0
  9. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_1_340M_bf16.json +50 -0
  10. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_nsa_1_340M.json +53 -0
  11. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_340M.json +24 -0
  12. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_7B.json +25 -0
  13. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gsa_340M.json +29 -0
  14. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/hgrn2_340M.json +20 -0
  15. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_1B.json +32 -0
  16. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_340M.json +32 -0
  17. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_6_1_340M.json +50 -0
  18. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_1B.json +30 -0
  19. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_340M.json +30 -0
  20. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/samba_1B.json +52 -0
  21. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/sba_340m.json +18 -0
  22. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_1B.json +22 -0
  23. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_340M.json +18 -0
  24. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_7B.json +21 -0
  25. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/generation_config.json +6 -0
  26. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/0/error.json +1 -0
  27. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/0/stderr.log +573 -0
  28. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/0/stdout.log +0 -0
  29. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/1/error.json +1 -0
  30. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/1/stderr.log +571 -0
  31. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/1/stdout.log +0 -0
  32. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/2/error.json +1 -0
  33. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/2/stderr.log +571 -0
  34. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/2/stdout.log +0 -0
  35. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/3/error.json +1 -0
  36. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/3/stderr.log +571 -0
  37. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/3/stdout.log +0 -0
  38. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/4/error.json +1 -0
  39. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/4/stderr.log +571 -0
  40. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/4/stdout.log +0 -0
  41. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/5/error.json +1 -0
  42. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/5/stderr.log +571 -0
  43. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/5/stdout.log +0 -0
  44. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/6/error.json +1 -0
  45. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/6/stderr.log +571 -0
  46. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/6/stdout.log +0 -0
  47. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/7/error.json +1 -0
  48. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/7/stderr.log +571 -0
  49. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/7/stdout.log +0 -0
  50. bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/0/stderr.log +3 -0
.gitattributes CHANGED
@@ -41,3 +41,11 @@ bf16-gdn_6_nsa_1_340M.json-ctx65536-steps95366-lr3e-4-decay_typelinear-decay_rat
  bf16-gdn_6_nsa_1_340M.json-ctx65536-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs1-nn1-gas2/logs/none_g0y0s4gd/attempt_0/5/stderr.log filter=lfs diff=lfs merge=lfs -text
  bf16-gdn_6_nsa_1_340M.json-ctx65536-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs1-nn1-gas2/logs/none_g0y0s4gd/attempt_0/6/stderr.log filter=lfs diff=lfs merge=lfs -text
  bf16-gdn_6_nsa_1_340M.json-ctx65536-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs1-nn1-gas2/logs/none_g0y0s4gd/attempt_0/7/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/0/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/1/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/2/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/3/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/4/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/5/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/6/stderr.log filter=lfs diff=lfs merge=lfs -text
+ bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/7/stderr.log filter=lfs diff=lfs merge=lfs -text
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/8k-100.sh ADDED
@@ -0,0 +1,75 @@
+ FLAME_PATH=/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame
+ DATASET_ROOT=/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset
+ TOKENIZER=/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer
+
+ cd $FLAME_PATH
+ source .venv/bin/activate
+
+ # =========== train config ===========
+ CONFIG=${1:-transformer_340M.json}
+ SEQ_LEN=8192
+ WARMUP_STEPS=100
+ STEPS=95366
+ LR=3e-4
+ BATCH_SIZE=8
+ GAS=2
+ DECAY_TYPE=linear
+ DECAY_RATIO=1
+ NNODE=1
+ NGPU=8
+ LOG_RANK=0
+ EXTRA_ARGS="--training.mixed_precision_param bfloat16"
+ EXTRA_NAME="bf16"
+ # ====================================
+
+ # if jq command is not found, install it
+ if ! command -v jq &> /dev/null; then
+     echo "jq could not be found, installing it..."
+     sudo yum install -y jq
+ fi
+
+ export WANDB_ERROR_REPORTING=False
+
+ if [ -n "$EXTRA_NAME" ]; then
+     EXTRA_NAME="${EXTRA_NAME}-"
+ fi
+
+ EXP_NAME=${EXTRA_NAME}$(basename $CONFIG | sed 's/\.config//')-ctx${SEQ_LEN}-steps${STEPS}-lr${LR}-decay_type${DECAY_TYPE}-decay_ratio${DECAY_RATIO}-bs${BATCH_SIZE}-nn${NNODE}-gas${GAS}
+
+ bash train.sh \
+     --job.config_file flame/models/fla.toml \
+     --job.dump_folder $FLAME_PATH/exp/$EXP_NAME \
+     --model.config $FLAME_PATH/configs/$CONFIG \
+     --model.tokenizer_path $TOKENIZER \
+     --optimizer.name AdamW \
+     --optimizer.eps 1e-8 \
+     --optimizer.lr $LR \
+     --lr_scheduler.warmup_steps $WARMUP_STEPS \
+     --lr_scheduler.lr_min 0.01 \
+     --lr_scheduler.decay_type $DECAY_TYPE \
+     --lr_scheduler.decay_ratio $DECAY_RATIO \
+     --training.batch_size $BATCH_SIZE \
+     --training.seq_len $SEQ_LEN \
+     --training.context_len $SEQ_LEN \
+     --training.gradient_accumulation_steps $GAS \
+     --training.steps $STEPS \
+     --training.max_norm 1.0 \
+     --training.skip_nan_inf \
+     --training.dataset $DATASET_ROOT/fineweb-edu-sample,$DATASET_ROOT/small_repos_20B_sample_merged,$DATASET_ROOT/megamath-web-pro \
+     --training.data_probs 0.55,0.3,0.15 \
+     --training.dataset_split train,train,train \
+     --training.dataset_name default,default,default \
+     --training.streaming \
+     --training.num_workers 32 \
+     --training.prefetch_factor 2 \
+     --training.seed 42 \
+     --training.compile \
+     --checkpoint.interval 8192 \
+     --checkpoint.load_step -1 \
+     --checkpoint.keep_latest_k 100 \
+     --checkpoint.export_dtype bfloat16 \
+     --metrics.log_freq 1 \
+     --metrics.enable_tensorboard \
+     --training.streaming \
+     ${EXTRA_ARGS}
+
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "allow_neg_eigval": false,
+   "architectures": [
+     "GatedDeltaNetForCausalLM"
+   ],
+   "attn": {
+     "block_counts": 16,
+     "block_size": 64,
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 32,
+     "num_kv_heads": 2,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "type": "nsa",
+     "window_size": 512
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "max_position_embeddings": 8192,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "norm_first": false,
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "num_v_heads": null,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.53.3",
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": true,
+   "use_l2warp": false,
+   "use_output_norm": true,
+   "use_short_conv": true,
+   "vocab_size": 32000
+ }
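The config.json above describes a hybrid stack: a 24-layer GatedDeltaNet backbone whose "attn" block substitutes NSA sparse-attention layers at indices 5, 11, 17 and 23. A minimal loading sketch, assuming the flash-linear-attention (fla) package is installed and that importing it registers the gated_deltanet model type with the transformers Auto* classes; the local checkpoint path is a placeholder:

    # Minimal sketch; the `fla` registration behavior and the local path are assumptions.
    from transformers import AutoConfig, AutoModelForCausalLM

    import fla  # noqa: F401  (assumed to register "gated_deltanet" with the Auto* classes)

    ckpt = "bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2"
    config = AutoConfig.from_pretrained(ckpt)         # parses the config.json shown above
    model = AutoModelForCausalLM.from_config(config)  # builds the hybrid GDN/NSA model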
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_1B.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "attn": null,
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "delta_net",
+   "norm_eps": 1e-06,
+   "num_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 2,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": false,
+   "use_output_norm": true,
+   "use_short_conv": true
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/delta_net_340M.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "delta_net",
+   "norm_eps": 1e-06,
+   "num_heads": 8,
+   "num_hidden_layers": 24,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": false,
+   "use_output_norm": true,
+   "use_short_conv": true
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_1B.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_v": 2,
+   "fuse_cross_entropy": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "num_heads": 6,
+   "num_hidden_layers": 21,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gate": true,
+   "use_short_conv": true
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gated_deltanet_340M.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_v": 2,
+   "fuse_cross_entropy": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "num_heads": 6,
+   "num_hidden_layers": 21,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gate": true,
+   "use_short_conv": true
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_1_340M.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "architectures": [
+     "GatedDeltaNetForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "max_position_embeddings": 8192,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "norm_first": false,
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": true,
+   "use_output_norm": true,
+   "use_short_conv": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_1_340M_bf16.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "architectures": [
+     "GatedDeltaNetForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "max_position_embeddings": 8192,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "norm_first": false,
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.51.3",
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": true,
+   "use_output_norm": true,
+   "use_short_conv": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gdn_6_nsa_1_340M.json ADDED
@@ -0,0 +1,53 @@
+ {
+   "architectures": [
+     "GatedDeltaNetForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 32,
+     "num_kv_heads": 2,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "type": "nsa",
+     "block_size": 64,
+     "block_counts": 16,
+     "window_size": 512
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 256,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "max_position_embeddings": 8192,
+   "model_type": "gated_deltanet",
+   "norm_eps": 1e-06,
+   "norm_first": false,
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "qk_activation": "silu",
+   "qk_norm": "l2",
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.51.3",
+   "use_beta": true,
+   "use_cache": true,
+   "use_gate": true,
+   "use_output_norm": true,
+   "use_short_conv": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_340M.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "clamp_min": null,
+   "eos_token_id": 2,
+   "expand_k": 0.5,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gla",
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gk": true,
+   "use_gv": false,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gla_7B.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "attn": null,
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "expand_k": 0.5,
+   "expand_v": 1,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 11008,
+   "model_type": "gla",
+   "norm_eps": 1e-06,
+   "num_heads": 16,
+   "num_hidden_layers": 32,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_gk": true,
+   "use_gv": false,
+   "use_output_gate": true,
+   "use_short_conv": false
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/gsa_340M.json ADDED
@@ -0,0 +1,29 @@
+ {
+   "bos_token_id": 1,
+   "conv_size": 4,
+   "eos_token_id": 2,
+   "expand_k": 1,
+   "expand_v": 1,
+   "elementwise_affine": false,
+   "feature_map": "swish",
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "gate_logit_normalizer": 4,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "gsa",
+   "num_heads": 4,
+   "num_hidden_layers": 24,
+   "num_slots": 64,
+   "norm_eps": 1e-06,
+   "share_conv_kernel": true,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "use_norm": true,
+   "use_output_gate": true,
+   "use_rope": false,
+   "use_short_conv": false
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/hgrn2_340M.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "expand_ratio": 128,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "model_type": "hgrn2",
+   "num_heads": 8,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_1B.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "norm_eps": 1e-05,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_340M.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "norm_eps": 1e-05,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba2_6_1_340M.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "architectures": [
+     "Mamba2ForCausalLM"
+   ],
+   "attn": {
+     "layers": [
+       5,
+       11,
+       17,
+       23
+     ],
+     "num_heads": 16,
+     "num_kv_heads": 8,
+     "qkv_bias": false,
+     "rope_theta": 160000.0,
+     "window_size": null
+   },
+   "attn_mode": "chunk",
+   "bos_token_id": 1,
+   "chunk_size": 256,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "head_dim": 64,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "norm_eps": 1e-05,
+   "model_type": "mamba2",
+   "n_groups": 1,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": true,
+   "residual_in_fp32": true,
+   "rms_norm": true,
+   "state_size": 128,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_1B.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token_id": 1,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "silu",
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "model_type": "mamba",
+   "norm_eps": 1e-05,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": false,
+   "residual_in_fp32": false,
+   "state_size": 16,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_init_scheme": "random",
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "time_step_scale": 1.0,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/mamba_340M.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token_id": 1,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "model_type": "mamba",
+   "norm_eps": 1e-05,
+   "num_hidden_layers": 48,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": false,
+   "residual_in_fp32": false,
+   "state_size": 16,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_init_scheme": "random",
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 128,
+   "time_step_scale": 1.0,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/samba_1B.json ADDED
@@ -0,0 +1,52 @@
+ {
+   "attn": {
+     "layers": [
+       1,
+       3,
+       5,
+       7,
+       9,
+       11,
+       13,
+       15,
+       17
+     ],
+     "num_heads": 18,
+     "num_kv_heads": 18,
+     "qkv_bias": false,
+     "rope_theta": 10000.0,
+     "window_size": 2048
+   },
+   "bos_token_id": 1,
+   "conv_kernel": 4,
+   "eos_token_id": 2,
+   "expand": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2304,
+   "initializer_range": 0.02,
+   "intermediate_size": 4608,
+   "max_position_embeddings": 2048,
+   "model_type": "samba",
+   "norm_eps": 1e-05,
+   "num_hidden_layers": 18,
+   "pad_token_id": 0,
+   "rescale_prenorm_residual": false,
+   "residual_in_fp32": false,
+   "state_size": 16,
+   "tie_word_embeddings": false,
+   "time_step_floor": 0.0001,
+   "time_step_init_scheme": "random",
+   "time_step_max": 0.1,
+   "time_step_min": 0.001,
+   "time_step_rank": 144,
+   "time_step_scale": 1.0,
+   "transformers_version": "4.50.1",
+   "use_bias": false,
+   "use_cache": true,
+   "use_conv_bias": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/sba_340m.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "attention_bias": false,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_size": 1024,
+   "initializer_range": 0.006,
+   "max_position_embeddings": 8192,
+   "model_type": "sba",
+   "num_heads": 16,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_1B.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "bos_token_id": 1,
+   "elementwise_affine": true,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "fuse_swiglu": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 2048,
+   "initializer_range": 0.02,
+   "intermediate_size": null,
+   "max_position_embeddings": 8192,
+   "model_type": "transformer",
+   "norm_eps": 1e-06,
+   "num_heads": 32,
+   "num_hidden_layers": 24,
+   "num_kv_heads": null,
+   "pad_token_id": 2,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": false
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_340M.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "attention_bias": false,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 8192,
+   "model_type": "transformer",
+   "num_heads": 16,
+   "num_hidden_layers": 24,
+   "norm_eps": 1e-06,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "vocab_size": 32000
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/configs/transformer_7B.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "attention_bias": false,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "fuse_cross_entropy": true,
+   "fuse_norm": true,
+   "hidden_act": "swish",
+   "hidden_ratio": 4,
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "model_type": "transformer",
+   "norm_eps": 1e-06,
+   "num_heads": 32,
+   "num_hidden_layers": 32,
+   "num_kv_heads": 8,
+   "rope_theta": 10000.0,
+   "tie_word_embeddings": false,
+   "use_cache": true,
+   "window_size": null
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.53.3"
+ }
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/0/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352474"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/0/stderr.log ADDED
@@ -0,0 +1,573 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [titan] 2025-07-24 18:19:03,237 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-24 18:19:03,237 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "bfloat16",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-24 18:19:03,238 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-24 18:19:03,238 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-24 18:19:03,244 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-24 18:19:03,587 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-24 18:19:03,587 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-24 18:19:03,588 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-24 18:19:04,524 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-24 18:19:04,641 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-24 18:19:04,642 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ [titan] 2025-07-24 18:19:04,833 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
147
+ IterableDataset({
148
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
149
+ num_shards: 140
150
+ })
151
+ [titan] 2025-07-24 18:19:04,833 - root - INFO - Shuffling the dataset with seed 42
152
+ [titan] 2025-07-24 18:19:04,833 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
153
+ [titan] 2025-07-24 18:19:55,836 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
154
+ IterableDataset({
155
+ features: ['repo', 'content'],
156
+ num_shards: 1
157
+ })
158
+ [titan] 2025-07-24 18:19:55,836 - root - INFO - Shuffling the dataset with seed 42
159
+ [titan] 2025-07-24 18:19:55,836 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
160
+ [titan] 2025-07-24 18:19:56,175 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
161
+ IterableDataset({
162
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
163
+ num_shards: 100
164
+ })
165
+ [titan] 2025-07-24 18:19:56,175 - root - INFO - Shuffling the dataset with seed 42
166
+ [titan] 2025-07-24 18:19:56,175 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
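All three warnings above apply the same rule: with the 1-D device mesh of 8 dp_shard ranks reported earlier and num_workers = 32 from the config, every (rank, worker) pair needs its own shard. A minimal sketch of that arithmetic (the helper name is hypothetical, not flame's actual API):

def required_shards(dp_degree: int, num_workers: int) -> int:
    # One shard per (data-parallel rank, dataloader worker) pair.
    return dp_degree * num_workers

# 8 dp_shard ranks x 32 workers = 256, the minimum quoted in the warnings;
# fineweb-edu (140), small_repos (1) and megamath (100) all fall short.
assert required_shards(8, 32) == 256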
167
+ [titan] 2025-07-24 18:20:02,301 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
168
+ [titan] 2025-07-24 18:20:02,989 - root - INFO - IterableDataset({
169
+ features: ['text', 'content'],
170
+ num_shards: 256
171
+ })
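The interleaving step reported just above is consistent with 🤗 datasets' interleave_datasets (an assumption about flame's internals); a self-contained toy version using the logged probabilities:

from datasets import Dataset, interleave_datasets

# Toy stand-ins for the three subsets; a real run passes the streaming datasets above.
subsets = [
    Dataset.from_dict({"text": [name] * 8}).to_iterable_dataset(num_shards=4)
    for name in ("fineweb", "code", "math")
]
mixed = interleave_datasets(subsets, probabilities=[0.55, 0.30, 0.15], seed=42)
print(next(iter(mixed)))  # draws from the subsets in roughly 0.55/0.30/0.15 proportion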
172
+ [titan] 2025-07-24 18:20:03,108 - root - INFO - Building dataloader...
173
+ [titan] 2025-07-24 18:20:03,110 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
174
+ [titan] 2025-07-24 18:20:03,112 - root - INFO - Building model from the config
175
+ GatedDeltaNetConfig {
176
+ "allow_neg_eigval": false,
177
+ "architectures": [
178
+ "GatedDeltaNetForCausalLM"
179
+ ],
180
+ "attn": {
181
+ "block_counts": 16,
182
+ "block_size": 64,
183
+ "layers": [
184
+ 5,
185
+ 11,
186
+ 17,
187
+ 23
188
+ ],
189
+ "num_heads": 32,
190
+ "num_kv_heads": 2,
191
+ "qkv_bias": false,
192
+ "rope_theta": 160000.0,
193
+ "type": "nsa",
194
+ "window_size": null
195
+ },
196
+ "attn_mode": "chunk",
197
+ "bos_token_id": 1,
198
+ "conv_size": 4,
199
+ "eos_token_id": 2,
200
+ "expand_k": 1,
201
+ "expand_v": 1,
202
+ "fuse_cross_entropy": true,
203
+ "fuse_norm": true,
204
+ "fuse_swiglu": true,
205
+ "head_dim": 256,
206
+ "hidden_act": "swish",
207
+ "hidden_ratio": 4,
208
+ "hidden_size": 1024,
209
+ "initializer_range": 0.02,
210
+ "intermediate_size": null,
211
+ "max_position_embeddings": 8192,
212
+ "model_type": "gated_deltanet",
213
+ "norm_eps": 1e-06,
214
+ "norm_first": false,
215
+ "num_heads": 4,
216
+ "num_hidden_layers": 24,
217
+ "num_v_heads": null,
218
+ "qk_activation": "silu",
219
+ "qk_norm": "l2",
220
+ "tie_word_embeddings": false,
221
+ "torch_dtype": "bfloat16",
222
+ "transformers_version": "4.53.3",
223
+ "use_beta": true,
224
+ "use_cache": true,
225
+ "use_gate": true,
226
+ "use_l2warp": false,
227
+ "use_output_norm": true,
228
+ "use_short_conv": true,
229
+ "vocab_size": 32000
230
+ }
231
+
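The attn.layers list in the config above places full attention on a fixed cadence; a quick check of the pattern (reading the run name this way is an assumption):

# Layers 5, 11, 17, 23 out of 24: every 6th block is NSA, the other five are
# GatedDeltaNet -- presumably what the "gdn_6_nsa_1" run name encodes.
nsa_layers = [i for i in range(24) if i % 6 == 5]
assert nsa_layers == [5, 11, 17, 23]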
232
+ [titan] 2025-07-24 18:20:03,455 - root - INFO -
233
+ GatedDeltaNetForCausalLM(
234
+ (model): GatedDeltaNetModel(
235
+ (embeddings): Embedding(32000, 1024)
236
+ (layers): ModuleList(
237
+ (0-4): 5 x GatedDeltaNetBlock(
238
+ (attn_norm): RMSNorm(1024, eps=1e-06)
239
+ (attn): GatedDeltaNet(
240
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
241
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
242
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
243
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
244
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
245
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
246
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
247
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
248
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
249
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
250
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
251
+ )
252
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
253
+ (mlp): GatedMLP(
254
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
255
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
256
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
257
+ (swiglu_linear): SwiGLULinear()
258
+ )
259
+ )
260
+ (5): GatedDeltaNetBlock(
261
+ (attn_norm): RMSNorm(1024, eps=1e-06)
262
+ (attn): NativeSparseAttention(
263
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
264
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
265
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
266
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
267
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
268
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
269
+ )
270
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
271
+ (mlp): GatedMLP(
272
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
273
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
274
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
275
+ (swiglu_linear): SwiGLULinear()
276
+ )
277
+ )
278
+ (6-10): 5 x GatedDeltaNetBlock(
279
+ (attn_norm): RMSNorm(1024, eps=1e-06)
280
+ (attn): GatedDeltaNet(
281
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
282
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
283
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
284
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
285
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
286
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
287
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
288
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
289
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
290
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
291
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
292
+ )
293
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
294
+ (mlp): GatedMLP(
295
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
296
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
297
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
298
+ (swiglu_linear): SwiGLULinear()
299
+ )
300
+ )
301
+ (11): GatedDeltaNetBlock(
302
+ (attn_norm): RMSNorm(1024, eps=1e-06)
303
+ (attn): NativeSparseAttention(
304
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
305
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
306
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
307
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
308
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
309
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
310
+ )
311
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
312
+ (mlp): GatedMLP(
313
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
314
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
315
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
316
+ (swiglu_linear): SwiGLULinear()
317
+ )
318
+ )
319
+ (12-16): 5 x GatedDeltaNetBlock(
320
+ (attn_norm): RMSNorm(1024, eps=1e-06)
321
+ (attn): GatedDeltaNet(
322
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
323
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
324
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
325
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
326
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
327
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
328
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
329
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
330
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
331
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
332
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
333
+ )
334
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
335
+ (mlp): GatedMLP(
336
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
337
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
338
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
339
+ (swiglu_linear): SwiGLULinear()
340
+ )
341
+ )
342
+ (17): GatedDeltaNetBlock(
343
+ (attn_norm): RMSNorm(1024, eps=1e-06)
344
+ (attn): NativeSparseAttention(
345
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
346
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
347
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
348
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
349
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
350
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
351
+ )
352
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
353
+ (mlp): GatedMLP(
354
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
355
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
356
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
357
+ (swiglu_linear): SwiGLULinear()
358
+ )
359
+ )
360
+ (18-22): 5 x GatedDeltaNetBlock(
361
+ (attn_norm): RMSNorm(1024, eps=1e-06)
362
+ (attn): GatedDeltaNet(
363
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
364
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
365
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
366
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
367
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
368
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
369
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
370
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
371
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
372
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
373
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
374
+ )
375
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
376
+ (mlp): GatedMLP(
377
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
378
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
379
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
380
+ (swiglu_linear): SwiGLULinear()
381
+ )
382
+ )
383
+ (23): GatedDeltaNetBlock(
384
+ (attn_norm): RMSNorm(1024, eps=1e-06)
385
+ (attn): NativeSparseAttention(
386
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
387
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
388
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
389
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
390
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
391
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
392
+ )
393
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
394
+ (mlp): GatedMLP(
395
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
396
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
397
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
398
+ (swiglu_linear): SwiGLULinear()
399
+ )
400
+ )
401
+ )
402
+ (norm): RMSNorm(1024, eps=1e-06)
403
+ )
404
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
405
+ (criterion): FusedLinearCrossEntropyLoss()
406
+ )
407
+
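The NativeSparseAttention projection widths in the printout follow from the attn block of the config, assuming a per-head dimension of 64 (the rotary dim shown above); the 3x gate width is an assumption about NSA's three branches:

hidden, heads, kv_heads, head_dim = 1024, 32, 2, 64
assert heads * head_dim == 2048    # q_proj out / o_proj in
assert kv_heads * head_dim == 128  # k_proj and v_proj out (GQA, 2 kv heads)
assert heads * 3 == 96             # g_proj out: one gate per head per branch
                                   # (cmp/slc/swa) -- assumption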
408
+ [titan] 2025-07-24 18:20:03,490 - root - INFO - Compiling each block with torch.compile
409
+ [titan] 2025-07-24 18:20:03,491 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
410
+ [titan] 2025-07-24 18:20:03,491 - root - INFO - Compiling the entire model with torch.compile
411
+ [titan] 2025-07-24 18:20:03,573 - root - INFO - Applied FSDP to the model
412
+ [titan] 2025-07-24 18:20:04,119 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
413
+ [titan] 2025-07-24 18:20:04,120 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
414
+ [titan] 2025-07-24 18:20:04,215 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
415
+ [titan] 2025-07-24 18:20:04,217 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
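The warning above is pure bookkeeping: decay_ratio = 1.0 makes the decay phase span the whole run, so warmup + decay overshoots the step budget and decay is clipped. A sketch:

steps, warmup = 95_366, 100
decay = int(1.0 * steps)    # decay_ratio = 1.0 from the lr_scheduler config
if warmup + decay > steps:
    decay = steps - warmup  # 95,266, the adjusted value in the warning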
416
+ [titan] 2025-07-24 18:20:04,240 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
417
+ [titan] 2025-07-24 18:20:05,365 - root - ERROR - Failed to create WandB logger: api_key not configured (no-tty). call wandb.login(key=[your_api_key])
418
+ [titan] 2025-07-24 18:20:05,376 - root - INFO - TensorBoard logging enabled. Logs will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/tb/20250724-1820
419
+ [titan] 2025-07-24 18:20:05,376 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
420
+ [titan] 2025-07-24 18:20:05,441 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
421
+ [titan] 2025-07-24 18:20:11,953 - root - INFO - ***** Running training *****
422
+ [titan] 2025-07-24 18:20:11,954 - root - INFO -  Training starts at step 1
423
+ [titan] 2025-07-24 18:20:11,956 - root - INFO -  Number of tokens per sequence = 8,192
424
+ [titan] 2025-07-24 18:20:11,956 - root - INFO -  Gradient Accumulation steps = 2
425
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Instantaneous batch size (per device) = 8
426
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
427
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
428
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
429
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Number of parameters = 396,695,712 
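The batch and token counts in this banner compose as follows (a consistency check, not flame code):

per_device, dp_ranks, grad_accum, seq_len = 8, 8, 2, 8192
global_batch = per_device * dp_ranks * grad_accum  # 128 sequences per step
tokens_per_step = global_batch * seq_len           # 1,048,576 tokens per step
assert tokens_per_step * 95_366 == 99_998_498_816  # total training tokens
assert tokens_per_step * 100 == 104_857_600        # warmup tokens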
430
+ [titan] 2025-07-24 18:20:11,957 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
431
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
432
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
433
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
434
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
435
+ [rank0]: Traceback (most recent call last):
436
+ [rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
437
+ [rank0]: File "<frozen runpy>", line 88, in _run_code
438
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
439
+ [rank0]: main(config)
440
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
441
+ [rank0]: return f(*args, **kwargs)
442
+ [rank0]: ^^^^^^^^^^^^^^^^^^
443
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
444
+ [rank0]: output = model(
445
+ [rank0]: ^^^^^^
446
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
447
+ [rank0]: return self._call_impl(*args, **kwargs)
448
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
449
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
450
+ [rank0]: return inner()
451
+ [rank0]: ^^^^^^^
452
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
453
+ [rank0]: result = forward_call(*args, **kwargs)
454
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
455
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
456
+ [rank0]: return func(*args, **kwargs)
457
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^
458
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
459
+ [rank0]: outputs = self.model(
460
+ [rank0]: ^^^^^^^^^^^
461
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
462
+ [rank0]: return self._call_impl(*args, **kwargs)
463
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
464
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
465
+ [rank0]: return forward_call(*args, **kwargs)
466
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
467
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
468
+ [rank0]: hidden_states, attentions, past_key_values = layer(
469
+ [rank0]: ^^^^^^
470
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
471
+ [rank0]: return self._call_impl(*args, **kwargs)
472
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
473
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
474
+ [rank0]: return inner()
475
+ [rank0]: ^^^^^^^
476
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
477
+ [rank0]: result = forward_call(*args, **kwargs)
478
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
479
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
480
+ [rank0]: return fn(*args, **kwargs)
481
+ [rank0]: ^^^^^^^^^^^^^^^^^^^
482
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
483
+ [rank0]: return self._call_impl(*args, **kwargs)
484
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
485
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
486
+ [rank0]: return forward_call(*args, **kwargs)
487
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
488
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
489
+ [rank0]: hidden_states = self.attn_norm(hidden_states)
490
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
491
+ [rank0]: hidden_states, attentions, past_key_values = self.attn(
492
+ [rank0]: ^^^^^^^^^^
493
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
494
+ [rank0]: return self._call_impl(*args, **kwargs)
495
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
496
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
497
+ [rank0]: return forward_call(*args, **kwargs)
498
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
499
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
500
+ [rank0]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
501
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
502
+ [rank0]: o = parallel_nsa(
503
+ [rank0]: ^^^^^^^^^^^^^
504
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
505
+ [rank0]: o_cmp, lse_cmp = parallel_nsa_compression(
506
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
507
+ [rank0]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
508
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
509
+ [rank0]: return self._torchdynamo_orig_callable(
510
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
511
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
512
+ [rank0]: result = self._inner_convert(
513
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^
514
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
515
+ [rank0]: return _compile(
516
+ [rank0]: ^^^^^^^^^
517
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
518
+ [rank0]: raise InternalTorchDynamoError(
519
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
520
+ [rank0]: guarded_code = compile_inner(code, one_graph, hooks, transform)
521
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
522
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
523
+ [rank0]: return function(*args, **kwargs)
524
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
525
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
526
+ [rank0]: return _compile_inner(code, one_graph, hooks, transform)
527
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
528
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
529
+ [rank0]: out_code = transform_code_object(code, transform)
530
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
531
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
532
+ [rank0]: transformations(instructions, code_options)
533
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
534
+ [rank0]: return fn(*args, **kwargs)
535
+ [rank0]: ^^^^^^^^^^^^^^^^^^^
536
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
537
+ [rank0]: tracer.run()
538
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
539
+ [rank0]: super().run()
540
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
541
+ [rank0]: while self.step():
542
+ [rank0]: ^^^^^^^^^^^
543
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
544
+ [rank0]: self.dispatch_table[inst.opcode](self, inst)
545
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
546
+ [rank0]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
547
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
548
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
549
+ [rank0]: return handler(tx, args, kwargs)
550
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
551
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
552
+ [rank0]: return lambda tx, args, kwargs: obj.call_function(
553
+ [rank0]: ^^^^^^^^^^^^^^^^^^
554
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
555
+ [rank0]: return handler(tx, args, kwargs)
556
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^
557
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
558
+ [rank0]: rv = fn(tx, args, kwargs)
559
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^
560
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
561
+ [rank0]: rv = fn(tx, *args)
562
+ [rank0]: ^^^^^^^^^^^^^
563
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
564
+ [rank0]: return ConstantVariable(op(a.value, b.value))
565
+ [rank0]: ^^^^^^^^^^^^^^^^^^^^
566
+ [rank0]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
567
+
568
+ [rank0]: from user code:
569
+ [rank0]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
570
+ [rank0]: if window_size > 0:
571
+
572
+ [rank0]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
573
+
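Both attempts die at the same spot: fla/ops/nsa/parallel.py evaluates `window_size > 0` while the config above sets "window_size": null, which reaches the op as None and makes the comparison raise under Python 3. A minimal sketch of a None-safe guard (an assumed fix, not the upstream patch):

def use_sliding_window(window_size) -> bool:
    # `if window_size > 0:` raises TypeError when window_size is None;
    # treating None as "no sliding-window branch" avoids the crash.
    return window_size is not None and window_size > 0

assert use_sliding_window(None) is False
assert use_sliding_window(512) is True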
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/0/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/1/error.json ADDED
@@ -0,0 +1 @@
1
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, 
block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File 
\"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352474"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/1/stderr.log ADDED
@@ -0,0 +1,571 @@
1
+ [titan] 2025-07-24 18:19:06,932 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-24 18:19:06,933 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "bfloat16",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-24 18:19:06,933 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-24 18:19:07,405 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-24 18:19:07,407 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-24 18:19:07,464 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-24 18:19:07,465 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-24 18:19:07,465 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-24 18:19:07,519 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-24 18:19:07,629 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-24 18:19:07,629 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ [titan] 2025-07-24 18:19:07,764 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
147
+ IterableDataset({
148
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
149
+ num_shards: 140
150
+ })
151
+ [titan] 2025-07-24 18:19:07,764 - root - INFO - Shuffling the dataset with seed 42
152
+ [titan] 2025-07-24 18:19:07,764 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
153
+ [titan] 2025-07-24 18:20:12,516 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
154
+ IterableDataset({
155
+ features: ['repo', 'content'],
156
+ num_shards: 1
157
+ })
158
+ [titan] 2025-07-24 18:20:12,516 - root - INFO - Shuffling the dataset with seed 42
159
+ [titan] 2025-07-24 18:20:12,516 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
160
+ [titan] 2025-07-24 18:20:12,876 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
161
+ IterableDataset({
162
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
163
+ num_shards: 100
164
+ })
165
+ [titan] 2025-07-24 18:20:12,877 - root - INFO - Shuffling the dataset with seed 42
166
+ [titan] 2025-07-24 18:20:12,877 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
167
+ [titan] 2025-07-24 18:20:19,318 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
168
+ [titan] 2025-07-24 18:20:20,040 - root - INFO - IterableDataset({
169
+ features: ['text', 'content'],
170
+ num_shards: 256
171
+ })
172
+ [titan] 2025-07-24 18:20:20,162 - root - INFO - Building dataloader...
173
+ [titan] 2025-07-24 18:20:20,164 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
174
+ [titan] 2025-07-24 18:20:20,166 - root - INFO - Building model from the config
175
+ GatedDeltaNetConfig {
176
+ "allow_neg_eigval": false,
177
+ "architectures": [
178
+ "GatedDeltaNetForCausalLM"
179
+ ],
180
+ "attn": {
181
+ "block_counts": 16,
182
+ "block_size": 64,
183
+ "layers": [
184
+ 5,
185
+ 11,
186
+ 17,
187
+ 23
188
+ ],
189
+ "num_heads": 32,
190
+ "num_kv_heads": 2,
191
+ "qkv_bias": false,
192
+ "rope_theta": 160000.0,
193
+ "type": "nsa",
194
+ "window_size": null
195
+ },
196
+ "attn_mode": "chunk",
197
+ "bos_token_id": 1,
198
+ "conv_size": 4,
199
+ "eos_token_id": 2,
200
+ "expand_k": 1,
201
+ "expand_v": 1,
202
+ "fuse_cross_entropy": true,
203
+ "fuse_norm": true,
204
+ "fuse_swiglu": true,
205
+ "head_dim": 256,
206
+ "hidden_act": "swish",
207
+ "hidden_ratio": 4,
208
+ "hidden_size": 1024,
209
+ "initializer_range": 0.02,
210
+ "intermediate_size": null,
211
+ "max_position_embeddings": 8192,
212
+ "model_type": "gated_deltanet",
213
+ "norm_eps": 1e-06,
214
+ "norm_first": false,
215
+ "num_heads": 4,
216
+ "num_hidden_layers": 24,
217
+ "num_v_heads": null,
218
+ "qk_activation": "silu",
219
+ "qk_norm": "l2",
220
+ "tie_word_embeddings": false,
221
+ "torch_dtype": "bfloat16",
222
+ "transformers_version": "4.53.3",
223
+ "use_beta": true,
224
+ "use_cache": true,
225
+ "use_gate": true,
226
+ "use_l2warp": false,
227
+ "use_output_norm": true,
228
+ "use_short_conv": true,
229
+ "vocab_size": 32000
230
+ }
231
+ 
232
+ [titan] 2025-07-24 18:20:20,496 - root - INFO - 
233
+ GatedDeltaNetForCausalLM(
234
+ (model): GatedDeltaNetModel(
235
+ (embeddings): Embedding(32000, 1024)
236
+ (layers): ModuleList(
237
+ (0-4): 5 x GatedDeltaNetBlock(
238
+ (attn_norm): RMSNorm(1024, eps=1e-06)
239
+ (attn): GatedDeltaNet(
240
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
241
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
242
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
243
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
244
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
245
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
246
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
247
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
248
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
249
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
250
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
251
+ )
252
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
253
+ (mlp): GatedMLP(
254
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
255
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
256
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
257
+ (swiglu_linear): SwiGLULinear()
258
+ )
259
+ )
260
+ (5): GatedDeltaNetBlock(
261
+ (attn_norm): RMSNorm(1024, eps=1e-06)
262
+ (attn): NativeSparseAttention(
263
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
264
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
265
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
266
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
267
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
268
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
269
+ )
270
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
271
+ (mlp): GatedMLP(
272
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
273
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
274
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
275
+ (swiglu_linear): SwiGLULinear()
276
+ )
277
+ )
278
+ (6-10): 5 x GatedDeltaNetBlock(
279
+ (attn_norm): RMSNorm(1024, eps=1e-06)
280
+ (attn): GatedDeltaNet(
281
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
282
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
283
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
284
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
285
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
286
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
287
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
288
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
289
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
290
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
291
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
292
+ )
293
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
294
+ (mlp): GatedMLP(
295
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
296
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
297
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
298
+ (swiglu_linear): SwiGLULinear()
299
+ )
300
+ )
301
+ (11): GatedDeltaNetBlock(
302
+ (attn_norm): RMSNorm(1024, eps=1e-06)
303
+ (attn): NativeSparseAttention(
304
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
305
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
306
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
307
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
308
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
309
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
310
+ )
311
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
312
+ (mlp): GatedMLP(
313
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
314
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
315
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
316
+ (swiglu_linear): SwiGLULinear()
317
+ )
318
+ )
319
+ (12-16): 5 x GatedDeltaNetBlock(
320
+ (attn_norm): RMSNorm(1024, eps=1e-06)
321
+ (attn): GatedDeltaNet(
322
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
323
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
324
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
325
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
326
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
327
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
328
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
329
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
330
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
331
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
332
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
333
+ )
334
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
335
+ (mlp): GatedMLP(
336
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
337
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
338
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
339
+ (swiglu_linear): SwiGLULinear()
340
+ )
341
+ )
342
+ (17): GatedDeltaNetBlock(
343
+ (attn_norm): RMSNorm(1024, eps=1e-06)
344
+ (attn): NativeSparseAttention(
345
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
346
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
347
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
348
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
349
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
350
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
351
+ )
352
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
353
+ (mlp): GatedMLP(
354
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
355
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
356
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
357
+ (swiglu_linear): SwiGLULinear()
358
+ )
359
+ )
360
+ (18-22): 5 x GatedDeltaNetBlock(
361
+ (attn_norm): RMSNorm(1024, eps=1e-06)
362
+ (attn): GatedDeltaNet(
363
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
364
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
365
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
366
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
367
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
368
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
369
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
370
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
371
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
372
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
373
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
374
+ )
375
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
376
+ (mlp): GatedMLP(
377
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
378
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
379
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
380
+ (swiglu_linear): SwiGLULinear()
381
+ )
382
+ )
383
+ (23): GatedDeltaNetBlock(
384
+ (attn_norm): RMSNorm(1024, eps=1e-06)
385
+ (attn): NativeSparseAttention(
386
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
387
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
388
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
389
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
390
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
391
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
392
+ )
393
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
394
+ (mlp): GatedMLP(
395
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
396
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
397
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
398
+ (swiglu_linear): SwiGLULinear()
399
+ )
400
+ )
401
+ )
402
+ (norm): RMSNorm(1024, eps=1e-06)
403
+ )
404
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
405
+ (criterion): FusedLinearCrossEntropyLoss()
406
+ )
407
+
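Aside: the NativeSparseAttention projection shapes printed above follow from the attn config (num_heads=32, num_kv_heads=2, hidden_size=1024): head_dim = 1024 / 32 = 64 (matching the rotary dim=64), so q_proj maps 1024 → 32 × 64 = 2048, the grouped-query k_proj/v_proj map 1024 → 2 × 64 = 128, and o_proj maps 2048 back to 1024. The g_proj width of 96 = 32 heads × 3 is consistent with one gate logit per NSA branch (compression / selection / sliding window), though that reading is inferred from the NSA design, not stated in this log. A quick check:

    hidden_size, num_heads, num_kv_heads = 1024, 32, 2
    head_dim = hidden_size // num_heads   # 64
    q_out = num_heads * head_dim          # 32 * 64 = 2048
    kv_out = num_kv_heads * head_dim      #  2 * 64 = 128
    g_out = num_heads * 3                 # 96, one gate per NSA branch (assumed)
    assert (q_out, kv_out, g_out) == (2048, 128, 96)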
408
+ [titan] 2025-07-24 18:20:20,532 - root - INFO - Compiling each block with torch.compile
409
+ [titan] 2025-07-24 18:20:20,532 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
410
+ [titan] 2025-07-24 18:20:20,533 - root - INFO - Compiling the entire model with torch.compile
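Aside: the three log lines above indicate block-wise compilation followed by compilation of the outer pieces, a common way to keep Dynamo graphs small and cache-friendly. A hedged sketch of that pattern (an assumption about the general technique; flame's actual compile wiring is not shown in this log):

    # Compile each transformer block separately, then the cheap outer modules.
    # nn.Module.compile() compiles the module in place (PyTorch >= 2.2).
    for block in model.model.layers:
        block.compile()
    model.model.embeddings.compile()
    model.model.norm.compile()
    model.lm_head.compile()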
411
+ [titan] 2025-07-24 18:20:20,617 - root - INFO - Applied FSDP to the model
412
+ [titan] 2025-07-24 18:20:20,680 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
413
+ [titan] 2025-07-24 18:20:20,680 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
414
+ [titan] 2025-07-24 18:20:20,776 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
415
+ [titan] 2025-07-24 18:20:20,778 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
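Aside: the adjustment above is exact arithmetic: decay_ratio = 1.0 makes the decay phase span all 95,366 steps, so warmup (100) + decay (95,366) overshoots the total by exactly the warmup length, and decay is clamped to 95,366 - 100 = 95,266 steps. A minimal sketch of the clamp (variable names are assumptions, not flame's code):

    total_steps, warmup_steps, decay_ratio = 95_366, 100, 1.0
    decay_steps = int(total_steps * decay_ratio)    # 95,366
    if warmup_steps + decay_steps > total_steps:
        decay_steps = total_steps - warmup_steps    # 95,266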
416
+ [titan] 2025-07-24 18:20:20,802 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
417
+ [titan] 2025-07-24 18:20:20,803 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
418
+ [titan] 2025-07-24 18:20:20,860 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
419
+ [titan] 2025-07-24 18:20:29,317 - root - INFO - ***** Running training *****
420
+ [titan] 2025-07-24 18:20:29,347 - root - INFO -  Training starts at step 1
421
+ [titan] 2025-07-24 18:20:29,349 - root - INFO -  Number of tokens per sequence = 8,192
422
+ [titan] 2025-07-24 18:20:29,350 - root - INFO -  Gradient Accumulation steps = 2
423
+ [titan] 2025-07-24 18:20:29,350 - root - INFO -  Instantaneous batch size (per device) = 8
424
+ [titan] 2025-07-24 18:20:29,350 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
425
+ [titan] 2025-07-24 18:20:29,350 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
426
+ [titan] 2025-07-24 18:20:29,350 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
427
+ [titan] 2025-07-24 18:20:29,350 - root - INFO -  Number of parameters = 396,695,712 
428
+ [titan] 2025-07-24 18:20:29,351 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
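Aside: the batch and token figures above are internally consistent: 8 sequences/device × 8 DP ranks × 2 gradient-accumulation steps = 128 sequences per optimization step; 128 × 8,192 tokens = 1,048,576 tokens/step; and 95,366 steps × 1,048,576 = 99,998,498,816 tokens, i.e. the ~100B-token budget implied by the run name. Verified:

    per_device_bs, dp_ranks, grad_accum, seq_len, steps = 8, 8, 2, 8192, 95_366
    global_bs = per_device_bs * dp_ranks * grad_accum   # 128 sequences
    tokens_per_step = global_bs * seq_len               # 1,048,576
    assert tokens_per_step == 1_048_576
    assert steps * tokens_per_step == 99_998_498_816    # ~100B tokens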
429
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
430
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
431
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
432
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
433
+ [rank1]: Traceback (most recent call last):
434
+ [rank1]: File "<frozen runpy>", line 198, in _run_module_as_main
435
+ [rank1]: File "<frozen runpy>", line 88, in _run_code
436
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
437
+ [rank1]: main(config)
438
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
439
+ [rank1]: return f(*args, **kwargs)
440
+ [rank1]: ^^^^^^^^^^^^^^^^^^
441
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
442
+ [rank1]: output = model(
443
+ [rank1]: ^^^^^^
444
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
445
+ [rank1]: return self._call_impl(*args, **kwargs)
446
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
447
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
448
+ [rank1]: return inner()
449
+ [rank1]: ^^^^^^^
450
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
451
+ [rank1]: result = forward_call(*args, **kwargs)
452
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
453
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
454
+ [rank1]: return func(*args, **kwargs)
455
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^
456
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
457
+ [rank1]: outputs = self.model(
458
+ [rank1]: ^^^^^^^^^^^
459
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
460
+ [rank1]: return self._call_impl(*args, **kwargs)
461
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
462
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
463
+ [rank1]: return forward_call(*args, **kwargs)
464
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
465
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
466
+ [rank1]: hidden_states, attentions, past_key_values = layer(
467
+ [rank1]: ^^^^^^
468
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
469
+ [rank1]: return self._call_impl(*args, **kwargs)
470
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
471
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
472
+ [rank1]: return inner()
473
+ [rank1]: ^^^^^^^
474
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
475
+ [rank1]: result = forward_call(*args, **kwargs)
476
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
477
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
478
+ [rank1]: return fn(*args, **kwargs)
479
+ [rank1]: ^^^^^^^^^^^^^^^^^^^
480
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
481
+ [rank1]: return self._call_impl(*args, **kwargs)
482
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
483
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
484
+ [rank1]: return forward_call(*args, **kwargs)
485
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
486
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
487
+ [rank1]: hidden_states = self.attn_norm(hidden_states)
488
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
489
+ [rank1]: hidden_states, attentions, past_key_values = self.attn(
490
+ [rank1]: ^^^^^^^^^^
491
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
492
+ [rank1]: return self._call_impl(*args, **kwargs)
493
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
494
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
495
+ [rank1]: return forward_call(*args, **kwargs)
496
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
497
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
498
+ [rank1]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
499
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
500
+ [rank1]: o = parallel_nsa(
501
+ [rank1]: ^^^^^^^^^^^^^
502
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
503
+ [rank1]: o_cmp, lse_cmp = parallel_nsa_compression(
504
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
505
+ [rank1]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
506
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
507
+ [rank1]: return self._torchdynamo_orig_callable(
508
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
509
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
510
+ [rank1]: result = self._inner_convert(
511
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^
512
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
513
+ [rank1]: return _compile(
514
+ [rank1]: ^^^^^^^^^
515
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
516
+ [rank1]: raise InternalTorchDynamoError(
517
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
518
+ [rank1]: guarded_code = compile_inner(code, one_graph, hooks, transform)
519
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
520
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
521
+ [rank1]: return function(*args, **kwargs)
522
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
523
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
524
+ [rank1]: return _compile_inner(code, one_graph, hooks, transform)
525
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
526
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
527
+ [rank1]: out_code = transform_code_object(code, transform)
528
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
529
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
530
+ [rank1]: transformations(instructions, code_options)
531
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
532
+ [rank1]: return fn(*args, **kwargs)
533
+ [rank1]: ^^^^^^^^^^^^^^^^^^^
534
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
535
+ [rank1]: tracer.run()
536
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
537
+ [rank1]: super().run()
538
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
539
+ [rank1]: while self.step():
540
+ [rank1]: ^^^^^^^^^^^
541
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
542
+ [rank1]: self.dispatch_table[inst.opcode](self, inst)
543
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
544
+ [rank1]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
545
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
546
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
547
+ [rank1]: return handler(tx, args, kwargs)
548
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
549
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
550
+ [rank1]: return lambda tx, args, kwargs: obj.call_function(
551
+ [rank1]: ^^^^^^^^^^^^^^^^^^
552
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
553
+ [rank1]: return handler(tx, args, kwargs)
554
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
555
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
556
+ [rank1]: rv = fn(tx, args, kwargs)
557
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^
558
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
559
+ [rank1]: rv = fn(tx, *args)
560
+ [rank1]: ^^^^^^^^^^^^^
561
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
562
+ [rank1]: return ConstantVariable(op(a.value, b.value))
563
+ [rank1]: ^^^^^^^^^^^^^^^^^^^^
564
+ [rank1]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
565
+
566
+ [rank1]: from user code:
567
+ [rank1]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
568
+ [rank1]: if window_size > 0:
569
+
570
+ [rank1]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
571
+
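Aside: the crash is an ordinary Python comparison bug surfaced eagerly by Dynamo's constant folding: the model config sets "window_size": null, so fla/ops/nsa/parallel.py reaches `if window_size > 0:` with window_size = None, and `NoneType > int` raises TypeError. A None-safe guard is the natural repair; the sketch below is an assumption about the shape of such a fix, not the actual upstream patch:

    from typing import Optional

    def sliding_window_enabled(window_size: Optional[int]) -> bool:
        # Treat None and 0 alike as "sliding-window branch disabled",
        # so the comparison never sees a NoneType operand.
        return window_size is not None and window_size > 0

    # if sliding_window_enabled(window_size):
    #     ...  # sliding-window branch (elided)

Equivalently, setting "window_size": 0 in the attn config would sidestep the comparison without touching the kernel, assuming the library treats 0 as disabled.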
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/1/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/2/error.json ADDED
@@ -0,0 +1 @@
1
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, 
block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File 
\"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352475"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/2/stderr.log ADDED
@@ -0,0 +1,571 @@
1
+ [titan] 2025-07-24 18:19:03,183 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-24 18:19:03,188 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "bfloat16",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-24 18:19:03,189 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-24 18:19:04,471 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-24 18:19:04,473 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-24 18:19:04,528 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-24 18:19:04,528 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-24 18:19:04,528 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-24 18:19:04,534 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-24 18:19:04,646 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-24 18:19:04,646 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ [titan] 2025-07-24 18:19:04,780 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
147
+ IterableDataset({
148
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
149
+ num_shards: 140
150
+ })
151
+ [titan] 2025-07-24 18:19:04,780 - root - INFO - Shuffling the dataset with seed 42
152
+ [titan] 2025-07-24 18:19:04,780 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
153
+ [titan] 2025-07-24 18:20:10,498 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
154
+ IterableDataset({
155
+ features: ['repo', 'content'],
156
+ num_shards: 1
157
+ })
158
+ [titan] 2025-07-24 18:20:10,498 - root - INFO - Shuffling the dataset with seed 42
159
+ [titan] 2025-07-24 18:20:10,498 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
160
+ [titan] 2025-07-24 18:20:10,846 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
161
+ IterableDataset({
162
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
163
+ num_shards: 100
164
+ })
165
+ [titan] 2025-07-24 18:20:10,846 - root - INFO - Shuffling the dataset with seed 42
166
+ [titan] 2025-07-24 18:20:10,846 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
167
+ [titan] 2025-07-24 18:20:17,401 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
168
+ [titan] 2025-07-24 18:20:18,103 - root - INFO - IterableDataset({
169
+ features: ['text', 'content'],
170
+ num_shards: 256
171
+ })
172
+ [titan] 2025-07-24 18:20:18,229 - root - INFO - Building dataloader...
+ [titan] 2025-07-24 18:20:18,232 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
+ [titan] 2025-07-24 18:20:18,234 - root - INFO - Building model from the config
+ GatedDeltaNetConfig {
+ "allow_neg_eigval": false,
+ "architectures": [
+ "GatedDeltaNetForCausalLM"
+ ],
+ "attn": {
+ "block_counts": 16,
+ "block_size": 64,
+ "layers": [
+ 5,
+ 11,
+ 17,
+ 23
+ ],
+ "num_heads": 32,
+ "num_kv_heads": 2,
+ "qkv_bias": false,
+ "rope_theta": 160000.0,
+ "type": "nsa",
+ "window_size": null
+ },
+ "attn_mode": "chunk",
+ "bos_token_id": 1,
+ "conv_size": 4,
+ "eos_token_id": 2,
+ "expand_k": 1,
+ "expand_v": 1,
+ "fuse_cross_entropy": true,
+ "fuse_norm": true,
+ "fuse_swiglu": true,
+ "head_dim": 256,
+ "hidden_act": "swish",
+ "hidden_ratio": 4,
+ "hidden_size": 1024,
+ "initializer_range": 0.02,
+ "intermediate_size": null,
+ "max_position_embeddings": 8192,
+ "model_type": "gated_deltanet",
+ "norm_eps": 1e-06,
+ "norm_first": false,
+ "num_heads": 4,
+ "num_hidden_layers": 24,
+ "num_v_heads": null,
+ "qk_activation": "silu",
+ "qk_norm": "l2",
+ "tie_word_embeddings": false,
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.53.3",
+ "use_beta": true,
+ "use_cache": true,
+ "use_gate": true,
+ "use_l2warp": false,
+ "use_output_norm": true,
+ "use_short_conv": true,
+ "vocab_size": 32000
+ }
+ 
+ [titan] 2025-07-24 18:20:18,567 - root - INFO - 
+ GatedDeltaNetForCausalLM(
+ (model): GatedDeltaNetModel(
+ (embeddings): Embedding(32000, 1024)
+ (layers): ModuleList(
+ (0-4): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (5): GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): NativeSparseAttention(
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (6-10): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (11): GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): NativeSparseAttention(
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (12-16): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (17): GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): NativeSparseAttention(
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (18-22): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (23): GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): NativeSparseAttention(
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ )
+ (norm): RMSNorm(1024, eps=1e-06)
+ )
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
+ (criterion): FusedLinearCrossEntropyLoss()
+ )
+
+ [titan] 2025-07-24 18:20:18,604 - root - INFO - Compiling each block with torch.compile
+ [titan] 2025-07-24 18:20:18,604 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
+ [titan] 2025-07-24 18:20:18,605 - root - INFO - Compiling the entire model with torch.compile
+ [titan] 2025-07-24 18:20:18,690 - root - INFO - Applied FSDP to the model
+ [titan] 2025-07-24 18:20:18,757 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
+ [titan] 2025-07-24 18:20:18,757 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
+ [titan] 2025-07-24 18:20:18,858 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
+ [titan] 2025-07-24 18:20:18,859 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
+ [titan] 2025-07-24 18:20:18,884 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
+ [titan] 2025-07-24 18:20:18,885 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-24 18:20:18,944 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-24 18:20:27,383 - root - INFO - ***** Running training *****
+ [titan] 2025-07-24 18:20:27,385 - root - INFO -  Training starts at step 1
+ [titan] 2025-07-24 18:20:27,386 - root - INFO -  Number of tokens per sequence = 8,192
+ [titan] 2025-07-24 18:20:27,387 - root - INFO -  Gradient Accumulation steps = 2
+ [titan] 2025-07-24 18:20:27,387 - root - INFO -  Instantaneous batch size (per device) = 8
+ [titan] 2025-07-24 18:20:27,387 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
+ [titan] 2025-07-24 18:20:27,387 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
+ [titan] 2025-07-24 18:20:27,387 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
+ [titan] 2025-07-24 18:20:27,387 - root - INFO -  Number of parameters = 396,695,712
+ [titan] 2025-07-24 18:20:27,387 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
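
The batch and token figures in the block above follow from the job config by simple arithmetic; a quick sanity check in plain Python (this mirrors the logged numbers, not any flame internals):

# Global batch = per-device batch x data-parallel ranks x grad-accum steps.
batch_size, dp_ranks, grad_accum, seq_len = 8, 8, 2, 8192
global_batch = batch_size * dp_ranks * grad_accum
assert global_batch == 128
assert global_batch * seq_len == 1_048_576        # tokens per optimizer step

steps, warmup_steps = 95_366, 100
assert steps * 1_048_576 == 99_998_498_816        # ~100B training tokens total
assert warmup_steps * 1_048_576 == 104_857_600    # warmup tokens

# The scheduler warning earlier: with decay_ratio 1.0 the decay span equals
# the full step budget, so it is clipped to steps - warmup_steps.
assert steps - warmup_steps == 95_266
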
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
+ [rank2]: Traceback (most recent call last):
+ [rank2]: File "<frozen runpy>", line 198, in _run_module_as_main
+ [rank2]: File "<frozen runpy>", line 88, in _run_code
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
+ [rank2]: main(config)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+ [rank2]: return f(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
+ [rank2]: output = model(
+ [rank2]: ^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank2]: return inner()
+ [rank2]: ^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank2]: result = forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
+ [rank2]: return func(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
+ [rank2]: outputs = self.model(
+ [rank2]: ^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank2]: return forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
+ [rank2]: hidden_states, attentions, past_key_values = layer(
+ [rank2]: ^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank2]: return inner()
+ [rank2]: ^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank2]: result = forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
+ [rank2]: return fn(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank2]: return forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
+ [rank2]: hidden_states = self.attn_norm(hidden_states)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
+ [rank2]: hidden_states, attentions, past_key_values = self.attn(
+ [rank2]: ^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank2]: return self._call_impl(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank2]: return forward_call(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
+ [rank2]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
+ [rank2]: o = parallel_nsa(
+ [rank2]: ^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
+ [rank2]: o_cmp, lse_cmp = parallel_nsa_compression(
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
+ [rank2]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
+ [rank2]: return self._torchdynamo_orig_callable(
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
+ [rank2]: result = self._inner_convert(
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
+ [rank2]: return _compile(
+ [rank2]: ^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
+ [rank2]: raise InternalTorchDynamoError(
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
+ [rank2]: guarded_code = compile_inner(code, one_graph, hooks, transform)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
+ [rank2]: return function(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
+ [rank2]: return _compile_inner(code, one_graph, hooks, transform)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
+ [rank2]: out_code = transform_code_object(code, transform)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
+ [rank2]: transformations(instructions, code_options)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
+ [rank2]: return fn(*args, **kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
+ [rank2]: tracer.run()
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
+ [rank2]: super().run()
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
+ [rank2]: while self.step():
+ [rank2]: ^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
+ [rank2]: self.dispatch_table[inst.opcode](self, inst)
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
+ [rank2]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
+ [rank2]: return handler(tx, args, kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
+ [rank2]: return lambda tx, args, kwargs: obj.call_function(
+ [rank2]: ^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
+ [rank2]: return handler(tx, args, kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
+ [rank2]: rv = fn(tx, args, kwargs)
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
+ [rank2]: rv = fn(tx, *args)
+ [rank2]: ^^^^^^^^^^^^^
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
+ [rank2]: return ConstantVariable(op(a.value, b.value))
+ [rank2]: ^^^^^^^^^^^^^^^^^^^^
+ [rank2]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
+
+ [rank2]: from user code:
+ [rank2]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
+ [rank2]: if window_size > 0:
+
+ [rank2]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
+
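
The failure itself is a plain Python type error that Dynamo surfaces while tracing: the hybrid config sets "window_size": null in its attn block, so fla/ops/nsa/parallel.py ends up evaluating None > 0. A minimal reproduction and the kind of None-safe guard that would sidestep it, as a sketch only (not the actual upstream patch):

# "window_size": null in the JSON config arrives in Python as None.
window_size = None

try:
    if window_size > 0:  # the failing comparison (parallel.py, line 862)
        pass
except TypeError as e:
    # '>' not supported between instances of 'NoneType' and 'int'
    print(e)

# None-safe variant: treat an unset window as "no sliding window".
if window_size is not None and window_size > 0:
    print("sliding-window branch")
else:
    print("no sliding window")
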
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/2/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/3/error.json ADDED
@@ -0,0 +1 @@
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, 
block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File 
\"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352475"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/3/stderr.log ADDED
@@ -0,0 +1,571 @@
+ [titan] 2025-07-24 18:19:05,907 - root - INFO - Starting job: default job
+ [titan] 2025-07-24 18:19:05,908 - root - INFO - {
+ "activation_checkpoint": {
+ "mode": "none",
+ "selective_ac_option": "2"
+ },
+ "activation_offload": {
+ "mode": "none"
+ },
+ "checkpoint": {
+ "async_mode": "disabled",
+ "create_seed_checkpoint": false,
+ "enable_checkpoint": true,
+ "exclude_from_loading": [],
+ "export_dtype": "bfloat16",
+ "folder": "checkpoint",
+ "interval": 8192,
+ "interval_type": "steps",
+ "keep_latest_k": 100,
+ "load_step": -1,
+ "model_weights_only": false
+ },
+ "comm": {
+ "init_timeout_seconds": 300,
+ "trace_buf_size": 20000,
+ "train_timeout_seconds": 100
+ },
+ "experimental": {
+ "context_parallel_degree": 1,
+ "context_parallel_rotate_method": "allgather",
+ "custom_model_path": "",
+ "enable_async_tensor_parallel": false,
+ "enable_compiled_autograd": false,
+ "pipeline_parallel_degree": 1,
+ "pipeline_parallel_microbatches": null,
+ "pipeline_parallel_schedule": "1F1B",
+ "pipeline_parallel_schedule_csv": "",
+ "pipeline_parallel_split_points": []
+ },
+ "fault_tolerance": {
+ "enable": false,
+ "group_size": 0,
+ "min_replica_size": 1,
+ "replica_id": 0
+ },
+ "float8": {
+ "enable_fsdp_float8_all_gather": false,
+ "force_recompute_fp8_weight_in_bwd": false,
+ "precompute_float8_dynamic_scale_for_fsdp": false,
+ "recipe_name": null
+ },
+ "job": {
+ "config_file": "flame/models/fla.toml",
+ "description": "default job",
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
+ "print_args": true,
+ "use_for_integration_test": false
+ },
+ "lr_scheduler": {
+ "decay_ratio": 1.0,
+ "decay_type": "linear",
+ "lr_min": 0.01,
+ "warmup_steps": 100
+ },
+ "memory_estimation": {
+ "disable_fake_mode": false,
+ "enabled": false
+ },
+ "metrics": {
+ "disable_color_printing": false,
+ "enable_tensorboard": true,
+ "enable_wandb": true,
+ "log_freq": 1,
+ "save_for_all_ranks": false,
+ "save_tb_folder": "tb"
+ },
+ "model": {
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
+ "converters": [],
+ "name": "fla",
+ "print_after_conversion": false,
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
+ },
+ "optimizer": {
+ "early_step_in_backward": false,
+ "eps": 1e-08,
+ "implementation": "fused",
+ "lr": 0.0003,
+ "name": "AdamW"
+ },
+ "profiling": {
+ "enable_memory_snapshot": false,
+ "enable_profiling": true,
+ "profile_freq": 512,
+ "save_memory_snapshot_folder": "memory_snapshot",
+ "save_traces_folder": "profile_trace"
+ },
+ "training": {
+ "batch_size": 8,
+ "compile": true,
+ "context_len": 8192,
+ "data_dir": null,
+ "data_files": null,
+ "data_parallel_replicate_degree": 1,
+ "data_parallel_shard_degree": -1,
+ "data_probs": "0.55,0.3,0.15",
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
+ "dataset_name": "default,default,default",
+ "dataset_split": "train,train,train",
+ "deterministic": false,
+ "disable_loss_parallel": false,
+ "enable_cpu_offload": false,
+ "fsdp_reshard_after_forward": "default",
+ "gc_freq": 50,
+ "gradient_accumulation_steps": 2,
+ "max_norm": 1.0,
+ "mixed_precision_param": "bfloat16",
+ "mixed_precision_reduce": "float32",
+ "num_workers": 32,
+ "persistent_workers": false,
+ "pin_memory": false,
+ "prefetch_factor": 2,
+ "seed": 42,
+ "seq_len": 8192,
+ "skip_nan_inf": true,
+ "steps": 95366,
+ "streaming": true,
+ "tensor_parallel_degree": 1,
+ "varlen": false
+ }
+ }
+ [titan] 2025-07-24 18:19:05,908 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
+ [titan] 2025-07-24 18:19:06,594 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
+ [titan] 2025-07-24 18:19:06,597 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
+ [titan] 2025-07-24 18:19:06,654 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
+ [titan] 2025-07-24 18:19:06,654 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
+ [titan] 2025-07-24 18:19:06,654 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
+ [titan] 2025-07-24 18:19:06,686 - root - INFO - Loading tokenizer...
+ [titan] 2025-07-24 18:19:06,794 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
+ }
+ )
+ [titan] 2025-07-24 18:19:06,794 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
+ [titan] 2025-07-24 18:19:06,906 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
+ IterableDataset({
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
+ num_shards: 140
+ })
+ [titan] 2025-07-24 18:19:06,906 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-24 18:19:06,906 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ [titan] 2025-07-24 18:19:57,464 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
+ IterableDataset({
+ features: ['repo', 'content'],
+ num_shards: 1
+ })
+ [titan] 2025-07-24 18:19:57,464 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-24 18:19:57,464 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ [titan] 2025-07-24 18:19:57,792 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
+ IterableDataset({
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
+ num_shards: 100
+ })
+ [titan] 2025-07-24 18:19:57,792 - root - INFO - Shuffling the dataset with seed 42
+ [titan] 2025-07-24 18:19:57,793 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
+ [titan] 2025-07-24 18:20:03,981 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
+ [titan] 2025-07-24 18:20:04,684 - root - INFO - IterableDataset({
+ features: ['text', 'content'],
+ num_shards: 256
+ })
+ [titan] 2025-07-24 18:20:04,815 - root - INFO - Building dataloader...
+ [titan] 2025-07-24 18:20:04,817 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
+ [titan] 2025-07-24 18:20:04,819 - root - INFO - Building model from the config
+ GatedDeltaNetConfig {
+ "allow_neg_eigval": false,
+ "architectures": [
+ "GatedDeltaNetForCausalLM"
+ ],
+ "attn": {
+ "block_counts": 16,
+ "block_size": 64,
+ "layers": [
+ 5,
+ 11,
+ 17,
+ 23
+ ],
+ "num_heads": 32,
+ "num_kv_heads": 2,
+ "qkv_bias": false,
+ "rope_theta": 160000.0,
+ "type": "nsa",
+ "window_size": null
+ },
+ "attn_mode": "chunk",
+ "bos_token_id": 1,
+ "conv_size": 4,
+ "eos_token_id": 2,
+ "expand_k": 1,
+ "expand_v": 1,
+ "fuse_cross_entropy": true,
+ "fuse_norm": true,
+ "fuse_swiglu": true,
+ "head_dim": 256,
+ "hidden_act": "swish",
+ "hidden_ratio": 4,
+ "hidden_size": 1024,
+ "initializer_range": 0.02,
+ "intermediate_size": null,
+ "max_position_embeddings": 8192,
+ "model_type": "gated_deltanet",
+ "norm_eps": 1e-06,
+ "norm_first": false,
+ "num_heads": 4,
+ "num_hidden_layers": 24,
+ "num_v_heads": null,
+ "qk_activation": "silu",
+ "qk_norm": "l2",
+ "tie_word_embeddings": false,
+ "torch_dtype": "bfloat16",
+ "transformers_version": "4.53.3",
+ "use_beta": true,
+ "use_cache": true,
+ "use_gate": true,
+ "use_l2warp": false,
+ "use_output_norm": true,
+ "use_short_conv": true,
+ "vocab_size": 32000
+ }
+ 
+ [titan] 2025-07-24 18:20:05,139 - root - INFO - 
+ GatedDeltaNetForCausalLM(
+ (model): GatedDeltaNetModel(
+ (embeddings): Embedding(32000, 1024)
+ (layers): ModuleList(
+ (0-4): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (5): GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): NativeSparseAttention(
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (6-10): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (11): GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): NativeSparseAttention(
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (12-16): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (17): GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): NativeSparseAttention(
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
+ )
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
+ (mlp): GatedMLP(
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
+ (swiglu_linear): SwiGLULinear()
+ )
+ )
+ (18-22): 5 x GatedDeltaNetBlock(
+ (attn_norm): RMSNorm(1024, eps=1e-06)
+ (attn): GatedDeltaNet(
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
368
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
369
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
370
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
371
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
372
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
373
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
374
+ )
375
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
376
+ (mlp): GatedMLP(
377
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
378
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
379
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
380
+ (swiglu_linear): SwiGLULinear()
381
+ )
382
+ )
383
+ (23): GatedDeltaNetBlock(
384
+ (attn_norm): RMSNorm(1024, eps=1e-06)
385
+ (attn): NativeSparseAttention(
386
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
387
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
388
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
389
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
390
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
391
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
392
+ )
393
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
394
+ (mlp): GatedMLP(
395
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
396
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
397
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
398
+ (swiglu_linear): SwiGLULinear()
399
+ )
400
+ )
401
+ )
402
+ (norm): RMSNorm(1024, eps=1e-06)
403
+ )
404
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
405
+ (criterion): FusedLinearCrossEntropyLoss()
406
+ )
407
+
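The projection widths in the module printout above are internally consistent. A quick check, reading head_dim = 256 for GatedDeltaNet (the o_norm width) and head_dim = 64 for NativeSparseAttention (its rotary dim); interpreting g_proj's 96 outputs as 32 heads gated across 3 NSA branches is an inference from the sizes, not something the log states:

# Shape bookkeeping for the blocks printed above (hidden_size = 1024).
hidden = 1024

# GatedDeltaNet: 4 heads (the a_proj/b_proj widths) x 256 head dim.
assert 4 * 256 == hidden        # q/k/v/g/o projections map 1024 -> 1024

# NativeSparseAttention: 32 query heads, 2 KV heads, 64 head dim.
assert 32 * 64 == 2048          # q_proj / o_proj width
assert 2 * 64 == 128            # k_proj and v_proj widths
assert 32 * 3 == 96             # g_proj: one gate per head per branch (assumed)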
408
+ [titan] 2025-07-24 18:20:05,174 - root - INFO - Compiling each block with torch.compile
409
+ [titan] 2025-07-24 18:20:05,174 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
410
+ [titan] 2025-07-24 18:20:05,174 - root - INFO - Compiling the entire model with torch.compile
411
+ [titan] 2025-07-24 18:20:05,254 - root - INFO - Applied FSDP to the model
412
+ [titan] 2025-07-24 18:20:05,399 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
413
+ [titan] 2025-07-24 18:20:05,399 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
414
+ [titan] 2025-07-24 18:20:05,495 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
415
+ [titan] 2025-07-24 18:20:05,497 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
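With decay_ratio = 1.0 the decay span initially equals the full run, so warmup + decay overshoots the step budget by exactly the warmup length. A plausible reconstruction of the adjustment being logged (a sketch, not necessarily the trainer's literal code):

total_steps, warmup_steps, decay_ratio = 95_366, 100, 1.0
decay_steps = int(decay_ratio * total_steps)         # 95,366
if warmup_steps + decay_steps > total_steps:
    decay_steps = total_steps - warmup_steps         # 95,266, as logged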
416
+ [titan] 2025-07-24 18:20:05,520 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
417
+ [titan] 2025-07-24 18:20:05,521 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
418
+ [titan] 2025-07-24 18:20:05,590 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
419
+ [titan] 2025-07-24 18:20:11,953 - root - INFO - ***** Running training *****
420
+ [titan] 2025-07-24 18:20:11,955 - root - INFO -  Training starts at step 1
421
+ [titan] 2025-07-24 18:20:11,956 - root - INFO -  Number of tokens per sequence = 8,192
422
+ [titan] 2025-07-24 18:20:11,956 - root - INFO -  Gradient Accumulation steps = 2
423
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Instantaneous batch size (per device) = 8
424
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
425
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
426
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
427
+ [titan] 2025-07-24 18:20:11,957 - root - INFO -  Number of parameters = 396,695,712 
428
+ [titan] 2025-07-24 18:20:11,957 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
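The reported sizes are mutually consistent. A quick arithmetic check, assuming the 8-way data-parallel mesh (['dp_shard'], [8]) reported elsewhere in these logs:

per_device_bs, grad_accum, dp_ranks = 8, 2, 8   # dp_ranks assumed from the 1-D mesh [8]
seq_len, steps, warmup = 8192, 95_366, 100

global_bs = per_device_bs * grad_accum * dp_ranks
assert global_bs == 128
tokens_per_step = global_bs * seq_len
assert tokens_per_step == 1_048_576
assert steps * tokens_per_step == 99_998_498_816    # ~100B tokens, as logged
assert warmup * tokens_per_step == 104_857_600
# 396,695,712 bf16 params sharded 8 ways is ~0.09 GiB per rank,
# in line with the reported "CUDA memory usage for model: 0.10GiB".
print(round(396_695_712 * 2 / dp_ranks / 2**30, 2))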
429
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
430
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
431
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
432
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
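This UserWarning only signals a graph break; the crash below is a separate error. Of the two workarounds the warning suggests, the lighter one is torch.compiler.allow_in_graph. A generic sketch of that route (an illustration, not a claim about what flame or fla actually do):

import torch

# Let Dynamo place this wrapper in the graph without introspecting the
# underlying C-extension builtin it calls.
def device_major(index: int = 0) -> int:
    return torch.cuda.get_device_properties(index).major

device_major = torch.compiler.allow_in_graph(device_major)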
433
+ [rank3]: Traceback (most recent call last):
434
+ [rank3]: File "<frozen runpy>", line 198, in _run_module_as_main
435
+ [rank3]: File "<frozen runpy>", line 88, in _run_code
436
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
437
+ [rank3]: main(config)
438
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
439
+ [rank3]: return f(*args, **kwargs)
440
+ [rank3]: ^^^^^^^^^^^^^^^^^^
441
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
442
+ [rank3]: output = model(
443
+ [rank3]: ^^^^^^
444
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
445
+ [rank3]: return self._call_impl(*args, **kwargs)
446
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
447
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
448
+ [rank3]: return inner()
449
+ [rank3]: ^^^^^^^
450
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
451
+ [rank3]: result = forward_call(*args, **kwargs)
452
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
453
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
454
+ [rank3]: return func(*args, **kwargs)
455
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^
456
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
457
+ [rank3]: outputs = self.model(
458
+ [rank3]: ^^^^^^^^^^^
459
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
460
+ [rank3]: return self._call_impl(*args, **kwargs)
461
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
462
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
463
+ [rank3]: return forward_call(*args, **kwargs)
464
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
465
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
466
+ [rank3]: hidden_states, attentions, past_key_values = layer(
467
+ [rank3]: ^^^^^^
468
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
469
+ [rank3]: return self._call_impl(*args, **kwargs)
470
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
471
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
472
+ [rank3]: return inner()
473
+ [rank3]: ^^^^^^^
474
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
475
+ [rank3]: result = forward_call(*args, **kwargs)
476
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
477
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
478
+ [rank3]: return fn(*args, **kwargs)
479
+ [rank3]: ^^^^^^^^^^^^^^^^^^^
480
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
481
+ [rank3]: return self._call_impl(*args, **kwargs)
482
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
483
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
484
+ [rank3]: return forward_call(*args, **kwargs)
485
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
486
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
487
+ [rank3]: hidden_states = self.attn_norm(hidden_states)
488
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
489
+ [rank3]: hidden_states, attentions, past_key_values = self.attn(
490
+ [rank3]: ^^^^^^^^^^
491
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
492
+ [rank3]: return self._call_impl(*args, **kwargs)
493
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
494
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
495
+ [rank3]: return forward_call(*args, **kwargs)
496
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
497
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
498
+ [rank3]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
499
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
500
+ [rank3]: o = parallel_nsa(
501
+ [rank3]: ^^^^^^^^^^^^^
502
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
503
+ [rank3]: o_cmp, lse_cmp = parallel_nsa_compression(
504
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
505
+ [rank3]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
506
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
507
+ [rank3]: return self._torchdynamo_orig_callable(
508
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
509
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
510
+ [rank3]: result = self._inner_convert(
511
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^
512
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
513
+ [rank3]: return _compile(
514
+ [rank3]: ^^^^^^^^^
515
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
516
+ [rank3]: raise InternalTorchDynamoError(
517
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
518
+ [rank3]: guarded_code = compile_inner(code, one_graph, hooks, transform)
519
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
520
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
521
+ [rank3]: return function(*args, **kwargs)
522
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^
523
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
524
+ [rank3]: return _compile_inner(code, one_graph, hooks, transform)
525
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
526
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
527
+ [rank3]: out_code = transform_code_object(code, transform)
528
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
529
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
530
+ [rank3]: transformations(instructions, code_options)
531
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
532
+ [rank3]: return fn(*args, **kwargs)
533
+ [rank3]: ^^^^^^^^^^^^^^^^^^^
534
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
535
+ [rank3]: tracer.run()
536
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
537
+ [rank3]: super().run()
538
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
539
+ [rank3]: while self.step():
540
+ [rank3]: ^^^^^^^^^^^
541
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
542
+ [rank3]: self.dispatch_table[inst.opcode](self, inst)
543
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
544
+ [rank3]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
545
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
546
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
547
+ [rank3]: return handler(tx, args, kwargs)
548
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^
549
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
550
+ [rank3]: return lambda tx, args, kwargs: obj.call_function(
551
+ [rank3]: ^^^^^^^^^^^^^^^^^^
552
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
553
+ [rank3]: return handler(tx, args, kwargs)
554
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^^^^^^
555
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
556
+ [rank3]: rv = fn(tx, args, kwargs)
557
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^
558
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
559
+ [rank3]: rv = fn(tx, *args)
560
+ [rank3]: ^^^^^^^^^^^^^
561
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
562
+ [rank3]: return ConstantVariable(op(a.value, b.value))
563
+ [rank3]: ^^^^^^^^^^^^^^^^^^^^
564
+ [rank3]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
565
+
566
+ [rank3]: from user code:
567
+ [rank3]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
568
+ [rank3]: if window_size > 0:
569
+
570
+ [rank3]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
571
+
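The last user-code frame pins down the root cause: this run's NSA config sets "window_size": null (see the GatedDeltaNetConfig dump in the rank-4 stderr below), so parallel_nsa receives window_size=None and `if window_size > 0` compares None with an int. A minimal sketch of the kind of None-guard that would avoid the crash; the helper name is hypothetical, not the library's actual fix:

# Treat window_size=None (sliding-window branch disabled) like 0 instead
# of comparing None against an int, as the failing branch in
# fla/ops/nsa/parallel.py effectively does.
def sliding_window_enabled(window_size) -> bool:
    return window_size is not None and window_size > 0

assert not sliding_window_enabled(None)   # the crashing case in this run
assert not sliding_window_enabled(0)
assert sliding_window_enabled(512)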
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/3/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/4/error.json ADDED
@@ -0,0 +1 @@
 
 
1
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, 
block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File 
\"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352474"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/4/stderr.log ADDED
@@ -0,0 +1,571 @@
 
 
1
+ [titan] 2025-07-24 18:19:07,134 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-24 18:19:07,134 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "bfloat16",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
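From the lr_scheduler block: 100 warmup steps, then a linear decay stretched across the whole run (decay_ratio = 1.0, trimmed by the warmup adjustment warned about in the rank-3 log). Whether lr_min = 0.01 is an absolute learning rate or a fraction of the peak is not stated in this dump; the sketch below assumes a fraction:

def lr_at(step, peak=3e-4, warmup=100, total=95_366, min_ratio=0.01):
    # Hedged sketch of the implied schedule, not the trainer's code.
    if step < warmup:
        return peak * (step + 1) / warmup             # linear warmup
    decay = total - warmup                            # 95,266 after adjustment
    frac = min((step - warmup) / decay, 1.0)
    return peak * (1.0 - (1.0 - min_ratio) * frac)    # decays to peak * min_ratio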
132
+ [titan] 2025-07-24 18:19:07,135 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-24 18:19:07,697 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-24 18:19:07,700 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-24 18:19:07,780 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-24 18:19:07,780 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-24 18:19:07,780 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-24 18:19:07,847 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-24 18:19:07,960 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-24 18:19:07,960 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ [titan] 2025-07-24 18:19:08,077 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
147
+ IterableDataset({
148
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
149
+ num_shards: 140
150
+ })
151
+ [titan] 2025-07-24 18:19:08,077 - root - INFO - Shuffling the dataset with seed 42
152
+ [titan] 2025-07-24 18:19:08,078 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
153
+ [titan] 2025-07-24 18:19:59,309 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
154
+ IterableDataset({
155
+ features: ['repo', 'content'],
156
+ num_shards: 1
157
+ })
158
+ [titan] 2025-07-24 18:19:59,309 - root - INFO - Shuffling the dataset with seed 42
159
+ [titan] 2025-07-24 18:19:59,309 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
160
+ [titan] 2025-07-24 18:19:59,651 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
161
+ IterableDataset({
162
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
163
+ num_shards: 100
164
+ })
165
+ [titan] 2025-07-24 18:19:59,652 - root - INFO - Shuffling the dataset with seed 42
166
+ [titan] 2025-07-24 18:19:59,652 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
167
+ [titan] 2025-07-24 18:20:05,814 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
168
+ [titan] 2025-07-24 18:20:06,503 - root - INFO - IterableDataset({
169
+ features: ['text', 'content'],
170
+ num_shards: 256
171
+ })
172
+ [titan] 2025-07-24 18:20:06,623 - root - INFO - Building dataloader...
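The resharding threshold above is 8 data-parallel ranks x 32 dataloader workers = 256 shards, which is why each subset is resharded to 256. The interleaving step matches the Hugging Face datasets API; a sketch assuming the three subsets load as streaming datasets from these local paths:

from datasets import interleave_datasets, load_dataset

paths = [
    "dataset/fineweb-edu-sample",             # p = 0.55
    "dataset/small_repos_20B_sample_merged",  # p = 0.30
    "dataset/megamath-web-pro",               # p = 0.15
]
subsets = [
    load_dataset(p, split="train", streaming=True).shuffle(seed=42)
    for p in paths
]
mixed = interleave_datasets(subsets, probabilities=[0.55, 0.30, 0.15], seed=42)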
173
+ [titan] 2025-07-24 18:20:06,625 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
174
+ [titan] 2025-07-24 18:20:06,627 - root - INFO - Building model from the config
175
+ GatedDeltaNetConfig {
176
+ "allow_neg_eigval": false,
177
+ "architectures": [
178
+ "GatedDeltaNetForCausalLM"
179
+ ],
180
+ "attn": {
181
+ "block_counts": 16,
182
+ "block_size": 64,
183
+ "layers": [
184
+ 5,
185
+ 11,
186
+ 17,
187
+ 23
188
+ ],
189
+ "num_heads": 32,
190
+ "num_kv_heads": 2,
191
+ "qkv_bias": false,
192
+ "rope_theta": 160000.0,
193
+ "type": "nsa",
194
+ "window_size": null
195
+ },
196
+ "attn_mode": "chunk",
197
+ "bos_token_id": 1,
198
+ "conv_size": 4,
199
+ "eos_token_id": 2,
200
+ "expand_k": 1,
201
+ "expand_v": 1,
202
+ "fuse_cross_entropy": true,
203
+ "fuse_norm": true,
204
+ "fuse_swiglu": true,
205
+ "head_dim": 256,
206
+ "hidden_act": "swish",
207
+ "hidden_ratio": 4,
208
+ "hidden_size": 1024,
209
+ "initializer_range": 0.02,
210
+ "intermediate_size": null,
211
+ "max_position_embeddings": 8192,
212
+ "model_type": "gated_deltanet",
213
+ "norm_eps": 1e-06,
214
+ "norm_first": false,
215
+ "num_heads": 4,
216
+ "num_hidden_layers": 24,
217
+ "num_v_heads": null,
218
+ "qk_activation": "silu",
219
+ "qk_norm": "l2",
220
+ "tie_word_embeddings": false,
221
+ "torch_dtype": "bfloat16",
222
+ "transformers_version": "4.53.3",
223
+ "use_beta": true,
224
+ "use_cache": true,
225
+ "use_gate": true,
226
+ "use_l2warp": false,
227
+ "use_output_norm": true,
228
+ "use_short_conv": true,
229
+ "vocab_size": 32000
230
+ }
231
+ 
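The attn sub-config above encodes the hybrid layout that the module printout below confirms: 24 blocks, with NativeSparseAttention at layers 5, 11, 17 and 23 and GatedDeltaNet everywhere else. With block_size = 64 and block_counts = 16, each NSA query selects 16 blocks, i.e. up to 1,024 tokens (standard NSA semantics, assumed here rather than stated in the dump):

nsa_layers = {5, 11, 17, 23}
layout = ["nsa" if i in nsa_layers else "gated_deltanet" for i in range(24)]
assert layout.count("nsa") == 4
assert layout.count("gated_deltanet") == 20
assert 16 * 64 == 1024    # block_counts x block_size for the selection branch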
232
+ [titan] 2025-07-24 18:20:06,946 - root - INFO - 
233
+ GatedDeltaNetForCausalLM(
234
+ (model): GatedDeltaNetModel(
235
+ (embeddings): Embedding(32000, 1024)
236
+ (layers): ModuleList(
237
+ (0-4): 5 x GatedDeltaNetBlock(
238
+ (attn_norm): RMSNorm(1024, eps=1e-06)
239
+ (attn): GatedDeltaNet(
240
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
241
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
242
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
243
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
244
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
245
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
246
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
247
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
248
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
249
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
250
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
251
+ )
252
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
253
+ (mlp): GatedMLP(
254
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
255
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
256
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
257
+ (swiglu_linear): SwiGLULinear()
258
+ )
259
+ )
260
+ (5): GatedDeltaNetBlock(
261
+ (attn_norm): RMSNorm(1024, eps=1e-06)
262
+ (attn): NativeSparseAttention(
263
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
264
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
265
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
266
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
267
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
268
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
269
+ )
270
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
271
+ (mlp): GatedMLP(
272
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
273
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
274
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
275
+ (swiglu_linear): SwiGLULinear()
276
+ )
277
+ )
278
+ (6-10): 5 x GatedDeltaNetBlock(
279
+ (attn_norm): RMSNorm(1024, eps=1e-06)
280
+ (attn): GatedDeltaNet(
281
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
282
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
283
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
284
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
285
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
286
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
287
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
288
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
289
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
290
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
291
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
292
+ )
293
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
294
+ (mlp): GatedMLP(
295
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
296
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
297
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
298
+ (swiglu_linear): SwiGLULinear()
299
+ )
300
+ )
301
+ (11): GatedDeltaNetBlock(
302
+ (attn_norm): RMSNorm(1024, eps=1e-06)
303
+ (attn): NativeSparseAttention(
304
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
305
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
306
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
307
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
308
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
309
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
310
+ )
311
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
312
+ (mlp): GatedMLP(
313
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
314
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
315
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
316
+ (swiglu_linear): SwiGLULinear()
317
+ )
318
+ )
319
+ (12-16): 5 x GatedDeltaNetBlock(
320
+ (attn_norm): RMSNorm(1024, eps=1e-06)
321
+ (attn): GatedDeltaNet(
322
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
323
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
324
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
325
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
326
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
327
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
328
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
329
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
330
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
331
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
332
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
333
+ )
334
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
335
+ (mlp): GatedMLP(
336
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
337
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
338
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
339
+ (swiglu_linear): SwiGLULinear()
340
+ )
341
+ )
342
+ (17): GatedDeltaNetBlock(
343
+ (attn_norm): RMSNorm(1024, eps=1e-06)
344
+ (attn): NativeSparseAttention(
345
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
346
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
347
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
348
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
349
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
350
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
351
+ )
352
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
353
+ (mlp): GatedMLP(
354
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
355
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
356
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
357
+ (swiglu_linear): SwiGLULinear()
358
+ )
359
+ )
360
+ (18-22): 5 x GatedDeltaNetBlock(
361
+ (attn_norm): RMSNorm(1024, eps=1e-06)
362
+ (attn): GatedDeltaNet(
363
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
364
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
365
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
366
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
367
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
368
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
369
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
370
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
371
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
372
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
373
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
374
+ )
375
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
376
+ (mlp): GatedMLP(
377
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
378
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
379
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
380
+ (swiglu_linear): SwiGLULinear()
381
+ )
382
+ )
383
+ (23): GatedDeltaNetBlock(
384
+ (attn_norm): RMSNorm(1024, eps=1e-06)
385
+ (attn): NativeSparseAttention(
386
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
387
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
388
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
389
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
390
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
391
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
392
+ )
393
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
394
+ (mlp): GatedMLP(
395
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
396
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
397
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
398
+ (swiglu_linear): SwiGLULinear()
399
+ )
400
+ )
401
+ )
402
+ (norm): RMSNorm(1024, eps=1e-06)
403
+ )
404
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
405
+ (criterion): FusedLinearCrossEntropyLoss()
406
+ )
407
+
408
+ [titan] 2025-07-24 18:20:06,981 - root - INFO - Compiling each block with torch.compile
409
+ [titan] 2025-07-24 18:20:06,981 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
410
+ [titan] 2025-07-24 18:20:06,981 - root - INFO - Compiling the entire model with torch.compile
411
+ [titan] 2025-07-24 18:20:07,063 - root - INFO - Applied FSDP to the model
412
+ [titan] 2025-07-24 18:20:07,129 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
413
+ [titan] 2025-07-24 18:20:07,129 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
414
+ [titan] 2025-07-24 18:20:07,218 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
415
+ [titan] 2025-07-24 18:20:07,220 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
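+ The adjustment in the warning above follows from decay_ratio=1.0: the decay phase is sized to span all 95,366 steps, which together with the 100 warmup steps overshoots the total, so the scheduler clips it. A minimal sketch of the implied clamping, assuming this is how flame resolves the conflict (its actual code may differ):
+
+ # Sketch of the decay-step clamping implied by the warning above
+ # (assumed behavior, not necessarily flame's exact implementation).
+ steps, warmup_steps, decay_ratio = 95_366, 100, 1.0
+ decay_steps = int(steps * decay_ratio)   # 95,366 with decay_ratio=1.0
+ if warmup_steps + decay_steps > steps:
+     decay_steps = steps - warmup_steps   # clipped to 95,266
+ assert decay_steps == 95_266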
416
+ [titan] 2025-07-24 18:20:07,243 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
417
+ [titan] 2025-07-24 18:20:07,243 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
418
+ [titan] 2025-07-24 18:20:07,298 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
419
+ [titan] 2025-07-24 18:20:13,800 - root - INFO - ***** Running training *****
420
+ [titan] 2025-07-24 18:20:13,803 - root - INFO -  Training starts at step 1
421
+ [titan] 2025-07-24 18:20:13,803 - root - INFO -  Number of tokens per sequence = 8,192
422
+ [titan] 2025-07-24 18:20:13,803 - root - INFO -  Gradient Accumulation steps = 2
423
+ [titan] 2025-07-24 18:20:13,803 - root - INFO -  Instantaneous batch size (per device) = 8
424
+ [titan] 2025-07-24 18:20:13,803 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
425
+ [titan] 2025-07-24 18:20:13,803 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
426
+ [titan] 2025-07-24 18:20:13,804 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
427
+ [titan] 2025-07-24 18:20:13,804 - root - INFO -  Number of parameters = 396,695,712 
428
+ [titan] 2025-07-24 18:20:13,804 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
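+ The batch and token figures in the summary above are mutually consistent. A quick check of the arithmetic, assuming 8 data-parallel ranks (an assumption taken from the 1-D ['dp_shard'] device mesh of size 8 reported at startup):
+
+ # Sanity check of the logged batch/token counts (sketch; dp_ranks=8 is an
+ # assumption based on the single-node, 8-GPU 'dp_shard' mesh in the log).
+ per_device_bs, grad_accum, dp_ranks, seq_len = 8, 2, 8, 8192
+ global_bs = per_device_bs * grad_accum * dp_ranks     # 128 sequences
+ tokens_per_step = global_bs * seq_len                 # 1,048,576 tokens
+ assert tokens_per_step == 1_048_576
+ assert tokens_per_step * 95_366 == 99_998_498_816     # total training tokens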
429
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
430
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
431
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
432
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
433
+ [rank4]: Traceback (most recent call last):
434
+ [rank4]: File "<frozen runpy>", line 198, in _run_module_as_main
435
+ [rank4]: File "<frozen runpy>", line 88, in _run_code
436
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
437
+ [rank4]: main(config)
438
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
439
+ [rank4]: return f(*args, **kwargs)
440
+ [rank4]: ^^^^^^^^^^^^^^^^^^
441
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
442
+ [rank4]: output = model(
443
+ [rank4]: ^^^^^^
444
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
445
+ [rank4]: return self._call_impl(*args, **kwargs)
446
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
447
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
448
+ [rank4]: return inner()
449
+ [rank4]: ^^^^^^^
450
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
451
+ [rank4]: result = forward_call(*args, **kwargs)
452
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
453
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
454
+ [rank4]: return func(*args, **kwargs)
455
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^
456
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
457
+ [rank4]: outputs = self.model(
458
+ [rank4]: ^^^^^^^^^^^
459
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
460
+ [rank4]: return self._call_impl(*args, **kwargs)
461
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
462
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
463
+ [rank4]: return forward_call(*args, **kwargs)
464
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
465
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
466
+ [rank4]: hidden_states, attentions, past_key_values = layer(
467
+ [rank4]: ^^^^^^
468
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
469
+ [rank4]: return self._call_impl(*args, **kwargs)
470
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
471
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
472
+ [rank4]: return inner()
473
+ [rank4]: ^^^^^^^
474
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
475
+ [rank4]: result = forward_call(*args, **kwargs)
476
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
477
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
478
+ [rank4]: return fn(*args, **kwargs)
479
+ [rank4]: ^^^^^^^^^^^^^^^^^^^
480
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
481
+ [rank4]: return self._call_impl(*args, **kwargs)
482
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
483
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
484
+ [rank4]: return forward_call(*args, **kwargs)
485
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
486
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
487
+ [rank4]: hidden_states = self.attn_norm(hidden_states)
488
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
489
+ [rank4]: hidden_states, attentions, past_key_values = self.attn(
490
+ [rank4]: ^^^^^^^^^^
491
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
492
+ [rank4]: return self._call_impl(*args, **kwargs)
493
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
494
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
495
+ [rank4]: return forward_call(*args, **kwargs)
496
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
497
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
498
+ [rank4]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
499
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
500
+ [rank4]: o = parallel_nsa(
501
+ [rank4]: ^^^^^^^^^^^^^
502
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
503
+ [rank4]: o_cmp, lse_cmp = parallel_nsa_compression(
504
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
505
+ [rank4]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
506
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
507
+ [rank4]: return self._torchdynamo_orig_callable(
508
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
509
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
510
+ [rank4]: result = self._inner_convert(
511
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^
512
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
513
+ [rank4]: return _compile(
514
+ [rank4]: ^^^^^^^^^
515
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
516
+ [rank4]: raise InternalTorchDynamoError(
517
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
518
+ [rank4]: guarded_code = compile_inner(code, one_graph, hooks, transform)
519
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
520
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
521
+ [rank4]: return function(*args, **kwargs)
522
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^
523
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
524
+ [rank4]: return _compile_inner(code, one_graph, hooks, transform)
525
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
526
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
527
+ [rank4]: out_code = transform_code_object(code, transform)
528
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
529
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
530
+ [rank4]: transformations(instructions, code_options)
531
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
532
+ [rank4]: return fn(*args, **kwargs)
533
+ [rank4]: ^^^^^^^^^^^^^^^^^^^
534
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
535
+ [rank4]: tracer.run()
536
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
537
+ [rank4]: super().run()
538
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
539
+ [rank4]: while self.step():
540
+ [rank4]: ^^^^^^^^^^^
541
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
542
+ [rank4]: self.dispatch_table[inst.opcode](self, inst)
543
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
544
+ [rank4]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
545
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
546
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
547
+ [rank4]: return handler(tx, args, kwargs)
548
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^
549
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
550
+ [rank4]: return lambda tx, args, kwargs: obj.call_function(
551
+ [rank4]: ^^^^^^^^^^^^^^^^^^
552
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
553
+ [rank4]: return handler(tx, args, kwargs)
554
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^
555
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
556
+ [rank4]: rv = fn(tx, args, kwargs)
557
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^
558
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
559
+ [rank4]: rv = fn(tx, *args)
560
+ [rank4]: ^^^^^^^^^^^^^
561
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
562
+ [rank4]: return ConstantVariable(op(a.value, b.value))
563
+ [rank4]: ^^^^^^^^^^^^^^^^^^^^
564
+ [rank4]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
565
+
566
+ [rank4]: from user code:
567
+ [rank4]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
568
+ [rank4]: if window_size > 0:
569
+
570
+ [rank4]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
571
+
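The traceback pins the failure to the sliding-window check in fla/ops/nsa/parallel.py: the model config serializes "window_size": null, so the op receives window_size=None, and None > 0 raises TypeError in Python 3, which Dynamo surfaces as InternalTorchDynamoError. A None-guard of the following shape would avoid the crash; this is a sketch of the idea, not the library's actual patch:

from typing import Optional

def use_sliding_window(window_size: Optional[int]) -> bool:
    # "window_size": null in the JSON config arrives here as None;
    # compare against None before comparing to 0 to avoid the TypeError above.
    return window_size is not None and window_size > 0

assert use_sliding_window(None) is False   # the failing NSA configuration
assert use_sliding_window(0) is False
assert use_sliding_window(512) is True

Given the guard shown in the traceback, setting "window_size": 0 in the attn config would also make the comparison well-typed.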
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/4/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/5/error.json ADDED
@@ -0,0 +1 @@
1
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352475"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/5/stderr.log ADDED
@@ -0,0 +1,571 @@
1
+ [titan] 2025-07-24 18:19:06,977 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-24 18:19:06,977 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "bfloat16",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-24 18:19:06,978 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-24 18:19:07,669 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-24 18:19:07,672 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-24 18:19:07,731 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-24 18:19:07,731 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-24 18:19:07,731 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-24 18:19:07,745 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-24 18:19:07,847 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-24 18:19:07,847 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ [titan] 2025-07-24 18:19:07,962 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
147
+ IterableDataset({
148
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
149
+ num_shards: 140
150
+ })
151
+ [titan] 2025-07-24 18:19:07,962 - root - INFO - Shuffling the dataset with seed 42
152
+ [titan] 2025-07-24 18:19:07,962 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
153
+ [titan] 2025-07-24 18:20:11,070 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
154
+ IterableDataset({
155
+ features: ['repo', 'content'],
156
+ num_shards: 1
157
+ })
158
+ [titan] 2025-07-24 18:20:11,070 - root - INFO - Shuffling the dataset with seed 42
159
+ [titan] 2025-07-24 18:20:11,070 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
160
+ [titan] 2025-07-24 18:20:11,445 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
161
+ IterableDataset({
162
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
163
+ num_shards: 100
164
+ })
165
+ [titan] 2025-07-24 18:20:11,445 - root - INFO - Shuffling the dataset with seed 42
166
+ [titan] 2025-07-24 18:20:11,445 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
167
+ [titan] 2025-07-24 18:20:18,139 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
168
+ [titan] 2025-07-24 18:20:18,885 - root - INFO - IterableDataset({
169
+ features: ['text', 'content'],
170
+ num_shards: 256
171
+ })
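+ The three subsets are mixed by sampling with the configured data_probs, yielding the combined IterableDataset above. A minimal sketch of that step with the Hugging Face datasets API (whose IterableDataset repr the log is printing); the load_dataset calls and paths here are illustrative, not flame's actual loading code:
+
+ from datasets import load_dataset, interleave_datasets
+
+ # Illustrative: open the three corpora as streaming datasets.
+ subsets = [
+     load_dataset(path, split="train", streaming=True)
+     for path in ("fineweb-edu-sample", "small_repos_20B_sample_merged", "megamath-web-pro")
+ ]
+ # Draw from them with p = 0.55 / 0.30 / 0.15, seeded to match the run.
+ mixed = interleave_datasets(subsets, probabilities=[0.55, 0.3, 0.15], seed=42)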
172
+ [titan] 2025-07-24 18:20:19,030 - root - INFO - Building dataloader...
173
+ [titan] 2025-07-24 18:20:19,033 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
174
+ [titan] 2025-07-24 18:20:19,035 - root - INFO - Building model from the config
175
+ GatedDeltaNetConfig {
176
+ "allow_neg_eigval": false,
177
+ "architectures": [
178
+ "GatedDeltaNetForCausalLM"
179
+ ],
180
+ "attn": {
181
+ "block_counts": 16,
182
+ "block_size": 64,
183
+ "layers": [
184
+ 5,
185
+ 11,
186
+ 17,
187
+ 23
188
+ ],
189
+ "num_heads": 32,
190
+ "num_kv_heads": 2,
191
+ "qkv_bias": false,
192
+ "rope_theta": 160000.0,
193
+ "type": "nsa",
194
+ "window_size": null
195
+ },
196
+ "attn_mode": "chunk",
197
+ "bos_token_id": 1,
198
+ "conv_size": 4,
199
+ "eos_token_id": 2,
200
+ "expand_k": 1,
201
+ "expand_v": 1,
202
+ "fuse_cross_entropy": true,
203
+ "fuse_norm": true,
204
+ "fuse_swiglu": true,
205
+ "head_dim": 256,
206
+ "hidden_act": "swish",
207
+ "hidden_ratio": 4,
208
+ "hidden_size": 1024,
209
+ "initializer_range": 0.02,
210
+ "intermediate_size": null,
211
+ "max_position_embeddings": 8192,
212
+ "model_type": "gated_deltanet",
213
+ "norm_eps": 1e-06,
214
+ "norm_first": false,
215
+ "num_heads": 4,
216
+ "num_hidden_layers": 24,
217
+ "num_v_heads": null,
218
+ "qk_activation": "silu",
219
+ "qk_norm": "l2",
220
+ "tie_word_embeddings": false,
221
+ "torch_dtype": "bfloat16",
222
+ "transformers_version": "4.53.3",
223
+ "use_beta": true,
224
+ "use_cache": true,
225
+ "use_gate": true,
226
+ "use_l2warp": false,
227
+ "use_output_norm": true,
228
+ "use_short_conv": true,
229
+ "vocab_size": 32000
230
+ }
231
+ 
232
+ [titan] 2025-07-24 18:20:19,380 - root - INFO - 
233
+ GatedDeltaNetForCausalLM(
234
+ (model): GatedDeltaNetModel(
235
+ (embeddings): Embedding(32000, 1024)
236
+ (layers): ModuleList(
237
+ (0-4): 5 x GatedDeltaNetBlock(
238
+ (attn_norm): RMSNorm(1024, eps=1e-06)
239
+ (attn): GatedDeltaNet(
240
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
241
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
242
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
243
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
244
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
245
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
246
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
247
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
248
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
249
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
250
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
251
+ )
252
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
253
+ (mlp): GatedMLP(
254
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
255
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
256
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
257
+ (swiglu_linear): SwiGLULinear()
258
+ )
259
+ )
260
+ (5): GatedDeltaNetBlock(
261
+ (attn_norm): RMSNorm(1024, eps=1e-06)
262
+ (attn): NativeSparseAttention(
263
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
264
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
265
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
266
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
267
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
268
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
269
+ )
270
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
271
+ (mlp): GatedMLP(
272
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
273
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
274
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
275
+ (swiglu_linear): SwiGLULinear()
276
+ )
277
+ )
278
+ (6-10): 5 x GatedDeltaNetBlock(
279
+ (attn_norm): RMSNorm(1024, eps=1e-06)
280
+ (attn): GatedDeltaNet(
281
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
282
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
283
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
284
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
285
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
286
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
287
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
288
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
289
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
290
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
291
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
292
+ )
293
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
294
+ (mlp): GatedMLP(
295
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
296
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
297
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
298
+ (swiglu_linear): SwiGLULinear()
299
+ )
300
+ )
301
+ (11): GatedDeltaNetBlock(
302
+ (attn_norm): RMSNorm(1024, eps=1e-06)
303
+ (attn): NativeSparseAttention(
304
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
305
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
306
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
307
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
308
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
309
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
310
+ )
311
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
312
+ (mlp): GatedMLP(
313
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
314
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
315
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
316
+ (swiglu_linear): SwiGLULinear()
317
+ )
318
+ )
319
+ (12-16): 5 x GatedDeltaNetBlock(
320
+ (attn_norm): RMSNorm(1024, eps=1e-06)
321
+ (attn): GatedDeltaNet(
322
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
323
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
324
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
325
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
326
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
327
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
328
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
329
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
330
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
331
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
332
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
333
+ )
334
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
335
+ (mlp): GatedMLP(
336
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
337
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
338
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
339
+ (swiglu_linear): SwiGLULinear()
340
+ )
341
+ )
342
+ (17): GatedDeltaNetBlock(
343
+ (attn_norm): RMSNorm(1024, eps=1e-06)
344
+ (attn): NativeSparseAttention(
345
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
346
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
347
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
348
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
349
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
350
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
351
+ )
352
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
353
+ (mlp): GatedMLP(
354
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
355
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
356
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
357
+ (swiglu_linear): SwiGLULinear()
358
+ )
359
+ )
360
+ (18-22): 5 x GatedDeltaNetBlock(
361
+ (attn_norm): RMSNorm(1024, eps=1e-06)
362
+ (attn): GatedDeltaNet(
363
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
364
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
365
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
366
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
367
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
368
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
369
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
370
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
371
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
372
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
373
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
374
+ )
375
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
376
+ (mlp): GatedMLP(
377
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
378
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
379
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
380
+ (swiglu_linear): SwiGLULinear()
381
+ )
382
+ )
383
+ (23): GatedDeltaNetBlock(
384
+ (attn_norm): RMSNorm(1024, eps=1e-06)
385
+ (attn): NativeSparseAttention(
386
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
387
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
388
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
389
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
390
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
391
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
392
+ )
393
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
394
+ (mlp): GatedMLP(
395
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
396
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
397
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
398
+ (swiglu_linear): SwiGLULinear()
399
+ )
400
+ )
401
+ )
402
+ (norm): RMSNorm(1024, eps=1e-06)
403
+ )
404
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
405
+ (criterion): FusedLinearCrossEntropyLoss()
406
+ )
407
+
408
+ [titan] 2025-07-24 18:20:19,418 - root - INFO - Compiling each block with torch.compile
409
+ [titan] 2025-07-24 18:20:19,418 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
410
+ [titan] 2025-07-24 18:20:19,419 - root - INFO - Compiling the entire model with torch.compile
411
+ [titan] 2025-07-24 18:20:19,503 - root - INFO - Applied FSDP to the model
412
+ [titan] 2025-07-24 18:20:19,588 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
413
+ [titan] 2025-07-24 18:20:19,588 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
414
+ [titan] 2025-07-24 18:20:19,684 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
415
+ [titan] 2025-07-24 18:20:19,686 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
416
+ [titan] 2025-07-24 18:20:19,829 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
417
+ [titan] 2025-07-24 18:20:19,830 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
418
+ [titan] 2025-07-24 18:20:19,888 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
419
+ [titan] 2025-07-24 18:20:28,483 - root - INFO - ***** Running training *****
420
+ [titan] 2025-07-24 18:20:28,485 - root - INFO -  Training starts at step 1
421
+ [titan] 2025-07-24 18:20:28,486 - root - INFO -  Number of tokens per sequence = 8,192
422
+ [titan] 2025-07-24 18:20:28,486 - root - INFO -  Gradient Accumulation steps = 2
423
+ [titan] 2025-07-24 18:20:28,487 - root - INFO -  Instantaneous batch size (per device) = 8
424
+ [titan] 2025-07-24 18:20:28,487 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
425
+ [titan] 2025-07-24 18:20:28,487 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
426
+ [titan] 2025-07-24 18:20:28,487 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
427
+ [titan] 2025-07-24 18:20:28,487 - root - INFO -  Number of parameters = 396,695,712 
428
+ [titan] 2025-07-24 18:20:28,487 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
429
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
430
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
431
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
432
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
433
+ [rank5]: Traceback (most recent call last):
434
+ [rank5]: File "<frozen runpy>", line 198, in _run_module_as_main
435
+ [rank5]: File "<frozen runpy>", line 88, in _run_code
436
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
437
+ [rank5]: main(config)
438
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
439
+ [rank5]: return f(*args, **kwargs)
440
+ [rank5]: ^^^^^^^^^^^^^^^^^^
441
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
442
+ [rank5]: output = model(
443
+ [rank5]: ^^^^^^
444
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
445
+ [rank5]: return self._call_impl(*args, **kwargs)
446
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
447
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
448
+ [rank5]: return inner()
449
+ [rank5]: ^^^^^^^
450
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
451
+ [rank5]: result = forward_call(*args, **kwargs)
452
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
453
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
454
+ [rank5]: return func(*args, **kwargs)
455
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^
456
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
457
+ [rank5]: outputs = self.model(
458
+ [rank5]: ^^^^^^^^^^^
459
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
460
+ [rank5]: return self._call_impl(*args, **kwargs)
461
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
462
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
463
+ [rank5]: return forward_call(*args, **kwargs)
464
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
465
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
466
+ [rank5]: hidden_states, attentions, past_key_values = layer(
467
+ [rank5]: ^^^^^^
468
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
469
+ [rank5]: return self._call_impl(*args, **kwargs)
470
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
471
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
472
+ [rank5]: return inner()
473
+ [rank5]: ^^^^^^^
474
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
475
+ [rank5]: result = forward_call(*args, **kwargs)
476
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
477
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
478
+ [rank5]: return fn(*args, **kwargs)
479
+ [rank5]: ^^^^^^^^^^^^^^^^^^^
480
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
481
+ [rank5]: return self._call_impl(*args, **kwargs)
482
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
483
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
484
+ [rank5]: return forward_call(*args, **kwargs)
485
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
486
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
487
+ [rank5]: hidden_states = self.attn_norm(hidden_states)
488
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
489
+ [rank5]: hidden_states, attentions, past_key_values = self.attn(
490
+ [rank5]: ^^^^^^^^^^
491
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
492
+ [rank5]: return self._call_impl(*args, **kwargs)
493
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
494
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
495
+ [rank5]: return forward_call(*args, **kwargs)
496
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
497
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
498
+ [rank5]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
499
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
500
+ [rank5]: o = parallel_nsa(
501
+ [rank5]: ^^^^^^^^^^^^^
502
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
503
+ [rank5]: o_cmp, lse_cmp = parallel_nsa_compression(
504
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
505
+ [rank5]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
506
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
507
+ [rank5]: return self._torchdynamo_orig_callable(
508
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
509
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
510
+ [rank5]: result = self._inner_convert(
511
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^
512
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
513
+ [rank5]: return _compile(
514
+ [rank5]: ^^^^^^^^^
515
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
516
+ [rank5]: raise InternalTorchDynamoError(
517
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
518
+ [rank5]: guarded_code = compile_inner(code, one_graph, hooks, transform)
519
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
520
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
521
+ [rank5]: return function(*args, **kwargs)
522
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^
523
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
524
+ [rank5]: return _compile_inner(code, one_graph, hooks, transform)
525
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
526
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
527
+ [rank5]: out_code = transform_code_object(code, transform)
528
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
529
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
530
+ [rank5]: transformations(instructions, code_options)
531
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
532
+ [rank5]: return fn(*args, **kwargs)
533
+ [rank5]: ^^^^^^^^^^^^^^^^^^^
534
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
535
+ [rank5]: tracer.run()
536
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
537
+ [rank5]: super().run()
538
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
539
+ [rank5]: while self.step():
540
+ [rank5]: ^^^^^^^^^^^
541
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
542
+ [rank5]: self.dispatch_table[inst.opcode](self, inst)
543
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
544
+ [rank5]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
545
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
546
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
547
+ [rank5]: return handler(tx, args, kwargs)
548
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^
549
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
550
+ [rank5]: return lambda tx, args, kwargs: obj.call_function(
551
+ [rank5]: ^^^^^^^^^^^^^^^^^^
552
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
553
+ [rank5]: return handler(tx, args, kwargs)
554
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^^^^^^
555
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
556
+ [rank5]: rv = fn(tx, args, kwargs)
557
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^
558
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
559
+ [rank5]: rv = fn(tx, *args)
560
+ [rank5]: ^^^^^^^^^^^^^
561
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
562
+ [rank5]: return ConstantVariable(op(a.value, b.value))
563
+ [rank5]: ^^^^^^^^^^^^^^^^^^^^
564
+ [rank5]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
565
+
566
+ [rank5]: from user code:
567
+ [rank5]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
568
+ [rank5]: if window_size > 0:
569
+
570
+ [rank5]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
571
+
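Both failed ranks crash at the same spot, and the cause is visible in the config printed earlier in this log: the attn block sets "window_size": null, so NativeSparseAttention hands window_size=None down to parallel_nsa, whose plain comparison `if window_size > 0:` (fla/ops/nsa/parallel.py, line 862) raises TypeError: '>' not supported between instances of 'NoneType' and 'int', which Dynamo then surfaces as InternalTorchDynamoError. A minimal sketch of a None-safe guard, assuming the simplest possible patch (the actual upstream fix in flash-linear-attention may normalize window_size elsewhere; `sliding_window_enabled` is a hypothetical helper):

    def sliding_window_enabled(window_size):
        # Hypothetical helper: treat a missing window ("window_size": null in
        # the JSON config, i.e. None here) the same as a disabled one (0).
        return window_size is not None and window_size > 0

    # The failing branch in parallel_nsa would then read
    #     if sliding_window_enabled(window_size):
    #         ...  # sliding-window attention path
    # so a null window simply skips the branch instead of crashing compilation.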
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/5/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/6/error.json ADDED
@@ -0,0 +1 @@
1
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, 
block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File 
\"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352474"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/6/stderr.log ADDED
@@ -0,0 +1,571 @@
1
+ [titan] 2025-07-24 18:19:07,121 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-24 18:19:07,121 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "bfloat16",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
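The lr_scheduler block above (warmup_steps=100, decay_ratio=1.0, decay_type=linear against steps=95366) already predicts the warning this log emits after model setup: a decay ratio of 1.0 schedules the decay over all 95,366 steps, so warmup plus decay overshoots the total by exactly the warmup length, and the trainer clips decay to 95,266 steps. A sketch of that arithmetic (illustrative only; the trainer's actual clipping code may differ):

    steps, warmup, decay_ratio = 95_366, 100, 1.0
    decay = int(steps * decay_ratio)  # 95,366 scheduled decay steps
    if warmup + decay > steps:        # 100 + 95,366 = 95,466 > 95,366
        decay = steps - warmup        # clipped to 95,266, matching the log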
132
+ [titan] 2025-07-24 18:19:07,122 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-24 18:19:07,697 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-24 18:19:07,700 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-24 18:19:07,786 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-24 18:19:07,786 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-24 18:19:07,786 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-24 18:19:07,848 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-24 18:19:07,970 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-24 18:19:07,970 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ [titan] 2025-07-24 18:19:08,091 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
147
+ IterableDataset({
148
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
149
+ num_shards: 140
150
+ })
151
+ [titan] 2025-07-24 18:19:08,091 - root - INFO - Shuffling the dataset with seed 42
152
+ [titan] 2025-07-24 18:19:08,091 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
153
+ [titan] 2025-07-24 18:20:06,911 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
154
+ IterableDataset({
155
+ features: ['repo', 'content'],
156
+ num_shards: 1
157
+ })
158
+ [titan] 2025-07-24 18:20:06,912 - root - INFO - Shuffling the dataset with seed 42
159
+ [titan] 2025-07-24 18:20:06,912 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
160
+ [titan] 2025-07-24 18:20:07,259 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
161
+ IterableDataset({
162
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
163
+ num_shards: 100
164
+ })
165
+ [titan] 2025-07-24 18:20:07,259 - root - INFO - Shuffling the dataset with seed 42
166
+ [titan] 2025-07-24 18:20:07,259 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
167
+ [titan] 2025-07-24 18:20:13,871 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
168
+ [titan] 2025-07-24 18:20:14,713 - root - INFO - IterableDataset({
169
+ features: ['text', 'content'],
170
+ num_shards: 256
171
+ })
172
+ [titan] 2025-07-24 18:20:14,891 - root - INFO - Building dataloader...
173
+ [titan] 2025-07-24 18:20:14,893 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
174
+ [titan] 2025-07-24 18:20:14,895 - root - INFO - Building model from the config
175
+ GatedDeltaNetConfig {
176
+ "allow_neg_eigval": false,
177
+ "architectures": [
178
+ "GatedDeltaNetForCausalLM"
179
+ ],
180
+ "attn": {
181
+ "block_counts": 16,
182
+ "block_size": 64,
183
+ "layers": [
184
+ 5,
185
+ 11,
186
+ 17,
187
+ 23
188
+ ],
189
+ "num_heads": 32,
190
+ "num_kv_heads": 2,
191
+ "qkv_bias": false,
192
+ "rope_theta": 160000.0,
193
+ "type": "nsa",
194
+ "window_size": null
195
+ },
196
+ "attn_mode": "chunk",
197
+ "bos_token_id": 1,
198
+ "conv_size": 4,
199
+ "eos_token_id": 2,
200
+ "expand_k": 1,
201
+ "expand_v": 1,
202
+ "fuse_cross_entropy": true,
203
+ "fuse_norm": true,
204
+ "fuse_swiglu": true,
205
+ "head_dim": 256,
206
+ "hidden_act": "swish",
207
+ "hidden_ratio": 4,
208
+ "hidden_size": 1024,
209
+ "initializer_range": 0.02,
210
+ "intermediate_size": null,
211
+ "max_position_embeddings": 8192,
212
+ "model_type": "gated_deltanet",
213
+ "norm_eps": 1e-06,
214
+ "norm_first": false,
215
+ "num_heads": 4,
216
+ "num_hidden_layers": 24,
217
+ "num_v_heads": null,
218
+ "qk_activation": "silu",
219
+ "qk_norm": "l2",
220
+ "tie_word_embeddings": false,
221
+ "torch_dtype": "bfloat16",
222
+ "transformers_version": "4.53.3",
223
+ "use_beta": true,
224
+ "use_cache": true,
225
+ "use_gate": true,
226
+ "use_l2warp": false,
227
+ "use_output_norm": true,
228
+ "use_short_conv": true,
229
+ "vocab_size": 32000
230
+ }
231
+ 
232
+ [titan] 2025-07-24 18:20:15,243 - root - INFO - 
233
+ GatedDeltaNetForCausalLM(
234
+ (model): GatedDeltaNetModel(
235
+ (embeddings): Embedding(32000, 1024)
236
+ (layers): ModuleList(
237
+ (0-4): 5 x GatedDeltaNetBlock(
238
+ (attn_norm): RMSNorm(1024, eps=1e-06)
239
+ (attn): GatedDeltaNet(
240
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
241
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
242
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
243
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
244
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
245
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
246
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
247
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
248
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
249
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
250
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
251
+ )
252
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
253
+ (mlp): GatedMLP(
254
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
255
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
256
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
257
+ (swiglu_linear): SwiGLULinear()
258
+ )
259
+ )
260
+ (5): GatedDeltaNetBlock(
261
+ (attn_norm): RMSNorm(1024, eps=1e-06)
262
+ (attn): NativeSparseAttention(
263
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
264
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
265
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
266
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
267
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
268
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
269
+ )
270
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
271
+ (mlp): GatedMLP(
272
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
273
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
274
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
275
+ (swiglu_linear): SwiGLULinear()
276
+ )
277
+ )
278
+ (6-10): 5 x GatedDeltaNetBlock(
279
+ (attn_norm): RMSNorm(1024, eps=1e-06)
280
+ (attn): GatedDeltaNet(
281
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
282
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
283
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
284
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
285
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
286
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
287
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
288
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
289
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
290
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
291
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
292
+ )
293
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
294
+ (mlp): GatedMLP(
295
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
296
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
297
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
298
+ (swiglu_linear): SwiGLULinear()
299
+ )
300
+ )
301
+ (11): GatedDeltaNetBlock(
302
+ (attn_norm): RMSNorm(1024, eps=1e-06)
303
+ (attn): NativeSparseAttention(
304
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
305
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
306
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
307
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
308
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
309
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
310
+ )
311
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
312
+ (mlp): GatedMLP(
313
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
314
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
315
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
316
+ (swiglu_linear): SwiGLULinear()
317
+ )
318
+ )
319
+ (12-16): 5 x GatedDeltaNetBlock(
320
+ (attn_norm): RMSNorm(1024, eps=1e-06)
321
+ (attn): GatedDeltaNet(
322
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
323
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
324
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
325
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
326
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
327
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
328
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
329
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
330
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
331
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
332
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
333
+ )
334
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
335
+ (mlp): GatedMLP(
336
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
337
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
338
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
339
+ (swiglu_linear): SwiGLULinear()
340
+ )
341
+ )
342
+ (17): GatedDeltaNetBlock(
343
+ (attn_norm): RMSNorm(1024, eps=1e-06)
344
+ (attn): NativeSparseAttention(
345
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
346
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
347
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
348
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
349
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
350
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
351
+ )
352
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
353
+ (mlp): GatedMLP(
354
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
355
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
356
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
357
+ (swiglu_linear): SwiGLULinear()
358
+ )
359
+ )
360
+ (18-22): 5 x GatedDeltaNetBlock(
361
+ (attn_norm): RMSNorm(1024, eps=1e-06)
362
+ (attn): GatedDeltaNet(
363
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
364
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
365
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
366
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
367
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
368
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
369
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
370
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
371
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
372
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
373
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
374
+ )
375
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
376
+ (mlp): GatedMLP(
377
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
378
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
379
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
380
+ (swiglu_linear): SwiGLULinear()
381
+ )
382
+ )
383
+ (23): GatedDeltaNetBlock(
384
+ (attn_norm): RMSNorm(1024, eps=1e-06)
385
+ (attn): NativeSparseAttention(
386
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
387
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
388
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
389
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
390
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
391
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
392
+ )
393
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
394
+ (mlp): GatedMLP(
395
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
396
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
397
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
398
+ (swiglu_linear): SwiGLULinear()
399
+ )
400
+ )
401
+ )
402
+ (norm): RMSNorm(1024, eps=1e-06)
403
+ )
404
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
405
+ (criterion): FusedLinearCrossEntropyLoss()
406
+ )
407
+
408
+ [titan] 2025-07-24 18:20:15,278 - root - INFO - Compiling each block with torch.compile
409
+ [titan] 2025-07-24 18:20:15,278 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
410
+ [titan] 2025-07-24 18:20:15,279 - root - INFO - Compiling the entire model with torch.compile
411
+ [titan] 2025-07-24 18:20:15,360 - root - INFO - Applied FSDP to the model
412
+ [titan] 2025-07-24 18:20:15,425 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
413
+ [titan] 2025-07-24 18:20:15,426 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
414
+ [titan] 2025-07-24 18:20:15,519 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
415
+ [titan] 2025-07-24 18:20:15,520 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
416
+ [titan] 2025-07-24 18:20:15,544 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
417
+ [titan] 2025-07-24 18:20:15,544 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
418
+ [titan] 2025-07-24 18:20:15,603 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
419
+ [titan] 2025-07-24 18:20:24,033 - root - INFO - ***** Running training *****
420
+ [titan] 2025-07-24 18:20:24,036 - root - INFO -  Training starts at step 1
421
+ [titan] 2025-07-24 18:20:24,036 - root - INFO -  Number of tokens per sequence = 8,192
422
+ [titan] 2025-07-24 18:20:24,037 - root - INFO -  Gradient Accumulation steps = 2
423
+ [titan] 2025-07-24 18:20:24,037 - root - INFO -  Instantaneous batch size (per device) = 8
424
+ [titan] 2025-07-24 18:20:24,037 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
425
+ [titan] 2025-07-24 18:20:24,037 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
426
+ [titan] 2025-07-24 18:20:24,037 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
427
+ [titan] 2025-07-24 18:20:24,037 - root - INFO -  Number of parameters = 396,695,712 
428
+ [titan] 2025-07-24 18:20:24,037 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
429
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
430
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
431
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
432
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
433
+ [rank6]: Traceback (most recent call last):
434
+ [rank6]: File "<frozen runpy>", line 198, in _run_module_as_main
435
+ [rank6]: File "<frozen runpy>", line 88, in _run_code
436
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
437
+ [rank6]: main(config)
438
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
439
+ [rank6]: return f(*args, **kwargs)
440
+ [rank6]: ^^^^^^^^^^^^^^^^^^
441
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
442
+ [rank6]: output = model(
443
+ [rank6]: ^^^^^^
444
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
445
+ [rank6]: return self._call_impl(*args, **kwargs)
446
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
447
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
448
+ [rank6]: return inner()
449
+ [rank6]: ^^^^^^^
450
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
451
+ [rank6]: result = forward_call(*args, **kwargs)
452
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
453
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
454
+ [rank6]: return func(*args, **kwargs)
455
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^
456
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
457
+ [rank6]: outputs = self.model(
458
+ [rank6]: ^^^^^^^^^^^
459
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
460
+ [rank6]: return self._call_impl(*args, **kwargs)
461
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
462
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
463
+ [rank6]: return forward_call(*args, **kwargs)
464
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
465
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
466
+ [rank6]: hidden_states, attentions, past_key_values = layer(
467
+ [rank6]: ^^^^^^
468
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
469
+ [rank6]: return self._call_impl(*args, **kwargs)
470
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
471
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
472
+ [rank6]: return inner()
473
+ [rank6]: ^^^^^^^
474
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
475
+ [rank6]: result = forward_call(*args, **kwargs)
476
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
477
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
478
+ [rank6]: return fn(*args, **kwargs)
479
+ [rank6]: ^^^^^^^^^^^^^^^^^^^
480
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
481
+ [rank6]: return self._call_impl(*args, **kwargs)
482
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
483
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
484
+ [rank6]: return forward_call(*args, **kwargs)
485
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
486
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
487
+ [rank6]: hidden_states = self.attn_norm(hidden_states)
488
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
489
+ [rank6]: hidden_states, attentions, past_key_values = self.attn(
490
+ [rank6]: ^^^^^^^^^^
491
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
492
+ [rank6]: return self._call_impl(*args, **kwargs)
493
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
494
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
495
+ [rank6]: return forward_call(*args, **kwargs)
496
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
497
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
498
+ [rank6]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
499
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
500
+ [rank6]: o = parallel_nsa(
501
+ [rank6]: ^^^^^^^^^^^^^
502
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
503
+ [rank6]: o_cmp, lse_cmp = parallel_nsa_compression(
504
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
505
+ [rank6]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
506
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
507
+ [rank6]: return self._torchdynamo_orig_callable(
508
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
509
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
510
+ [rank6]: result = self._inner_convert(
511
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^
512
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
513
+ [rank6]: return _compile(
514
+ [rank6]: ^^^^^^^^^
515
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
516
+ [rank6]: raise InternalTorchDynamoError(
517
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
518
+ [rank6]: guarded_code = compile_inner(code, one_graph, hooks, transform)
519
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
520
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
521
+ [rank6]: return function(*args, **kwargs)
522
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
523
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
524
+ [rank6]: return _compile_inner(code, one_graph, hooks, transform)
525
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
526
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
527
+ [rank6]: out_code = transform_code_object(code, transform)
528
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
529
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
530
+ [rank6]: transformations(instructions, code_options)
531
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
532
+ [rank6]: return fn(*args, **kwargs)
533
+ [rank6]: ^^^^^^^^^^^^^^^^^^^
534
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
535
+ [rank6]: tracer.run()
536
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
537
+ [rank6]: super().run()
538
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
539
+ [rank6]: while self.step():
540
+ [rank6]: ^^^^^^^^^^^
541
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
542
+ [rank6]: self.dispatch_table[inst.opcode](self, inst)
543
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
544
+ [rank6]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
545
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
546
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
547
+ [rank6]: return handler(tx, args, kwargs)
548
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
549
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
550
+ [rank6]: return lambda tx, args, kwargs: obj.call_function(
551
+ [rank6]: ^^^^^^^^^^^^^^^^^^
552
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
553
+ [rank6]: return handler(tx, args, kwargs)
554
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
555
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
556
+ [rank6]: rv = fn(tx, args, kwargs)
557
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^
558
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
559
+ [rank6]: rv = fn(tx, *args)
560
+ [rank6]: ^^^^^^^^^^^^^
561
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
562
+ [rank6]: return ConstantVariable(op(a.value, b.value))
563
+ [rank6]: ^^^^^^^^^^^^^^^^^^^^
564
+ [rank6]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
565
+
566
+ [rank6]: from user code:
567
+ [rank6]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
568
+ [rank6]: if window_size > 0:
569
+
570
+ [rank6]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
571
+
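The crash above is ordinary Python surfacing through TorchDynamo: the GatedDeltaNetConfig printed earlier in these logs sets "window_size": null under attn, that None reaches parallel_nsa, and line 862 of fla/ops/nsa/parallel.py compares it to an int with `if window_size > 0:`. A minimal reproduction plus a None-safe guard, offered as a sketch of a workaround rather than the upstream fix:

    # Reproduce the TypeError that Dynamo re-raises as InternalTorchDynamoError
    window_size = None            # mirrors "window_size": null in the model config
    try:
        if window_size > 0:       # the comparison at fla/ops/nsa/parallel.py:862
            pass
    except TypeError as e:
        print(e)                  # '>' not supported between instances of 'NoneType' and 'int'

    # None-safe guard (suggested workaround; not necessarily how upstream fixed it)
    if window_size is not None and window_size > 0:
        ...                       # sliding-window branch

Giving attn.window_size a positive integer in gdn_6_nsa_1_340M.json would also avoid the comparison, but that enables the sliding-window branch and changes the model, so the guard is the less invasive change.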
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/6/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/7/error.json ADDED
@@ -0,0 +1 @@
1
+ {"message": {"message": "InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n", "extraInfo": {"py_callstack": "Traceback (most recent call last):\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 355, in wrapper\n return f(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py\", line 488, in main\n output = model(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py\", line 172, in wrapped_func\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 424, in forward\n outputs = self.model(\n ^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 294, in forward\n hidden_states, attentions, past_key_values = layer(\n ^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n 
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1857, in _call_impl\n return inner()\n ^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1805, in inner\n result = forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py\", line 655, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 108, in forward\n hidden_states = self.attn_norm(hidden_states)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py\", line 109, in torch_dynamo_resume_in_forward_at_108\n hidden_states, attentions, past_key_values = self.attn(\n ^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1751, in _wrapped_call_impl\n return self._call_impl(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py\", line 1762, in _call_impl\n return forward_call(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 108, in forward\n q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py\", line 123, in torch_dynamo_resume_in_forward_at_108\n o = parallel_nsa(\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 838, in parallel_nsa\n o_cmp, lse_cmp = parallel_nsa_compression(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838\n o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, 
block_size, scale, cu_seqlens)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1432, in __call__\n return self._torchdynamo_orig_callable(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1213, in __call__\n result = self._inner_convert(\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 598, in __call__\n return _compile(\n ^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1110, in _compile\n raise InternalTorchDynamoError(\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 1059, in _compile\n guarded_code = compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py\", line 97, in wrapper_function\n return function(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 761, in compile_inner\n return _compile_inner(code, one_graph, hooks, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 797, in _compile_inner\n out_code = transform_code_object(code, transform)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py\", line 1422, in transform_code_object\n transformations(instructions, code_options)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 257, in _fn\n return fn(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py\", line 715, in transform\n tracer.run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 3498, in run\n super().run()\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1337, in run\n while self.step():\n ^^^^^^^^^^^\n File 
\"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 1246, in step\n self.dispatch_table[inst.opcode](self, inst)\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py\", line 2157, in COMPARE_OP\n self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 789, in <lambda>\n return lambda tx, args, kwargs: obj.call_function(\n ^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 1111, in call_function\n return handler(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 945, in builtin_dispatch\n rv = fn(tx, args, kwargs)\n ^^^^^^^^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 839, in call_binop_handlers\n rv = fn(tx, *args)\n ^^^^^^^^^^^^^\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py\", line 533, in compare_by_value\n return ConstantVariable(op(a.value, b.value))\n ^^^^^^^^^^^^^^^^^^^^\ntorch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'\n\nfrom user code:\n File \"/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py\", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857\n if window_size > 0:\n\nSet TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS=\"+dynamo\"\n\n", "timestamp": "1753352475"}}}
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/7/stderr.log ADDED
@@ -0,0 +1,571 @@
1
+ [titan] 2025-07-24 18:19:06,088 - root - INFO - Starting job: default job
2
+ [titan] 2025-07-24 18:19:06,088 - root - INFO - {
3
+ "activation_checkpoint": {
4
+ "mode": "none",
5
+ "selective_ac_option": "2"
6
+ },
7
+ "activation_offload": {
8
+ "mode": "none"
9
+ },
10
+ "checkpoint": {
11
+ "async_mode": "disabled",
12
+ "create_seed_checkpoint": false,
13
+ "enable_checkpoint": true,
14
+ "exclude_from_loading": [],
15
+ "export_dtype": "bfloat16",
16
+ "folder": "checkpoint",
17
+ "interval": 8192,
18
+ "interval_type": "steps",
19
+ "keep_latest_k": 100,
20
+ "load_step": -1,
21
+ "model_weights_only": false
22
+ },
23
+ "comm": {
24
+ "init_timeout_seconds": 300,
25
+ "trace_buf_size": 20000,
26
+ "train_timeout_seconds": 100
27
+ },
28
+ "experimental": {
29
+ "context_parallel_degree": 1,
30
+ "context_parallel_rotate_method": "allgather",
31
+ "custom_model_path": "",
32
+ "enable_async_tensor_parallel": false,
33
+ "enable_compiled_autograd": false,
34
+ "pipeline_parallel_degree": 1,
35
+ "pipeline_parallel_microbatches": null,
36
+ "pipeline_parallel_schedule": "1F1B",
37
+ "pipeline_parallel_schedule_csv": "",
38
+ "pipeline_parallel_split_points": []
39
+ },
40
+ "fault_tolerance": {
41
+ "enable": false,
42
+ "group_size": 0,
43
+ "min_replica_size": 1,
44
+ "replica_id": 0
45
+ },
46
+ "float8": {
47
+ "enable_fsdp_float8_all_gather": false,
48
+ "force_recompute_fp8_weight_in_bwd": false,
49
+ "precompute_float8_dynamic_scale_for_fsdp": false,
50
+ "recipe_name": null
51
+ },
52
+ "job": {
53
+ "config_file": "flame/models/fla.toml",
54
+ "description": "default job",
55
+ "dump_folder": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2",
56
+ "print_args": true,
57
+ "use_for_integration_test": false
58
+ },
59
+ "lr_scheduler": {
60
+ "decay_ratio": 1.0,
61
+ "decay_type": "linear",
62
+ "lr_min": 0.01,
63
+ "warmup_steps": 100
64
+ },
65
+ "memory_estimation": {
66
+ "disable_fake_mode": false,
67
+ "enabled": false
68
+ },
69
+ "metrics": {
70
+ "disable_color_printing": false,
71
+ "enable_tensorboard": true,
72
+ "enable_wandb": true,
73
+ "log_freq": 1,
74
+ "save_for_all_ranks": false,
75
+ "save_tb_folder": "tb"
76
+ },
77
+ "model": {
78
+ "config": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json",
79
+ "converters": [],
80
+ "name": "fla",
81
+ "print_after_conversion": false,
82
+ "tokenizer_path": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer"
83
+ },
84
+ "optimizer": {
85
+ "early_step_in_backward": false,
86
+ "eps": 1e-08,
87
+ "implementation": "fused",
88
+ "lr": 0.0003,
89
+ "name": "AdamW"
90
+ },
91
+ "profiling": {
92
+ "enable_memory_snapshot": false,
93
+ "enable_profiling": true,
94
+ "profile_freq": 512,
95
+ "save_memory_snapshot_folder": "memory_snapshot",
96
+ "save_traces_folder": "profile_trace"
97
+ },
98
+ "training": {
99
+ "batch_size": 8,
100
+ "compile": true,
101
+ "context_len": 8192,
102
+ "data_dir": null,
103
+ "data_files": null,
104
+ "data_parallel_replicate_degree": 1,
105
+ "data_parallel_shard_degree": -1,
106
+ "data_probs": "0.55,0.3,0.15",
107
+ "dataset": "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro",
108
+ "dataset_name": "default,default,default",
109
+ "dataset_split": "train,train,train",
110
+ "deterministic": false,
111
+ "disable_loss_parallel": false,
112
+ "enable_cpu_offload": false,
113
+ "fsdp_reshard_after_forward": "default",
114
+ "gc_freq": 50,
115
+ "gradient_accumulation_steps": 2,
116
+ "max_norm": 1.0,
117
+ "mixed_precision_param": "bfloat16",
118
+ "mixed_precision_reduce": "float32",
119
+ "num_workers": 32,
120
+ "persistent_workers": false,
121
+ "pin_memory": false,
122
+ "prefetch_factor": 2,
123
+ "seed": 42,
124
+ "seq_len": 8192,
125
+ "skip_nan_inf": true,
126
+ "steps": 95366,
127
+ "streaming": true,
128
+ "tensor_parallel_degree": 1,
129
+ "varlen": false
130
+ }
131
+ }
132
+ [titan] 2025-07-24 18:19:06,089 - root - INFO - [GC] Initial GC collection. 0.00 seconds.
133
+ [titan] 2025-07-24 18:19:06,815 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
134
+ [titan] 2025-07-24 18:19:06,817 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
135
+ [titan] 2025-07-24 18:19:06,874 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
136
+ [titan] 2025-07-24 18:19:06,874 - root - INFO - Peak FLOPS used for computing MFU: 3.120e+14
137
+ [titan] 2025-07-24 18:19:06,875 - root - INFO - Building 1-D device mesh with ['dp_shard'], [8]
138
+ [titan] 2025-07-24 18:19:06,881 - root - INFO - Loading tokenizer...
139
+ [titan] 2025-07-24 18:19:06,985 - root - INFO - LlamaTokenizerFast(name_or_path='/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/tokenizer', vocab_size=32000, model_max_length=10000000000, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
140
+ 0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
141
+ 1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
142
+ 2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
143
+ }
144
+ )
145
+ [titan] 2025-07-24 18:19:06,986 - root - INFO - Loading dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged,/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default,default,default
146
+ [titan] 2025-07-24 18:19:07,101 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample:default (p = 0.550):
147
+ IterableDataset({
148
+ features: ['text', 'id', 'dump', 'url', 'file_path', 'language', 'language_score', 'token_count', 'score', 'int_score'],
149
+ num_shards: 140
150
+ })
151
+ [titan] 2025-07-24 18:19:07,101 - root - INFO - Shuffling the dataset with seed 42
152
+ [titan] 2025-07-24 18:19:07,101 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/fineweb-edu-sample has insufficient shards (140). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
153
+ [titan] 2025-07-24 18:20:11,991 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged:default (p = 0.300):
154
+ IterableDataset({
155
+ features: ['repo', 'content'],
156
+ num_shards: 1
157
+ })
158
+ [titan] 2025-07-24 18:20:11,991 - root - INFO - Shuffling the dataset with seed 42
159
+ [titan] 2025-07-24 18:20:11,992 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/small_repos_20B_sample_merged has insufficient shards (1). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
160
+ [titan] 2025-07-24 18:20:12,410 - root - INFO - Subset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro:default (p = 0.150):
161
+ IterableDataset({
162
+ features: ['text', 'cc-path', 'domain', 'lang', 'lang_score', 'timestamp', 'url', 'math_score'],
163
+ num_shards: 100
164
+ })
165
+ [titan] 2025-07-24 18:20:12,411 - root - INFO - Shuffling the dataset with seed 42
166
+ [titan] 2025-07-24 18:20:12,411 - root - WARNING - Dataset /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/dataset/megamath-web-pro has insufficient shards (100). Need 256 shards minimum for desired data parallel workers × 32 dataloader workers. Resharding dataset to 256 shards and disabling streaming mode.
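All three warnings apply one rule: the 1-D device mesh above has 8 data-parallel ranks and the training config requests 32 dataloader workers per rank, so each dataset needs at least 8 × 32 = 256 shards; anything smaller is resharded to 256 and dropped from streaming mode. A sketch of that check with hypothetical names (not flame's actual code):

    dp_ranks, num_workers = 8, 32           # device mesh [8]; training.num_workers = 32
    min_shards = dp_ranks * num_workers     # 256
    for name, shards in [("fineweb-edu-sample", 140),
                         ("small_repos_20B_sample_merged", 1),
                         ("megamath-web-pro", 100)]:
        if shards < min_shards:
            print(f"{name}: reshard {shards} -> {min_shards}, disable streaming")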
167
+ [titan] 2025-07-24 18:20:19,060 - root - INFO - Interleaving 3 datasets with probabilities [0.55, 0.3, 0.15]
168
+ [titan] 2025-07-24 18:20:19,771 - root - INFO - IterableDataset({
169
+ features: ['text', 'content'],
170
+ num_shards: 256
171
+ })
172
+ [titan] 2025-07-24 18:20:19,895 - root - INFO - Building dataloader...
173
+ [titan] 2025-07-24 18:20:19,898 - root - INFO - Loading model config from /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/configs/gdn_6_nsa_1_340M.json
174
+ [titan] 2025-07-24 18:20:19,900 - root - INFO - Building model from the config
175
+ GatedDeltaNetConfig {
176
+ "allow_neg_eigval": false,
177
+ "architectures": [
178
+ "GatedDeltaNetForCausalLM"
179
+ ],
180
+ "attn": {
181
+ "block_counts": 16,
182
+ "block_size": 64,
183
+ "layers": [
184
+ 5,
185
+ 11,
186
+ 17,
187
+ 23
188
+ ],
189
+ "num_heads": 32,
190
+ "num_kv_heads": 2,
191
+ "qkv_bias": false,
192
+ "rope_theta": 160000.0,
193
+ "type": "nsa",
194
+ "window_size": null
195
+ },
196
+ "attn_mode": "chunk",
197
+ "bos_token_id": 1,
198
+ "conv_size": 4,
199
+ "eos_token_id": 2,
200
+ "expand_k": 1,
201
+ "expand_v": 1,
202
+ "fuse_cross_entropy": true,
203
+ "fuse_norm": true,
204
+ "fuse_swiglu": true,
205
+ "head_dim": 256,
206
+ "hidden_act": "swish",
207
+ "hidden_ratio": 4,
208
+ "hidden_size": 1024,
209
+ "initializer_range": 0.02,
210
+ "intermediate_size": null,
211
+ "max_position_embeddings": 8192,
212
+ "model_type": "gated_deltanet",
213
+ "norm_eps": 1e-06,
214
+ "norm_first": false,
215
+ "num_heads": 4,
216
+ "num_hidden_layers": 24,
217
+ "num_v_heads": null,
218
+ "qk_activation": "silu",
219
+ "qk_norm": "l2",
220
+ "tie_word_embeddings": false,
221
+ "torch_dtype": "bfloat16",
222
+ "transformers_version": "4.53.3",
223
+ "use_beta": true,
224
+ "use_cache": true,
225
+ "use_gate": true,
226
+ "use_l2warp": false,
227
+ "use_output_norm": true,
228
+ "use_short_conv": true,
229
+ "vocab_size": 32000
230
+ }
231
+
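Two details of this config matter for what follows. First, attn places NSA blocks at layers 5, 11, 17, and 23, i.e. every sixth block of the 24, with GatedDeltaNet everywhere else; this appears to be the pattern behind the gdn_6_nsa_1 name. Second, "window_size" is null, which is the value that later reaches parallel_nsa and triggers the crash. A sketch deriving the layer schedule from the fields above (illustration only, not fla's construction code):

    attn_layers = [5, 11, 17, 23]           # attn.layers above
    num_hidden_layers = 24
    schedule = ["nsa" if i in attn_layers else "gated_deltanet"
                for i in range(num_hidden_layers)]
    assert schedule.count("nsa") == 4       # matches the model printout below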
232
+ [titan] 2025-07-24 18:20:20,253 - root - INFO - 
233
+ GatedDeltaNetForCausalLM(
234
+ (model): GatedDeltaNetModel(
235
+ (embeddings): Embedding(32000, 1024)
236
+ (layers): ModuleList(
237
+ (0-4): 5 x GatedDeltaNetBlock(
238
+ (attn_norm): RMSNorm(1024, eps=1e-06)
239
+ (attn): GatedDeltaNet(
240
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
241
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
242
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
243
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
244
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
245
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
246
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
247
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
248
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
249
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
250
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
251
+ )
252
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
253
+ (mlp): GatedMLP(
254
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
255
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
256
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
257
+ (swiglu_linear): SwiGLULinear()
258
+ )
259
+ )
260
+ (5): GatedDeltaNetBlock(
261
+ (attn_norm): RMSNorm(1024, eps=1e-06)
262
+ (attn): NativeSparseAttention(
263
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
264
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
265
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
266
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
267
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
268
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
269
+ )
270
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
271
+ (mlp): GatedMLP(
272
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
273
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
274
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
275
+ (swiglu_linear): SwiGLULinear()
276
+ )
277
+ )
278
+ (6-10): 5 x GatedDeltaNetBlock(
279
+ (attn_norm): RMSNorm(1024, eps=1e-06)
280
+ (attn): GatedDeltaNet(
281
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
282
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
283
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
284
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
285
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
286
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
287
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
288
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
289
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
290
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
291
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
292
+ )
293
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
294
+ (mlp): GatedMLP(
295
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
296
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
297
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
298
+ (swiglu_linear): SwiGLULinear()
299
+ )
300
+ )
301
+ (11): GatedDeltaNetBlock(
302
+ (attn_norm): RMSNorm(1024, eps=1e-06)
303
+ (attn): NativeSparseAttention(
304
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
305
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
306
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
307
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
308
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
309
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
310
+ )
311
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
312
+ (mlp): GatedMLP(
313
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
314
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
315
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
316
+ (swiglu_linear): SwiGLULinear()
317
+ )
318
+ )
319
+ (12-16): 5 x GatedDeltaNetBlock(
320
+ (attn_norm): RMSNorm(1024, eps=1e-06)
321
+ (attn): GatedDeltaNet(
322
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
323
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
324
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
325
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
326
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
327
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
328
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
329
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
330
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
331
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
332
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
333
+ )
334
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
335
+ (mlp): GatedMLP(
336
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
337
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
338
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
339
+ (swiglu_linear): SwiGLULinear()
340
+ )
341
+ )
342
+ (17): GatedDeltaNetBlock(
343
+ (attn_norm): RMSNorm(1024, eps=1e-06)
344
+ (attn): NativeSparseAttention(
345
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
346
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
347
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
348
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
349
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
350
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
351
+ )
352
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
353
+ (mlp): GatedMLP(
354
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
355
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
356
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
357
+ (swiglu_linear): SwiGLULinear()
358
+ )
359
+ )
360
+ (18-22): 5 x GatedDeltaNetBlock(
361
+ (attn_norm): RMSNorm(1024, eps=1e-06)
362
+ (attn): GatedDeltaNet(
363
+ (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
364
+ (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
365
+ (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
366
+ (a_proj): Linear(in_features=1024, out_features=4, bias=False)
367
+ (b_proj): Linear(in_features=1024, out_features=4, bias=False)
368
+ (q_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
369
+ (k_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
370
+ (v_conv1d): ShortConvolution(1024, 1024, kernel_size=(4,), stride=(1,), padding=(3,), groups=1024, bias=False, activation=silu, backend=cuda)
371
+ (g_proj): Linear(in_features=1024, out_features=1024, bias=False)
372
+ (o_norm): FusedRMSNormGated(256, eps=1e-06, activation=swish)
373
+ (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
374
+ )
375
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
376
+ (mlp): GatedMLP(
377
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
378
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
379
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
380
+ (swiglu_linear): SwiGLULinear()
381
+ )
382
+ )
383
+ (23): GatedDeltaNetBlock(
384
+ (attn_norm): RMSNorm(1024, eps=1e-06)
385
+ (attn): NativeSparseAttention(
386
+ (q_proj): Linear(in_features=1024, out_features=2048, bias=False)
387
+ (k_proj): Linear(in_features=1024, out_features=128, bias=False)
388
+ (v_proj): Linear(in_features=1024, out_features=128, bias=False)
389
+ (g_proj): Linear(in_features=1024, out_features=96, bias=False)
390
+ (o_proj): Linear(in_features=2048, out_features=1024, bias=False)
391
+ (rotary): RotaryEmbedding(dim=64, base=160000.0, interleaved=False, pos_idx_in_fp32=True)
392
+ )
393
+ (mlp_norm): RMSNorm(1024, eps=1e-06)
394
+ (mlp): GatedMLP(
395
+ (gate_proj): Linear(in_features=1024, out_features=2816, bias=False)
396
+ (up_proj): Linear(in_features=1024, out_features=2816, bias=False)
397
+ (down_proj): Linear(in_features=2816, out_features=1024, bias=False)
398
+ (swiglu_linear): SwiGLULinear()
399
+ )
400
+ )
401
+ )
402
+ (norm): RMSNorm(1024, eps=1e-06)
403
+ )
404
+ (lm_head): Linear(in_features=1024, out_features=32000, bias=False)
405
+ (criterion): FusedLinearCrossEntropyLoss()
406
+ )
407
+
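The NativeSparseAttention projection widths in this printout line up with the attn config: 32 query heads and 2 KV heads at 64 dims per head (the rotary dim) give q_proj 32 × 64 = 2048 and k_proj/v_proj 2 × 64 = 128. The 96-wide g_proj reads as one gate per query head for each of NSA's three branches, though that interpretation is inferred from the shapes rather than taken from the code:

    num_heads, num_kv_heads, head_dim = 32, 2, 64
    assert num_heads * head_dim == 2048     # q_proj out / o_proj in
    assert num_kv_heads * head_dim == 128   # k_proj / v_proj out
    assert num_heads * 3 == 96              # g_proj out: 3 gates per head (inferred)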
408
+ [titan] 2025-07-24 18:20:20,288 - root - INFO - Compiling each block with torch.compile
409
+ [titan] 2025-07-24 18:20:20,288 - root - INFO - Compiling the embedding, norm, and lm_head layers with torch.compile
410
+ [titan] 2025-07-24 18:20:20,289 - root - INFO - Compiling the entire model with torch.compile
411
+ [titan] 2025-07-24 18:20:20,376 - root - INFO - Applied FSDP to the model
412
+ [titan] 2025-07-24 18:20:20,441 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `A_log` is a DTensor, skipping initialization
413
+ [titan] 2025-07-24 18:20:20,442 - fla.models.gated_deltanet.modeling_gated_deltanet - WARNING - `dt_bias` is a DTensor, skipping initialization
414
+ [titan] 2025-07-24 18:20:20,539 - root - INFO - CUDA memory usage for model: 0.10GiB(0.10%)
415
+ [titan] 2025-07-24 18:20:20,541 - root - WARNING - Warmup (100) + decay (95366) steps exceed total training steps (95366). Adjusting decay steps to 95266.
416
+ [titan] 2025-07-24 18:20:20,564 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/checkpoint
417
+ [titan] 2025-07-24 18:20:20,565 - root - INFO - CUDA capacity: NVIDIA H20 with 95.00GiB memory
418
+ [titan] 2025-07-24 18:20:20,622 - root - WARNING - Peak flops undefined for: NVIDIA H20, fallback to A100
419
+ [titan] 2025-07-24 18:20:29,001 - root - INFO - ***** Running training *****
420
+ [titan] 2025-07-24 18:20:29,002 - root - INFO -  Training starts at step 1
421
+ [titan] 2025-07-24 18:20:29,002 - root - INFO -  Number of tokens per sequence = 8,192
422
+ [titan] 2025-07-24 18:20:29,002 - root - INFO -  Gradient Accumulation steps = 2
423
+ [titan] 2025-07-24 18:20:29,004 - root - INFO -  Instantaneous batch size (per device) = 8
424
+ [titan] 2025-07-24 18:20:29,005 - root - INFO -  Global batch size (w. parallel, distributed & accumulation) = 128 (1,048,576 tokens)
425
+ [titan] 2025-07-24 18:20:29,005 - root - INFO -  Total optimization steps = 95,366 (99,998,498,816 tokens)
426
+ [titan] 2025-07-24 18:20:29,005 - root - INFO -  Warmup steps = 100 (104,857,600 tokens)
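The token figures in this summary are internally consistent: 8 ranks × per-device batch 8 × gradient-accumulation 2 gives 128 sequences per optimization step at 8,192 tokens each. A quick check against the logged numbers:

    ranks, per_device_bs, grad_accum, seq_len = 8, 8, 2, 8192
    tokens_per_step = ranks * per_device_bs * grad_accum * seq_len
    assert tokens_per_step == 1_048_576                   # global batch line
    assert 95_366 * tokens_per_step == 99_998_498_816     # total steps line
    assert 100 * tokens_per_step == 104_857_600           # warmup line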
427
+ [titan] 2025-07-24 18:20:29,005 - root - INFO -  Number of parameters = 396,695,712 
428
+ [titan] 2025-07-24 18:20:29,006 - root - INFO - Profiling active. Traces will be saved at /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/pretrain-linear-moe/flame/exp/bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/profile_trace
429
+ /mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/functions.py:1263: UserWarning: Dynamo does not know how to trace the builtin `cuda_utils.get_device_properties.` This function is either a Python builtin (e.g. _warnings.warn) or a third-party C/C++ Python extension (perhaps created with pybind).
430
+ If it is a Python builtin, please file an issue on GitHub so the PyTorch team can add support for it and see the next case for a workaround.
431
+ If it is a third-party C/C++ Python extension, please either wrap it into a PyTorch-understood custom operator (see https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html for more details) or, if it is traceable, use `torch.compiler.allow_in_graph`.
432
+ torch._dynamo.utils.warn_once(explanation + "\n" + "\n".join(hints))
+ [rank7]: Traceback (most recent call last):
+ [rank7]: File "<frozen runpy>", line 198, in _run_module_as_main
+ [rank7]: File "<frozen runpy>", line 88, in _run_code
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 616, in <module>
+ [rank7]: main(config)
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
+ [rank7]: return f(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/flame/train.py", line 488, in main
+ [rank7]: output = model(
+ [rank7]: ^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank7]: return self._call_impl(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank7]: return inner()
+ [rank7]: ^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank7]: result = forward_call(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
+ [rank7]: return func(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 424, in forward
+ [rank7]: outputs = self.model(
+ [rank7]: ^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank7]: return self._call_impl(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank7]: return forward_call(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 294, in forward
+ [rank7]: hidden_states, attentions, past_key_values = layer(
+ [rank7]: ^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank7]: return self._call_impl(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1857, in _call_impl
+ [rank7]: return inner()
+ [rank7]: ^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1805, in inner
+ [rank7]: result = forward_call(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
+ [rank7]: return fn(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank7]: return self._call_impl(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank7]: return forward_call(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 108, in forward
+ [rank7]: hidden_states = self.attn_norm(hidden_states)
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/models/gated_deltanet/modeling_gated_deltanet.py", line 109, in torch_dynamo_resume_in_forward_at_108
+ [rank7]: hidden_states, attentions, past_key_values = self.attn(
+ [rank7]: ^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
+ [rank7]: return self._call_impl(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
+ [rank7]: return forward_call(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 108, in forward
+ [rank7]: q, k = self.rotary(q, k, seqlen_offset=seqlen_offset, max_seqlen=max_seqlen, cu_seqlens=cu_seqlens)
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/layers/nsa.py", line 123, in torch_dynamo_resume_in_forward_at_108
+ [rank7]: o = parallel_nsa(
+ [rank7]: ^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 838, in parallel_nsa
+ [rank7]: o_cmp, lse_cmp = parallel_nsa_compression(
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 857, in torch_dynamo_resume_in_parallel_nsa_at_838
+ [rank7]: o = o_slc = ParallelNSAFunction.apply(q, k, v, block_indices, block_counts, block_size, scale, cu_seqlens)
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1432, in __call__
+ [rank7]: return self._torchdynamo_orig_callable(
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1213, in __call__
+ [rank7]: result = self._inner_convert(
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 598, in __call__
+ [rank7]: return _compile(
+ [rank7]: ^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1110, in _compile
+ [rank7]: raise InternalTorchDynamoError(
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 1059, in _compile
+ [rank7]: guarded_code = compile_inner(code, one_graph, hooks, transform)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_utils_internal.py", line 97, in wrapper_function
+ [rank7]: return function(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 761, in compile_inner
+ [rank7]: return _compile_inner(code, one_graph, hooks, transform)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 797, in _compile_inner
+ [rank7]: out_code = transform_code_object(code, transform)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1422, in transform_code_object
+ [rank7]: transformations(instructions, code_options)
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 257, in _fn
+ [rank7]: return fn(*args, **kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 715, in transform
+ [rank7]: tracer.run()
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 3498, in run
+ [rank7]: super().run()
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1337, in run
+ [rank7]: while self.step():
+ [rank7]: ^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 1246, in step
+ [rank7]: self.dispatch_table[inst.opcode](self, inst)
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2157, in COMPARE_OP
+ [rank7]: self.push(compare_op_handlers[inst.argval](self, self.popn(2), {}))
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
+ [rank7]: return handler(tx, args, kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 789, in <lambda>
+ [rank7]: return lambda tx, args, kwargs: obj.call_function(
+ [rank7]: ^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 1111, in call_function
+ [rank7]: return handler(tx, args, kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 945, in builtin_dispatch
+ [rank7]: rv = fn(tx, args, kwargs)
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 839, in call_binop_handlers
+ [rank7]: rv = fn(tx, *args)
+ [rank7]: ^^^^^^^^^^^^^
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flame/.venv/lib/python3.11/site-packages/torch/_dynamo/variables/builtin.py", line 533, in compare_by_value
+ [rank7]: return ConstantVariable(op(a.value, b.value))
+ [rank7]: ^^^^^^^^^^^^^^^^^^^^
+ [rank7]: torch._dynamo.exc.InternalTorchDynamoError: TypeError: '>' not supported between instances of 'NoneType' and 'int'
+
+ [rank7]: from user code:
+ [rank7]: File "/mnt/nanjingcephfs/project_wx-rec-alg-bdc-exp/bwzheng/yulan/hyw/dataset/Selection/pretrain-linear-moe/flash-linear-attention/fla/ops/nsa/parallel.py", line 862, in torch_dynamo_resume_in_parallel_nsa_at_857
+ [rank7]: if window_size > 0:
+
+ [rank7]: Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"
+
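Aside: the root cause is in the user code shown above — `parallel_nsa` reaches `if window_size > 0:` with `window_size=None`, and Dynamo's constant folding (`compare_by_value`) raises the same TypeError that eager Python would. A minimal sketch of the kind of guard that avoids it; the function name and return logic here are illustrative only, and the actual control flow in fla/ops/nsa/parallel.py may differ:

def pick_output(o_slc, o_swa, window_size=None):
    # Treat window_size=None explicitly as "no sliding window" so the
    # comparison never evaluates NoneType > int, eagerly or under Dynamo.
    if window_size is not None and window_size > 0:
        return o_swa
    return o_slc

Equivalently, passing an integer window_size (e.g. 0) from the layer config instead of leaving it None would sidestep the crash without touching the op.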
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_3w7_34bf/attempt_0/7/stdout.log ADDED
File without changes
bf16-gdn_6_nsa_1_340M.json-ctx8192-steps95366-lr3e-4-decay_typelinear-decay_ratio1-bs8-nn1-gas2/logs/none_ij1w4wht/attempt_0/0/stderr.log ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:24ee09962378f99896314a1b7b782822c8cbdf18ec5a1b15bfae2560104cfa85
+ size 28822522