Policy optimization (PO) algorithms are central to training AI models with preference-based feedback. In recent weeks, numerous new PO methods have emerged that build on or replace the popular PPO and GRPO, addressing their shortcomings. Here are 11 of them:
3. Asymmetric Importance Sampling Policy Optimization (ASPO) → ASPO: Asymmetric Importance Sampling Policy Optimization (2510.06062) Fixes imbalanced token weighting in LLM training. It flips the importance sampling ratios for positive tokens to correct over- and under-updates, and adds a soft dual-clipping step to keep gradients stable.
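The ASPO idea above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name, hyperparameters, and clipping bounds are all illustrative assumptions.

```python
import numpy as np

def aspo_token_weights(logp_new, logp_old, advantages,
                       clip_low=0.2, clip_high=0.2):
    """Hedged sketch of ASPO-style asymmetric importance weighting.

    For tokens with positive advantage, the importance ratio is
    flipped (reciprocal) to counteract over-updating of already
    likely tokens; a clip then bounds the result on both sides
    (a stand-in for the paper's soft dual-clipping).
    """
    ratio = np.exp(logp_new - logp_old)              # standard IS ratio
    flipped = np.where(advantages > 0, 1.0 / ratio, ratio)
    clipped = np.clip(flipped, 1.0 - clip_low, 1.0 + clip_high)
    return clipped * advantages
```

For a token the new policy already likes twice as much (ratio 2.0) with positive advantage, the flipped ratio 0.5 gets clipped up to 0.8, damping the update instead of amplifying it.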
4. In-Context Steered Policy Optimization (ICPO) → https://arxiv.org/abs/2510.26519 Uses a model's own in-context learning ability to guide training with existing data. It combines Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration and adds Expert Region Reject Sampling and Annealed Expert-Bonus Reward Shaping to ensure stability and balanced expert influence.
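A very rough sketch of the ICPO-style batch mixing described above. Every name, threshold, and schedule here is an illustrative assumption, not the paper's formulation: it only shows the shape of mixing on-policy samples with expert-guided ones, rejecting weak expert samples, and annealing an expert bonus.

```python
import random

def mixed_policy_batch(on_policy, expert, reward_fn,
                       expert_frac=0.25, min_expert_reward=0.5,
                       bonus=0.1, step=0, anneal=1000):
    """Illustrative ICPO-flavored batch construction (all names
    hypothetical): blend on-policy samples with expert-guided ones,
    drop expert samples below a reward threshold (reject sampling),
    and add an expert bonus that decays over training steps."""
    n_expert = int(len(on_policy) * expert_frac)
    kept = [s for s in random.sample(expert, min(n_expert, len(expert)))
            if reward_fn(s) >= min_expert_reward]   # expert-region reject sampling
    decay = max(0.0, 1.0 - step / anneal)           # annealed expert bonus
    batch = [(s, reward_fn(s)) for s in on_policy]
    batch += [(s, reward_fn(s) + bonus * decay) for s in kept]
    return batch
```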
5. Graph-Enhanced Policy Optimization (GEPO) → https://arxiv.org/abs/2510.26270 Builds a graph of an agent's experiences to understand how different states connect, guide exploration, and assign rewards more effectively.
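The GEPO item above can be sketched with a toy experience graph. This is a minimal illustration under my own assumptions (class and method names are hypothetical): states become nodes, observed transitions become edges, and proximity to a rewarding state yields a shaped bonus.

```python
from collections import defaultdict

class ExperienceGraph:
    """Illustrative GEPO-style experience graph: record transitions
    between states and use graph distance to rewarding states to
    shape exploration bonuses."""

    def __init__(self):
        self.edges = defaultdict(set)   # state -> successor states
        self.reward = {}                # state -> best observed reward

    def add_transition(self, s, s_next, r=0.0):
        self.edges[s].add(s_next)
        self.reward[s_next] = max(self.reward.get(s_next, 0.0), r)

    def shaped_bonus(self, s, gamma=0.9):
        # BFS outward from s; discount the nearest reward by depth
        seen, frontier, depth = {s}, [s], 0
        while frontier:
            for node in frontier:
                if self.reward.get(node, 0.0) > 0:
                    return gamma ** depth * self.reward[node]
            frontier = [n for u in frontier
                        for n in self.edges[u] if n not in seen]
            seen.update(frontier)
            depth += 1
        return 0.0
```

A state two hops from a reward of 1.0 would get a bonus of 0.9² = 0.81, nudging exploration toward known-good regions.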
The largest ever dataset of co-folded 3D protein-ligand structures just dropped on HF!!
Meet SAIR (Structurally Augmented IC50 Repository): 5M+ AI-generated complexes with experimentally measured drug potency data from SandboxAQ.
Extra Info: I tested the newly released Qwen-Image-Edit-Lightning-8steps-V1.0 (it arrived after the tutorial was recorded), and our existing preset, which uses Qwen-Image-Lightning-8steps-V1.1, is definitely better than it. So this tutorial and its presets are still the best quality and fully up to date.
Info: Qwen Image Edit has just been published, and since then I have been experimenting to prepare this tutorial for you. I walk through 26 unique cases and provide demo images and prompts. After watching this tutorial, your image-editing skills will move to the next level, I promise you that. It will also give you a lot of ideas.
I run Qwen3-Coder 480B locally on my Z8, with a 1-million token context window. It's the equivalent of parallel-parking a Nimitz-class carrier in a kiddie pool. Thanks to whatever dark pact the llama.cpp, CUDA, and kernel folks signed, hybrid inferencing + VRAM→RAM offload let me stream the model's synapses across Xeon, RAM, and four lonely A6000s without summoning either the OOM killer or a small house fire.