how do you sync a trillion parameter model every RL step without a shared cluster? we just wrote a blog about it, led by @aminediroHF
what I like the most is the way it proves you can use the Hub for basically everything: trainer on one machine, vLLM in a HF Space, the wordle env in another HF Space and weights going through a Hub Bucket. no shared cluster, just HTTPS
it works because ~99% of bf16 weights don't change between RL steps so you only sync the diff. that cuts the payload from 1.2 GB to 25 MB per step
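a minimal sketch of the "only sync the diff" idea, not the blog's actual implementation. it treats weights as a flat list of bf16 bit patterns and ships only the (index, value) pairs that changed. the helper names, the 4-byte index / 2-byte value payload format and the toy data are all assumptions made for illustration:

```python
import struct


def to_bf16_bits(x: float) -> int:
    """bf16 bit pattern of x: top 16 bits of its float32 encoding.

    Assumption: plain truncation instead of round-to-nearest, to keep it short.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def make_diff(old, new):
    """Return (index, new_bits) for every position whose bf16 bits changed."""
    return [(i, n) for i, (o, n) in enumerate(zip(old, new)) if o != n]


def apply_diff(weights, diff):
    """Rebuild the new weights on the receiver from its old copy plus the diff."""
    patched = list(weights)
    for index, value in diff:
        patched[index] = value
    return patched


# Toy "model": 1000 weights, of which only 3 change in this step.
old_weights = [to_bf16_bits(0.5 + i * 1e-3) for i in range(1000)]
new_weights = list(old_weights)
for i in (3, 250, 999):
    new_weights[i] ^= 1  # flip the lowest mantissa bit

diff = make_diff(old_weights, new_weights)

# Hypothetical payload format: 4-byte index + 2-byte bf16 value per entry.
full_bytes = 2 * len(new_weights)
diff_bytes = (4 + 2) * len(diff)
print(f"full sync: {full_bytes} bytes, diff sync: {diff_bytes} bytes")
```

the receiver keeps its previous copy of the weights and calls `apply_diff` on it, so the full tensor never has to travel again after the first sync.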
most multi-turn RL loops have a silent bug: you decode the model's output to detect tool calls, then re-tokenize the conversation for the next turn. BPE isn't invertible, so decode then re-encode can land on different ids. gradient ends up on tokens the model never sampled. no crash, just quietly wrong math and broken training
@qgallouedec wrote a super educational blog on MITO (message-in, token-out) vs TITO (token-in, token-out) and how you might fix the problem above
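a toy illustration of the decode then re-encode mismatch described above. this is not a real BPE tokenizer: the four-token vocabulary is made up, and greedy longest-match encoding stands in for real merge rules. the point is only that several id sequences can decode to the same string, while the encoder returns just one of them:

```python
# Hypothetical vocabulary: "<tool>" exists both as one token and as three pieces.
TOKEN_TO_ID = {"<": 0, "tool": 1, ">": 2, "<tool>": 3}
ID_TO_TOKEN = {i: t for t, i in TOKEN_TO_ID.items()}


def decode(ids):
    """Concatenate token strings (what a MITO loop does to inspect the output)."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


def encode(text):
    """Greedy longest-match encoding, a simplified stand-in for BPE merges."""
    ids, pos = [], 0
    while pos < len(text):
        for length in range(len(text) - pos, 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"cannot encode {text[pos:]!r}")
    return ids


sampled_ids = [0, 1, 2]              # what the model actually sampled
text = decode(sampled_ids)           # "<tool>"
reencoded_ids = encode(text)         # what the trainer sees after re-tokenizing

print(sampled_ids, "->", repr(text), "->", reencoded_ids)
```

if the loss is computed on `reencoded_ids`, the gradient lands on a token the model never sampled. keeping the sampled ids around (the TITO approach) avoids the round trip entirely.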
@ariG23498 is starting a blog series about profiling in pytorch and part 1 just dropped
takes you from the simplest scenario to actually knowing what your gpu is doing. if you have never opened a profiler trace this is where you start
covers torch.profiler from scratch. reading tables and traces, overhead bound vs compute bound, the full dispatch chain from python to gpu kernels, and what torch.compile is actually fusing under the hood
In 2017, Google released the Transformer architecture. While it was clear the model was promising, absolutely no one (including its authors) anticipated the pervasive global revolution it would create!
The authors actually viewed the Transformer as just a stepping stone for a much more ambitious project: The MultiModel.
Their ultimate goal, back in 2017, was to build a single deep learning architecture capable of jointly learning massive, diverse tasks across entirely different domains: One Model To Learn Them All.
In fact, the MultiModel paper was published in the exact same month as Attention Is All You Need!
But history had other plans. The building block eclipsed the grand design!
So, have you heard about the MultiModel before?
If you have a github repo, you basically have an RL training environment
We're introducing Repo2RLEnv (built by @AdithyaSK), a tool that mines PRs, commits, CVEs and turns them into verifiable sandboxed tasks with real reward signals, automatically
Outputs to Harbor spec so you can plug it straight into RL training or coding-agent eval
SRT-introspect: Live Token-by-Token Readout of LLM Internal Reasoning
I have released SRT-introspect, a new public demonstration that makes the hidden reasoning process of a frozen large language model visible in real time.
The interface runs a Qwen-2.5-7B backbone equipped with the SRT Adapter and Activation Verbalizer. As the model generates each token, the system continuously measures divergence across attention heads, identifies high-signal moments, and translates the corresponding hidden-state object representations into natural-language verbalizations. You see exactly what the model is internally representing at the precise points where its computation is most active, complete with divergence scores, reflexivity estimates, and per-layer traces.
This is not a summary of the final output. It is a direct window into the model's latent conceptual landscape, showing the dominant training-data attractors that activate even when the prompt asks for first-principles reasoning. The adaptive scheduler concentrates verbalizations precisely where the real internal work occurs, turning what used to be opaque black-box generation into observable, analyzable data.
The result is the clearest public demonstration yet that modern LLMs possess a rich, structured semiotic infrastructure that can now be audited without retraining or fine-tuning.
A single forward pass of the frozen Qwen-2.5-7B model plus a lightweight classifier reaches 0.866 ± 0.011 AUC on the full TruthfulQA-MC2 benchmark. No adapters. No fine-tuning. No extra parameters on the backbone.
This is the strongest hidden-state truthfulness detector reported on the benchmark to date.
The same latent features that the SRT-NLA-AV-v1 demo reads out as coherent natural-language verbalizations turn out to be rich enough to support production-grade auditing for honesty versus hallucination. The internal semiotic infrastructure we have been exploring in public is already information-dense enough to solve hard downstream problems with almost trivial overhead.
🧠 New Space: MindReader-NLA. Ask a frozen LM what it's thinking, in plain English.
A trained Activation Verbalizer (~5–13M params, frozen backbone) over Qwen-2.5-7B, Llama-3.2-3B, and Gemma-2-2B. Three demos in one Space:
Playground: sample K verbalizations of the layer-L hidden state and score how well each reproduces the original activation when fed back through the same frozen model (raw + anisotropy-centred cosine FVE).
Live Thought Trace: stream a verbalization per token as the model writes, side-by-side with the generation.
Steer-by-Editing: edit the verbalized thought, project it back into hidden-state space, and watch the continuation change.
Natural Language Autoencoders: A Window into Latent Structure
I introduced a concise mathematical formulation of the P versus NP question into the SRT-NLA-AV-v1 demonstration:
P vs NP asks whether every problem whose solution can be verified in polynomial time (NP) can also be solved in polynomial time (P). Integer factorization (given N = p·q where p and q are large primes, p < q) is in NP but widely believed not to be in P.
The resulting activation verbalization (best-of-N, reranked by AR fidelity) surfaced:
"This article originally appeared in the August 2016 edition of CACM. A new method of proving computational hardness of problems, known as multilinearization, can improve efficiency, reduce complexity and simplify proofs. In this article, I describe multilinearization and its application to several key problems, from the discrete logarithm and factoring to RSA and elliptic-curve discrete logarithms."
What emerges is not a literal restatement, but a structured articulation of the model's internal associations: hardness proofs, algebraic techniques, and the cryptographic implications that orbit this foundational question in computational complexity.
The demo offers a compelling interface for exploring these latent representations.
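For readers less familiar with the factoring prompt used above, the verify-versus-solve asymmetry can be made concrete with a small sketch. It is independent of the demo; the function names and the toy value of N are chosen here purely for illustration. Checking a claimed factorization is a single multiplication, while the naive search shown here (trial division) needs on the order of √N steps, which is exponential in the number of bits of N:

```python
def verify_factorization(n: int, p: int, q: int) -> bool:
    """Check a claimed certificate (p, q) for n: one multiplication.

    Note: this does not test that p and q are prime, only that they
    are nontrivial factors with p < q.
    """
    return 1 < p < q and p * q == n


def factor_by_trial_division(n: int):
    """Find a nontrivial factorization by brute force.

    Takes about sqrt(n) iterations in the worst case, i.e. exponential
    in the bit length of n.
    """
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return None  # n is prime (or < 4)


N = 101 * 103  # a toy semiprime; cryptographic N has hundreds of digits

p, q = factor_by_trial_division(N)
print(p, q, verify_factorization(N, p, q))
```

No polynomial-time classical algorithm for factoring is known, which is the gap the prompt points at.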
Turns out: if we predict earth, we can spend more time looking for interesting things and less time looking at things that we expect to see.
Sentinel-2 imagery 🛰️ basically takes a long time to download to earth, so our "near real time" systems are quite far from that in practical terms.
meanwhile, if we "predict" what we will see, based on what we do see, we can send down much less data in a timely way, and prioritize 📡 earth-bound response.
I'm talking about illegal fishing, logging, mining or building in nature reserves: the more of that we predict early, the more we're able to stop it in time.
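a rough sketch of the "predict, then only send the surprises" idea, under heavy assumptions. it is not the actual system: the tile ids, the per-tile scalar values (think of something like a vegetation index), the threshold and the function name are all hypothetical. tiles whose observation is close to the prediction stay on board, the rest are queued for downlink with the most surprising first:

```python
def select_tiles_for_downlink(predicted, observed, threshold):
    """Return tile ids whose prediction error exceeds threshold, largest error first."""
    errors = {tile: abs(observed[tile] - predicted[tile]) for tile in observed}
    surprising = [tile for tile, err in errors.items() if err > threshold]
    return sorted(surprising, key=lambda tile: errors[tile], reverse=True)


# Hypothetical per-tile values: what the model expected vs what the sensor saw.
predicted = {"A1": 0.80, "A2": 0.75, "B1": 0.78, "B2": 0.82}
observed = {"A1": 0.79, "A2": 0.40, "B1": 0.77, "B2": 0.60}

queue = select_tiles_for_downlink(predicted, observed, threshold=0.1)
print(queue)  # tiles to send first; everything else matched expectations
```

here two of the four tiles match the prediction and never need to be sent, and the tile with the biggest change goes to the front of the queue.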
Going forward, I plan to upload models from major 1st party developers without the author name attached, for cleanliness. I feel it results in a nicer and more expected user experience
I will continue to upload fine tunes with that author + "_" appended for clarity. I personally feel it's nice to know at a glance whose tune it is, but it's also for the reason I first started doing it: to avoid it being confused for a new version of the official release
I hope this change makes sense, it seemed most reasonable to me and a poll I did (forever ago, I move slow sometimes) made it seem likely others would find it reasonable as well (feel free to let me know if you disagree, may not change my mind but I do value knowing what others think)
@retrain-pipelines v0.2.0 is out! I'm at Station F at my booth with GOSIM Paris 2026 today & tomorrow. Come meet me for a live in-person demo and a chat!