Solaris: Building a Multiplayer Video World Model in Minecraft
Abstract
Existing action-conditioned video generation models (video world models) are limited to single-agent perspectives, failing to capture the multi-agent interactions of real-world environments. We introduce Solaris, a multiplayer video world model that simulates consistent multi-view observations. To enable this, we develop a multiplayer data system designed for robust, continuous, and automated data collection on video games such as Minecraft. Unlike prior platforms built for single-player settings, our system supports coordinated multi-agent interaction and synchronized video-and-action capture. Using this system, we collect 12.64 million multiplayer frames and propose an evaluation framework covering multiplayer movement, memory, grounding, building, and view consistency. We train Solaris with a staged pipeline that progressively transitions from single-player to multiplayer modeling, combining bidirectional, causal, and Self Forcing training. In the final stage, we introduce Checkpointed Self Forcing, a memory-efficient Self Forcing variant that enables a longer-horizon teacher. Results show that our architecture and training design outperform existing baselines. By open-sourcing our system and models, we hope to lay the groundwork for a new generation of multi-agent world models.
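The abstract does not spell out how Checkpointed Self Forcing works, but a plausible reading is that the teacher's autoregressive (self-forcing) rollout is gradient-checkpointed per step, so intermediate activations are recomputed during the backward pass rather than stored, letting a longer horizon fit in memory. The sketch below illustrates that idea with PyTorch's `torch.utils.checkpoint`; the model and function names are hypothetical and not taken from the paper.

```python
# Hedged sketch of a checkpointed self-forcing rollout.
# Assumption: "Checkpointed Self Forcing" = rolling the model out on its
# own predictions while gradient-checkpointing each step, so per-step
# intermediate activations are recomputed in backward instead of stored.
import torch
from torch.utils.checkpoint import checkpoint


class TinyFrameModel(torch.nn.Module):
    """Stand-in for a video world model's one-step frame predictor."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.step = torch.nn.Linear(dim, dim)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.step(frame))


def checkpointed_self_forcing_rollout(model, first_frame, horizon):
    """Feed each predicted frame back in (self forcing), checkpointing
    every step: only the step inputs/outputs are kept, and the activations
    inside each step are recomputed during backward."""
    frame = first_frame
    frames = []
    for _ in range(horizon):
        frame = checkpoint(model, frame, use_reentrant=False)
        frames.append(frame)
    return torch.stack(frames)  # (horizon, batch, dim)


model = TinyFrameModel()
x0 = torch.randn(2, 8, requires_grad=True)
rollout = checkpointed_self_forcing_rollout(model, x0, horizon=16)
loss = rollout.pow(2).mean()
loss.backward()  # per-step activations are recomputed here, not cached
```

Under this reading, the memory saved on activations is what allows the teacher's rollout horizon to grow in the final training stage.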
Community
Solaris develops a multiplayer video world model for Minecraft, enabling coordinated multi-agent observation, scalable data collection, and a staged training pipeline with memory-efficient Checkpointed Self Forcing that outperforms existing baselines.
The following similar papers were recommended by the Semantic Scholar API (via Librarian Bot):
- LIVE: Long-horizon Interactive Video World Modeling (2026)
- VideoWorld 2: Learning Transferable Knowledge from Real-world Videos (2026)
- Causal World Modeling for Robot Control (2026)
- EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation (2026)
- NitroGen: An Open Foundation Model for Generalist Gaming Agents (2026)
- VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction (2026)
- Reward-Forcing: Autoregressive Video Generation with Reward Feedback (2026)