Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models Paper • 2606.11409 • Published 15 days ago • 9
Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance? Paper • 2606.12250 • Published 14 days ago
VLM-RobustBench: A Comprehensive Benchmark for Robustness of Vision-Language Models Paper • 2603.06148 • Published Mar 6 • 2
SCOPE: Self-Play via Co-Evolving Policies for Open-Ended Tasks Paper • 2605.31433 • Published 26 days ago • 28
Scaling Low-Resource MT via Synthetic Data Generation with LLMs Paper • 2505.14423 • Published May 20, 2025 • 2
Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning Paper • 2601.18722 • Published Jan 26
EvoClaw: Evaluating AI Agents on Continuous Software Evolution Paper • 2603.13428 • Published Mar 13 • 21