DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation
Abstract
DR^{3}-Eval is a benchmark for evaluating deep research agents on multimodal, multi-file report generation, featuring a realistic simulation of web environments and a comprehensive evaluation framework.
Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR^{3}-Agent based on multiple state-of-the-art language models demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.
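The sandbox idea in the abstract (a fixed per-task corpus mixing supportive documents, distractors, and noise) can be illustrated with a small sketch. This is not the paper's code or its metric definitions, which are not given on this page. The names `Role`, `Doc`, and `retrieval_stats`, and the recall/precision scoring, are assumptions chosen only to show why a static, labeled corpus makes retrieval behavior verifiable.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    # Hypothetical labels mirroring the three document kinds named in the abstract.
    SUPPORTIVE = "supportive"
    DISTRACTOR = "distractor"
    NOISE = "noise"


@dataclass(frozen=True)
class Doc:
    doc_id: str
    role: Role


def retrieval_stats(corpus, retrieved_ids):
    """Score one retrieval run against a static, labeled sandbox corpus.

    Assumed scoring (not the paper's): recall is the share of supportive
    documents retrieved; precision is the share of retrieved documents
    that are supportive. IDs not present in the corpus are ignored.
    """
    by_id = {d.doc_id: d for d in corpus}
    retrieved = {by_id[i] for i in set(retrieved_ids) if i in by_id}
    supportive = {d for d in corpus if d.role is Role.SUPPORTIVE}
    hits = retrieved & supportive
    recall = len(hits) / len(supportive) if supportive else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return {"supportive_recall": recall, "supportive_precision": precision}


# Toy corpus: two supportive documents, one distractor, one noise document.
corpus = [
    Doc("s1", Role.SUPPORTIVE),
    Doc("s2", Role.SUPPORTIVE),
    Doc("d1", Role.DISTRACTOR),
    Doc("n1", Role.NOISE),
]
stats = retrieval_stats(corpus, ["s1", "d1", "not-in-corpus"])
```

Because the corpus is fixed and every document carries a known role, the same retrieval run always yields the same score, which is the reproducibility property the abstract contrasts with live web search.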
Community
Interesting breakdown of this paper on arXivLens: https://arxivlens.com/PaperView/Details/dr-3-eval-towards-realistic-and-reproducible-deep-research-evaluation-74-0eb2f238
Covers the executive summary, detailed methodology, and practical applications.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- DREAM: Deep Research Evaluation with Agentic Metrics (2026)
- Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research (2026)
- MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains (2026)
- Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation (2026)
- Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design (2026)
- ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation (2026)
- PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers (2026)
Get this paper in your agent:
hf papers read 2604.14683
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash