---
title: SearchAgent_Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: A standardized leaderboard for search agents
sdk_version: 4.23.0
tags:
- leaderboard
---
# Overview
SearchAgent Leaderboard provides a simple, standardized way to compare search-augmented QA agents across:
- General QA: NQ, TriviaQA, PopQA
- Multi-hop QA: HotpotQA, 2wiki, Musique, Bamboogle
- Novel closed-world: FictionalHot
We display a minimal set of columns for clarity:
- Rank, Model, Average, per-dataset scores, Model Size (3B/7B)
# Data format (results)
Place model result files in `eval-results/` as JSON. Scores are decimals in [0,1] (the UI multiplies by 100).
```json
{
  "config": {
    "model_dtype": "torch.float16",
    "model_name": "YourMethod-Qwen2.5-7b-Instruct",
    "model_sha": "main"
  },
  "results": {
    "nq": { "exact_match": 0.469 },
    "triviaqa": { "exact_match": 0.640 },
    "popqa": { "exact_match": 0.501 },
    "hotpotqa": { "exact_match": 0.389 },
    "2wiki": { "exact_match": 0.382 },
    "musique": { "exact_match": 0.185 },
    "bamboogle": { "exact_match": 0.392 },
    "fictionalhot": { "exact_match": 0.061 }
  }
}
```
Notes:
- `model_name` uses the format `Method-Qwen2.5-{3b|7b}-Instruct` (no org prefix required)
- Tasks: `nq`, `triviaqa`, `popqa`, `hotpotqa`, `2wiki`, `musique`, `bamboogle`, `fictionalhot`
- Metric key: `exact_match`
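As a minimal sketch of the format above, a submission file can be checked before placing it in `eval-results/`. The task keys and `exact_match` metric are taken from this README; treating the Average column as an unweighted mean of per-task scores (times 100) is an assumption, and `validate_result` is an illustrative helper, not part of the Space's code.

```python
import json
from pathlib import Path

# Task keys and the metric name come from the README; treating "Average"
# as an unweighted mean of per-task scores is an assumption.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]

def validate_result(path):
    """Load one eval-results JSON file; return (model_name, average in %)."""
    data = json.loads(Path(path).read_text())
    scores = []
    for task in TASKS:
        score = data["results"][task]["exact_match"]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{task}: {score} is outside [0, 1]")
        scores.append(score)
    average = 100 * sum(scores) / len(scores)  # the UI multiplies by 100
    return data["config"]["model_name"], average
```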
# Submission (via Community)
We accept submissions via the Space Community (Discussions):
1) Open the Space page and go to Community: `https://huggingface.co/spaces/TencentBAC/SearchAgent_Leaderboard`
2) Create a discussion with title `Submission: <YourMethod>-<model_name>-<model_size>`
3) Include:
- Model weights link (HF or GitHub)
- Short method description
- Evaluation JSON (inline or attached)
# Local development
Run locally (example):
```bash
python app.py
```
The app reads local data only (no remote download) from:
- Results: `./eval-results`
- (Optional) Requests: `./eval-queue` (not required for the simplified table)
If dependencies are missing, install the minimal set:
```bash
pip install gradio gradio_leaderboard pandas huggingface_hub apscheduler
```
# Customize
- Tasks and page texts: `src/about.py`
- Displayed columns: `src/display/utils.py` (we keep Rank, Model, Average, per-dataset, Model Size)
- Custom model links (name→URL mapping): `src/display/formatting.py` (`custom_links` dict)
- Data loading and ranking: `src/leaderboard/read_evals.py`, `src/populate.py`
Restart the app after changes.
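For the custom model links above, a hypothetical sketch of the name→URL mapping follows. The README only states that `src/display/formatting.py` holds a `custom_links` dict; the exact keys and the rendering helper shown here (`model_hyperlink`) are illustrative assumptions, not the Space's actual code.

```python
# Hypothetical sketch: the actual structure in src/display/formatting.py
# may differ. Keys are model names as shown in the table; values are URLs.
custom_links = {
    "YourMethod-Qwen2.5-7b-Instruct": "https://github.com/your-org/your-method",
}

def model_hyperlink(model_name: str) -> str:
    """Render a model name as an HTML link when a custom URL is registered."""
    url = custom_links.get(model_name)
    if url is None:
        return model_name  # plain text when no link is registered
    return f'<a href="{url}" target="_blank">{model_name}</a>'
```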