shyuli committed on
Commit
b1b6669
·
1 Parent(s): 2354057

version v0.1

Files changed (1)
  1. README.md +91 -0
README.md ADDED
@@ -0,0 +1,91 @@
+ title: SearchAgent Leaderboard
+ emoji: 🥇
+ colorFrom: green
+ colorTo: indigo
+ sdk: gradio
+ app_file: app.py
+ pinned: true
+ license: apache-2.0
+ short_description: A standardized leaderboard for search-augmented QA agents (NQ/TriviaQA/PopQA/HotpotQA/2wiki/Musique/Bamboogle/FictionalHot)
+ sdk_version: 5.43.1
+ tags:
+ - leaderboard
+ ---
+
+ # Overview
+
+ SearchAgent Leaderboard provides a simple, standardized way to compare search-augmented QA agents across three task groups:
+ - General QA: NQ, TriviaQA, PopQA
+ - Multi-hop QA: HotpotQA, 2wiki, Musique, Bamboogle
+ - Novel closed-world: FictionalHot
+
+ The table shows a minimal set of columns for clarity:
+ - Rank, Model, Average, per-dataset scores, Model Size (3B/7B)
+
+
+ # Data format (results)
+
+ Place model result files in `eval-results/` as JSON. Scores are decimals in [0, 1]; the UI multiplies them by 100 for display.
+
+ ```json
+ {
+   "config": {
+     "model_dtype": "torch.float16",
+     "model_name": "YourMethod-Qwen2.5-7b-Instruct",
+     "model_sha": "main"
+   },
+   "results": {
+     "nq": { "exact_match": 0.469 },
+     "triviaqa": { "exact_match": 0.640 },
+     "popqa": { "exact_match": 0.501 },
+     "hotpotqa": { "exact_match": 0.389 },
+     "2wiki": { "exact_match": 0.382 },
+     "musique": { "exact_match": 0.185 },
+     "bamboogle": { "exact_match": 0.392 },
+     "fictionalhot": { "exact_match": 0.061 }
+   }
+ }
+ ```
+
+ Notes:
+ - `model_name` uses the format `Method-Qwen2.5-{3b|7b}-Instruct` (no org prefix required)
+ - Tasks: `nq`, `triviaqa`, `popqa`, `hotpotqa`, `2wiki`, `musique`, `bamboogle`, `fictionalhot`
+ - Metric key: `exact_match`
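
The format rules above can be checked before a file is placed in `eval-results/`. The following is a minimal sketch, not part of the Space's code: `validate_result` is a hypothetical helper, and it assumes every one of the eight tasks must be present with an `exact_match` score in [0, 1].

```python
# Task and metric names are taken from the notes above.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]
METRIC = "exact_match"


def validate_result(payload: dict) -> list:
    """Return a list of problems found in a result payload (empty if none).

    Hypothetical helper: assumes all eight tasks are required.
    """
    problems = []
    config = payload.get("config")
    if not isinstance(config, dict) or not config.get("model_name"):
        problems.append("config.model_name is missing")
    results = payload.get("results")
    if not isinstance(results, dict):
        results = {}
    for task in TASKS:
        entry = results.get(task)
        score = entry.get(METRIC) if isinstance(entry, dict) else None
        if score is None:
            problems.append(f"{task}: missing {METRIC}")
        elif isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"{task}: {METRIC} must be a number")
        elif not 0.0 <= score <= 1.0:
            # A common slip is submitting percentages (46.9) instead of decimals (0.469).
            problems.append(f"{task}: {METRIC}={score} is outside [0, 1]")
    return problems


example = {
    "config": {
        "model_dtype": "torch.float16",
        "model_name": "YourMethod-Qwen2.5-7b-Instruct",
        "model_sha": "main",
    },
    "results": {
        "nq": {"exact_match": 0.469},
        "triviaqa": {"exact_match": 0.640},
        "popqa": {"exact_match": 0.501},
        "hotpotqa": {"exact_match": 0.389},
        "2wiki": {"exact_match": 0.382},
        "musique": {"exact_match": 0.185},
        "bamboogle": {"exact_match": 0.392},
        "fictionalhot": {"exact_match": 0.061},
    },
}

print(validate_result(example))
```

For a file on disk, the payload can be loaded with `json.load` and passed to the same function.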
+
+
+ # Submission (via Community)
+
+ We accept submissions via the Space Community (Discussions):
+ 1) Open the Space page and go to the Community tab: `https://huggingface.co/spaces/TencentBAC/SearchAgent_Leaderboard`
+ 2) Create a discussion titled `Submission: <YourMethod>-<model_name>-<model_size>`
+ 3) Include:
+    - Model weights link (Hugging Face or GitHub)
+    - Short method description
+    - Evaluation JSON (inline or attached)
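
The discussion title can be derived from the `model_name` in the result file. The sketch below is illustrative only: `submission_title` is a hypothetical helper, and it assumes that `<model_name>` in the title refers to the base model (`Qwen2.5`) and that names follow the `Method-Qwen2.5-{3b|7b}-Instruct` format exactly.

```python
import re

# Pattern for the documented format Method-Qwen2.5-{3b|7b}-Instruct.
# The method part may itself contain hyphens.
NAME_PATTERN = re.compile(
    r"^(?P<method>.+)-(?P<base>Qwen2\.5)-(?P<size>3b|7b)-Instruct$",
    re.IGNORECASE,
)


def submission_title(model_name: str) -> str:
    """Build a discussion title from a model_name (hypothetical helper)."""
    match = NAME_PATTERN.match(model_name)
    if match is None:
        raise ValueError(f"unexpected model_name format: {model_name!r}")
    return f"Submission: {match['method']}-{match['base']}-{match['size']}"


print(submission_title("YourMethod-Qwen2.5-7b-Instruct"))
# prints: Submission: YourMethod-Qwen2.5-7b
```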
+
+
+ # Local development
+
+ Run locally (example):
+ ```bash
+ python app.py
+ ```
+
+ The app reads local data only (no remote download) from:
+ - Results: `./eval-results`
+ - (Optional) Requests: `./eval-queue` (not required for the simplified table)
+
+ If dependencies are missing, install the minimal set:
+ ```bash
+ pip install gradio gradio_leaderboard pandas huggingface_hub apscheduler
+ ```
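
To see which of these packages are missing before starting the app, a quick check along the following lines may help. It is a sketch that assumes each package's import name matches the name in the `pip install` command above; `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

# Assumed import names, taken from the pip command above.
REQUIRED = ["gradio", "gradio_leaderboard", "pandas", "huggingface_hub", "apscheduler"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All listed dependencies are importable.")
```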
+
+
+ # Customize
+
+ - Tasks and page texts: `src/about.py`
+ - Displayed columns: `src/display/utils.py` (we keep Rank, Model, Average, per-dataset scores, Model Size)
+ - Custom model links (name→URL mapping): `src/display/formatting.py` (`custom_links` dict)
+ - Data loading and ranking: `src/leaderboard/read_evals.py`, `src/populate.py`
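
For orientation when editing the loading and ranking code, the overall flow can be pictured roughly as below. This is a standalone sketch and not the code in `src/leaderboard/read_evals.py` or `src/populate.py`. It assumes that Average is the unweighted mean of the eight per-dataset scores, that rows are ranked by descending Average, and that result files are found recursively; all function names are hypothetical.

```python
import json
from pathlib import Path

TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]


def build_row(payload):
    """Turn one result payload into a table row with scores scaled to 0-100."""
    scores = {task: payload["results"][task]["exact_match"] * 100 for task in TASKS}
    row = {"Model": payload["config"]["model_name"], **scores}
    # Assumption: Average is the plain mean over all eight tasks.
    row["Average"] = sum(scores.values()) / len(TASKS)
    return row


def rank_rows(rows):
    """Sort rows by Average (highest first) and attach a 1-based Rank."""
    ordered = sorted(rows, key=lambda row: row["Average"], reverse=True)
    for rank, row in enumerate(ordered, start=1):
        row["Rank"] = rank
    return ordered


def load_rows(results_dir="eval-results"):
    """Read every JSON file under results_dir and return ranked rows."""
    rows = []
    for path in sorted(Path(results_dir).glob("**/*.json")):
        payload = json.loads(path.read_text(encoding="utf-8"))
        rows.append(build_row(payload))
    return rank_rows(rows)
```

Calling `load_rows()` from the repository root would then give one dictionary per model, ready to be turned into a table.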
+
+ Restart the app after making changes so they take effect.