SearchAgent_Leaderboard


shyuli committed on Sep 29

Commit

b1b6669

1 Parent(s): 2354057

version v0.1

Files changed (1)

README.md +91 -0

README.md ADDED

	@@ -0,0 +1,91 @@

+---
+title: SearchAgent Leaderboard
+emoji: 🥇
+colorFrom: green
+colorTo: indigo
+sdk: gradio
+app_file: app.py
+pinned: true
+license: apache-2.0
+short_description: A standardized leaderboard for search-augmented QA agents (NQ/TriviaQA/PopQA/HotpotQA/2wiki/Musique/Bamboogle/FictionalHot)
+sdk_version: 5.43.1
+tags:
+  - leaderboard
+---
+# Overview
+SearchAgent Leaderboard provides a simple, standardized way to compare search-augmented QA agents across:
+- General QA: NQ, TriviaQA, PopQA
+- Multi-hop QA: HotpotQA, 2wiki, Musique, Bamboogle
+- Novel closed-world: FictionalHot
+The table shows a minimal set of columns for clarity: Rank, Model, Average, per-dataset scores, and Model Size (3B/7B).
+# Data format (results)
+Place model result files in `eval-results/` as JSON. Scores are decimals in [0, 1]; the UI multiplies them by 100 for display.
+```json
+{
+  "config": {
+    "model_dtype": "torch.float16",
+    "model_name": "YourMethod-Qwen2.5-7b-Instruct",
+    "model_sha": "main"
+  },
+  "results": {
+    "nq": { "exact_match": 0.469 },
+    "triviaqa": { "exact_match": 0.640 },
+    "popqa": { "exact_match": 0.501 },
+    "hotpotqa": { "exact_match": 0.389 },
+    "2wiki": { "exact_match": 0.382 },
+    "musique": { "exact_match": 0.185 },
+    "bamboogle": { "exact_match": 0.392 },
+    "fictionalhot": { "exact_match": 0.061 }
+  }
+}
+```
+Notes:
+- `model_name` uses the format `Method-Qwen2.5-{3b|7b}-Instruct` (no org prefix required)
+- Tasks: `nq`, `triviaqa`, `popqa`, `hotpotqa`, `2wiki`, `musique`, `bamboogle`, `fictionalhot`
+- Metric key: `exact_match`
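The notes above can be checked mechanically before a file is dropped into `eval-results/`. The following is a minimal sketch, not part of the committed README or of the app's own loader: the function name `validate_result` and the exact error messages are hypothetical, and the `model_name` pattern is only a literal reading of the format stated in the notes.

```python
import re

# Task names and metric key exactly as listed in the README notes.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]
METRIC = "exact_match"

# Literal reading of `Method-Qwen2.5-{3b|7b}-Instruct` (an assumption about
# how strictly the leaderboard interprets the format).
NAME_PATTERN = re.compile(r".+-Qwen2\.5-(3b|7b)-Instruct")


def validate_result(data: dict) -> list[str]:
    """Return a list of problems found in one result dict (empty means OK)."""
    problems = []

    name = data.get("config", {}).get("model_name")
    if not name:
        problems.append("config.model_name is missing")
    elif not NAME_PATTERN.fullmatch(name):
        problems.append(f"config.model_name '{name}' does not match the expected format")

    results = data.get("results", {})
    for task in TASKS:
        entry = results.get(task)
        if not isinstance(entry, dict) or METRIC not in entry:
            problems.append(f"results.{task}.{METRIC} is missing")
            continue
        score = entry[METRIC]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"results.{task}.{METRIC} is not a number")
        elif not 0.0 <= score <= 1.0:
            problems.append(f"results.{task}.{METRIC}={score} is outside [0, 1]")

    for task in results:
        if task not in TASKS:
            problems.append(f"results.{task} is not a known task")
    return problems
```

A typical mistake this catches is submitting percentages (for example `46.9`) instead of decimals (`0.469`), which the UI would then multiply by 100 again.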
+# Submission (via Community)
+We accept submissions via the Space Community (Discussions):
+1) Open the Space page and go to Community: `https://huggingface.co/spaces/TencentBAC/SearchAgent_Leaderboard`
+2) Create a discussion with title `Submission: <YourMethod>-<model_name>-<model_size>`
+3) Include:
+   - Model weights link (HF or GitHub)
+   - Short method description
+   - Evaluation JSON (inline or attached)
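For the evaluation JSON requested in step 3, a small helper can assemble the structure shown under "Data format (results)". This is an illustrative sketch only: `build_result` is a hypothetical name, and the default `model_dtype` and `model_sha` values are simply copied from the README example rather than confirmed requirements.

```python
import json

# Task names as listed in the README notes.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]


def build_result(model_name: str, scores: dict,
                 model_dtype: str = "torch.float16",
                 model_sha: str = "main") -> dict:
    """Assemble a result dict in the README's format from per-task scores.

    `scores` maps each task name to an exact-match score as a decimal in [0, 1].
    """
    missing = [t for t in TASKS if t not in scores]
    if missing:
        raise ValueError(f"missing scores for: {', '.join(missing)}")
    return {
        "config": {
            "model_dtype": model_dtype,
            "model_name": model_name,
            "model_sha": model_sha,
        },
        "results": {t: {"exact_match": float(scores[t])} for t in TASKS},
    }


def to_json(result: dict) -> str:
    """Serialize a result dict for pasting into a discussion or saving to a file."""
    return json.dumps(result, indent=2)
```

Passing the output of `to_json` into the discussion body, or saving it as a `.json` attachment, covers the "inline or attached" options above.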
+# Local development
+Run locally (example):
+```bash
+python app.py
+```
+The app reads local data only (no remote download) from:
+- Results: `./eval-results`
+- (Optional) Requests: `./eval-queue` (not required for the simplified table)
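To preview roughly what the table would contain without starting the app, the result files can be read directly. The sketch below is an assumption-laden stand-in, not the logic in `src/leaderboard/read_evals.py`: it assumes every `*.json` file under the results directory is one model's result, that Average is the unweighted mean of the tasks present in the file, and that values are rounded to one decimal after multiplying by 100.

```python
import json
from pathlib import Path

# Task names as listed in the README notes.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]


def load_leaderboard(results_dir: str = "./eval-results") -> list[dict]:
    """Read local result files and return rows sorted by Average (descending)."""
    rows = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        results = data.get("results", {})
        # Convert [0, 1] decimals to percentages, mirroring what the UI displays.
        scores = {t: results[t]["exact_match"] * 100 for t in TASKS if t in results}
        if not scores:
            continue  # skip files with no recognized task
        row = {"Model": data["config"]["model_name"]}
        row.update({t: round(s, 1) for t, s in scores.items()})
        # Assumption: Average is the unweighted mean over the tasks present.
        row["Average"] = round(sum(scores.values()) / len(scores), 1)
        rows.append(row)

    rows.sort(key=lambda r: r["Average"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["Rank"] = rank
    return rows
```

Calling `load_leaderboard()` from the repository root and printing the rows gives a quick sanity check of new result files before restarting the app.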
+If dependencies are missing, install the minimal set:
+```bash
+pip install gradio gradio_leaderboard pandas huggingface_hub apscheduler
+```
+# Customize
+- Tasks and page texts: `src/about.py`
+- Displayed columns: `src/display/utils.py` (we keep Rank, Model, Average, per-dataset scores, Model Size)
+- Custom model links (name→URL mapping): `src/display/formatting.py` (`custom_links` dict)
+- Data loading and ranking: `src/leaderboard/read_evals.py`, `src/populate.py`
+Restart the app after changes.