SearchAgent_Leaderboard

shyuli

version v0.1

eae2329 2 months ago

metadata

title: SearchAgent_Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: A standardized leaderboard for search agents
sdk_version: 4.23.0
tags:
  - leaderboard

Overview

SearchAgent Leaderboard provides a simple, standardized way to compare search-augmented QA agents across:

General QA: NQ, TriviaQA, PopQA
Multi-hop QA: HotpotQA, 2WikiMultiHopQA (2wiki), MuSiQue, Bamboogle
Novel closed-world: FictionalHot

We display a minimal set of columns for clarity:

Rank, Model, Average, per-dataset scores, Model Size (3B/7B)

Data format (results)

Place each model's result file in eval-results/ as JSON. Scores are decimals in [0, 1]; the UI multiplies them by 100 for display.

{
  "config": {
    "model_dtype": "torch.float16",
    "model_name": "YourMethod-Qwen2.5-7b-Instruct",
    "model_sha": "main"
  },
  "results": {
    "nq": { "exact_match": 0.469 },
    "triviaqa": { "exact_match": 0.640 },
    "popqa": { "exact_match": 0.501 },
    "hotpotqa": { "exact_match": 0.389 },
    "2wiki": { "exact_match": 0.382 },
    "musique": { "exact_match": 0.185 },
    "bamboogle": { "exact_match": 0.392 },
    "fictionalhot": { "exact_match": 0.061 }
  }
}

Notes:

model_name uses the format Method-Qwen2.5-{3b|7b}-Instruct (no org prefix required)
Tasks: nq, triviaqa, popqa, hotpotqa, 2wiki, musique, bamboogle, fictionalhot
Metric key: exact_match
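Before submitting, it can help to check a result file against the notes above. The following is a minimal sketch, not the leaderboard's own validation: the function name validate_result and the regular expression are assumptions, and the check requires all eight tasks, which may be stricter than the app itself.

```python
import re

# Task names and metric key as listed in the notes above.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa", "2wiki", "musique", "bamboogle", "fictionalhot"]
METRIC = "exact_match"
# Assumed reading of the "Method-Qwen2.5-{3b|7b}-Instruct" naming convention.
NAME_RE = re.compile(r"^.+-Qwen2\.5-(3b|7b)-Instruct$")


def validate_result(data: dict) -> list[str]:
    """Return a list of problems found in one result dict (empty if none)."""
    problems = []
    name = data.get("config", {}).get("model_name", "")
    if not NAME_RE.match(name):
        problems.append(f"model_name {name!r} does not match Method-Qwen2.5-{{3b|7b}}-Instruct")
    results = data.get("results", {})
    for task in TASKS:
        entry = results.get(task)
        score = entry.get(METRIC) if isinstance(entry, dict) else None
        if score is None:
            problems.append(f"missing {task}.{METRIC}")
        elif isinstance(score, bool) or not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            # Catches percentages such as 46.9 that should have been written as 0.469.
            problems.append(f"{task}.{METRIC}={score!r} is not a number in [0, 1]")
    return problems
```

Load a file with json.load and pass the resulting dict to validate_result; an empty list means the file matches the format described here.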

Submission (via Community)

We accept submissions via the Space Community (Discussions):

Open the Space page and go to Community: https://huggingface.co/spaces/TencentBAC/SearchAgent_Leaderboard
Create a discussion titled Submission: <YourMethod>-<model_name>-<model_size>
Include:
- Model weights link (HF or GitHub)
- Short method description
- Evaluation JSON (inline or attached)

Local development

Run locally (example):

python app.py

The app reads local data only (no remote download) from:

Results: ./eval-results
(Optional) Requests: ./eval-queue (not required for the simplified table)
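To illustrate how local result files could turn into table rows, here is a rough sketch using only the standard library. It is not the code in src/leaderboard/read_evals.py or src/populate.py: the function name load_rows, the recursive file search, the unweighted mean for Average, and ranking by Average are all assumptions.

```python
import json
from pathlib import Path


def load_rows(results_dir: str = "./eval-results") -> list[dict]:
    """Read every *.json file under results_dir and return ranked table rows."""
    rows = []
    # Assumption: result files may sit in subfolders, so search recursively.
    for path in sorted(Path(results_dir).rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Convert [0, 1] scores to percentages, as the UI displays them.
        scores = {task: 100.0 * entry["exact_match"] for task, entry in data["results"].items()}
        if not scores:
            continue
        row = {"Model": data["config"]["model_name"], **scores}
        # Assumption: Average is the unweighted mean over the reported datasets.
        row["Average"] = sum(scores.values()) / len(scores)
        rows.append(row)
    # Assumption: rank by Average, highest first.
    rows.sort(key=lambda r: r["Average"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["Rank"] = rank
    return rows
```

Calling load_rows() from the repository root returns one dict per model, which is a quick way to sanity-check new files before starting the app.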

If dependencies are missing, install the minimal set:

pip install gradio gradio_leaderboard pandas huggingface_hub apscheduler

Customize

Tasks and page texts: src/about.py
Displayed columns: src/display/utils.py (we keep Rank, Model, Average, per-dataset, Model Size)
Custom model links (name→URL mapping): src/display/formatting.py (custom_links dict)
Data loading and ranking: src/leaderboard/read_evals.py, src/populate.py
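As an illustration of the name→URL mapping, a custom_links dict could be used roughly as follows. This is a hypothetical sketch: the entry, the URL, and the helper model_cell are made up for the example, and the actual code in src/display/formatting.py may be structured differently.

```python
import html

# Hypothetical entry; replace with your own model name and URL.
custom_links = {
    "YourMethod-Qwen2.5-7b-Instruct": "https://example.com/your-method",
}


def model_cell(model_name: str) -> str:
    """Return an HTML link if the model has a custom URL, else the plain name."""
    url = custom_links.get(model_name)
    if url is None:
        return html.escape(model_name)
    return f'<a target="_blank" href="{html.escape(url, quote=True)}">{html.escape(model_name)}</a>'
```

Models without an entry fall back to plain text, so the mapping only needs the models you want linked.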

Restart the app after changes.