Phase 3 Implementation Spec: Judge Vertical Slice
Goal: Implement the "Brain" of the agent: evaluating evidence quality.
Philosophy: "Structured Output or Bust."
Prerequisite: Phase 2 complete (all search tests passing).
1. The Slice Definition
This slice covers:
- Input: A user question + a list of `Evidence` objects (from Phase 2).
- Process:
  - Construct a prompt with the evidence.
  - Call the LLM (PydanticAI / OpenAI / Anthropic).
  - Force JSON structured output.
- Output: A `JudgeAssessment` object.

Files to Create:

- `src/utils/models.py` - Add JudgeAssessment models (extend from Phase 2)
- `src/prompts/judge.py` - Judge prompt templates
- `src/agent_factory/judges.py` - JudgeHandler with PydanticAI
- `tests/unit/agent_factory/test_judges.py` - Unit tests
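The sketch below illustrates the slice's data flow: a question plus Phase 2 evidence goes in, a structured assessment comes out. It assumes the `Evidence` model from Phase 2 and the `JudgeHandler` defined in section 4; `judge_slice` is a hypothetical wrapper for illustration, not one of the deliverables.

```python
# Illustrative sketch of the slice (not a file to create):
# question + evidence in, structured JudgeAssessment out.
from typing import List

from src.utils.models import Evidence, JudgeAssessment
from src.agent_factory.judges import JudgeHandler


async def judge_slice(question: str, evidence: List[Evidence]) -> JudgeAssessment:
    handler = JudgeHandler()                          # wraps a PydanticAI Agent
    return await handler.assess(question, evidence)  # LLM call with JSON-constrained output
```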
2. Models (Add to src/utils/models.py)
The output schema must be strict for reliable structured output.
"""Add these models to src/utils/models.py (after Evidence models from Phase 2)."""
from pydantic import BaseModel, Field
from typing import List, Literal
class AssessmentDetails(BaseModel):
"""Detailed assessment of evidence quality."""
mechanism_score: int = Field(
...,
ge=0,
le=10,
description="How well does the evidence explain the mechanism? 0-10"
)
mechanism_reasoning: str = Field(
...,
min_length=10,
description="Explanation of mechanism score"
)
clinical_evidence_score: int = Field(
...,
ge=0,
le=10,
description="Strength of clinical/preclinical evidence. 0-10"
)
clinical_reasoning: str = Field(
...,
min_length=10,
description="Explanation of clinical evidence score"
)
drug_candidates: List[str] = Field(
default_factory=list,
description="List of specific drug candidates mentioned"
)
key_findings: List[str] = Field(
default_factory=list,
description="Key findings from the evidence"
)
class JudgeAssessment(BaseModel):
"""Complete assessment from the Judge."""
details: AssessmentDetails
sufficient: bool = Field(
...,
description="Is evidence sufficient to provide a recommendation?"
)
confidence: float = Field(
...,
ge=0.0,
le=1.0,
description="Confidence in the assessment (0-1)"
)
recommendation: Literal["continue", "synthesize"] = Field(
...,
description="continue = need more evidence, synthesize = ready to answer"
)
next_search_queries: List[str] = Field(
default_factory=list,
description="If continue, what queries to search next"
)
reasoning: str = Field(
...,
min_length=20,
description="Overall reasoning for the recommendation"
)
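As a quick, optional sanity check (not part of the spec files), the strict constraints above can be exercised directly with Pydantic: an out-of-range score or a too-short reasoning string is rejected instead of silently accepted.

```python
# Optional sanity check: strict field constraints reject malformed "LLM output".
from pydantic import ValidationError

from src.utils.models import JudgeAssessment

try:
    JudgeAssessment.model_validate({
        "details": {
            "mechanism_score": 11,        # violates le=10
            "mechanism_reasoning": "ok",  # violates min_length=10
            "clinical_evidence_score": 5,
            "clinical_reasoning": "Early preclinical data only.",
        },
        "sufficient": False,
        "confidence": 0.4,
        "recommendation": "continue",
        "reasoning": "Needs more mechanistic and clinical evidence.",
    })
except ValidationError as exc:
    print(exc.error_count(), "validation errors")  # expect 2
```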
3. Prompt Engineering (src/prompts/judge.py)
We treat prompts as code. They should be versioned and clean.
"""Judge prompts for evidence assessment."""
from typing import List
from src.utils.models import Evidence
SYSTEM_PROMPT = """You are an expert drug repurposing research judge.
Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to recommend drug candidates for a given condition.
## Evaluation Criteria
1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
- 0-3: No clear mechanism, speculative
- 4-6: Some mechanistic insight, but gaps exist
- 7-10: Clear, well-supported mechanism of action
2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
- 0-3: No clinical data, only theoretical
- 4-6: Preclinical or early clinical data
- 7-10: Strong clinical evidence (trials, meta-analyses)
3. **Sufficiency**: Evidence is sufficient when:
- Combined scores >= 12 AND
- At least one specific drug candidate identified AND
- Clear mechanistic rationale exists
## Output Rules
- Always output valid JSON matching the schema
- Be conservative: only recommend "synthesize" when truly confident
- If continuing, suggest specific, actionable search queries
- Never hallucinate drug names or findings not in the evidence
"""
def format_user_prompt(question: str, evidence: List[Evidence]) -> str:
"""
Format the user prompt with question and evidence.
Args:
question: The user's research question
evidence: List of Evidence objects from search
Returns:
Formatted prompt string
"""
evidence_text = "\n\n".join([
f"### Evidence {i+1}\n"
f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
f"**URL**: {e.citation.url}\n"
f"**Date**: {e.citation.date}\n"
f"**Content**:\n{e.content[:1500]}..."
if len(e.content) > 1500 else
f"### Evidence {i+1}\n"
f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
f"**URL**: {e.citation.url}\n"
f"**Date**: {e.citation.date}\n"
f"**Content**:\n{e.content}"
for i, e in enumerate(evidence)
])
return f"""## Research Question
{question}
## Available Evidence ({len(evidence)} sources)
{evidence_text}
## Your Task
Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
Respond with a JSON object matching the JudgeAssessment schema.
"""
def format_empty_evidence_prompt(question: str) -> str:
"""
Format prompt when no evidence was found.
Args:
question: The user's research question
Returns:
Formatted prompt string
"""
return f"""## Research Question
{question}
## Available Evidence
No evidence was found from the search.
## Your Task
Since no evidence was found, recommend search queries that might yield better results.
Set sufficient=False and recommendation="continue".
Suggest 3-5 specific search queries.
"""
4. JudgeHandler Implementation (src/agent_factory/judges.py)
Using PydanticAI for structured output with retry logic.
"""Judge handler for evidence assessment using PydanticAI."""
import os
from typing import List
import structlog
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.models.anthropic import AnthropicModel
from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails
from src.utils.config import settings
from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt
logger = structlog.get_logger()
def get_model():
"""Get the LLM model based on configuration."""
provider = getattr(settings, "llm_provider", "openai")
if provider == "anthropic":
return AnthropicModel(
model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"),
api_key=os.getenv("ANTHROPIC_API_KEY"),
)
else:
return OpenAIModel(
model_name=getattr(settings, "openai_model", "gpt-4o"),
api_key=os.getenv("OPENAI_API_KEY"),
)
class JudgeHandler:
"""
Handles evidence assessment using an LLM with structured output.
Uses PydanticAI to ensure responses match the JudgeAssessment schema.
"""
def __init__(self, model=None):
"""
Initialize the JudgeHandler.
Args:
model: Optional PydanticAI model. If None, uses config default.
"""
self.model = model or get_model()
self.agent = Agent(
model=self.model,
result_type=JudgeAssessment,
system_prompt=SYSTEM_PROMPT,
retries=3,
)
async def assess(
self,
question: str,
evidence: List[Evidence],
) -> JudgeAssessment:
"""
Assess evidence and determine if it's sufficient.
Args:
question: The user's research question
evidence: List of Evidence objects from search
Returns:
JudgeAssessment with evaluation results
Raises:
JudgeError: If assessment fails after retries
"""
logger.info(
"Starting evidence assessment",
question=question[:100],
evidence_count=len(evidence),
)
# Format the prompt based on whether we have evidence
if evidence:
user_prompt = format_user_prompt(question, evidence)
else:
user_prompt = format_empty_evidence_prompt(question)
try:
# Run the agent with structured output
result = await self.agent.run(user_prompt)
assessment = result.data
logger.info(
"Assessment complete",
sufficient=assessment.sufficient,
recommendation=assessment.recommendation,
confidence=assessment.confidence,
)
return assessment
except Exception as e:
logger.error("Assessment failed", error=str(e))
# Return a safe default assessment on failure
return self._create_fallback_assessment(question, str(e))
def _create_fallback_assessment(
self,
question: str,
error: str,
) -> JudgeAssessment:
"""
Create a fallback assessment when LLM fails.
Args:
question: The original question
error: The error message
Returns:
Safe fallback JudgeAssessment
"""
return JudgeAssessment(
details=AssessmentDetails(
mechanism_score=0,
mechanism_reasoning="Assessment failed due to LLM error",
clinical_evidence_score=0,
clinical_reasoning="Assessment failed due to LLM error",
drug_candidates=[],
key_findings=[],
),
sufficient=False,
confidence=0.0,
recommendation="continue",
next_search_queries=[
f"{question} mechanism",
f"{question} clinical trials",
f"{question} drug candidates",
],
reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
)
class MockJudgeHandler:
"""
Mock JudgeHandler for testing without LLM calls.
Use this in unit tests to avoid API calls.
"""
def __init__(self, mock_response: JudgeAssessment | None = None):
"""
Initialize with optional mock response.
Args:
mock_response: The assessment to return. If None, uses default.
"""
self.mock_response = mock_response
self.call_count = 0
self.last_question = None
self.last_evidence = None
async def assess(
self,
question: str,
evidence: List[Evidence],
) -> JudgeAssessment:
"""Return the mock response."""
self.call_count += 1
self.last_question = question
self.last_evidence = evidence
if self.mock_response:
return self.mock_response
# Default mock response
return JudgeAssessment(
details=AssessmentDetails(
mechanism_score=7,
mechanism_reasoning="Mock assessment - good mechanism evidence",
clinical_evidence_score=6,
clinical_reasoning="Mock assessment - moderate clinical evidence",
drug_candidates=["Drug A", "Drug B"],
key_findings=["Finding 1", "Finding 2"],
),
sufficient=len(evidence) >= 3,
confidence=0.75,
recommendation="synthesize" if len(evidence) >= 3 else "continue",
next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
reasoning="Mock assessment for testing purposes",
)
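Note that `MockJudgeHandler` deliberately mirrors `JudgeHandler.assess()`, so downstream code can accept either via duck typing. If you want that contract to be explicit, a `Protocol` is one option; the sketch below is an assumption of this spec, not a required file, and `JudgeLike` / `run_judge_step` are illustrative names.

```python
# Optional: make the shared judge interface explicit with a Protocol.
from typing import List, Protocol

from src.utils.models import Evidence, JudgeAssessment


class JudgeLike(Protocol):
    async def assess(self, question: str, evidence: List[Evidence]) -> JudgeAssessment: ...


async def run_judge_step(judge: JudgeLike, question: str, evidence: List[Evidence]) -> JudgeAssessment:
    # Works with JudgeHandler in production and MockJudgeHandler in tests.
    return await judge.assess(question, evidence)
```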
5. TDD Workflow
Test File: tests/unit/agent_factory/test_judges.py
"""Unit tests for JudgeHandler."""
import pytest
from unittest.mock import AsyncMock, MagicMock, patch
from src.utils.models import (
Evidence,
Citation,
JudgeAssessment,
AssessmentDetails,
)
class TestJudgeHandler:
"""Tests for JudgeHandler."""
@pytest.mark.asyncio
async def test_assess_returns_assessment(self):
"""JudgeHandler should return JudgeAssessment from LLM."""
from src.agent_factory.judges import JudgeHandler
# Create mock assessment
mock_assessment = JudgeAssessment(
details=AssessmentDetails(
mechanism_score=8,
mechanism_reasoning="Strong mechanistic evidence",
clinical_evidence_score=7,
clinical_reasoning="Good clinical support",
drug_candidates=["Metformin"],
key_findings=["Neuroprotective effects"],
),
sufficient=True,
confidence=0.85,
recommendation="synthesize",
next_search_queries=[],
reasoning="Evidence is sufficient for synthesis",
)
# Mock the PydanticAI agent
mock_result = MagicMock()
mock_result.data = mock_assessment
with patch("src.agent_factory.judges.Agent") as mock_agent_class:
mock_agent = AsyncMock()
mock_agent.run = AsyncMock(return_value=mock_result)
mock_agent_class.return_value = mock_agent
handler = JudgeHandler()
# Replace the agent with our mock
handler.agent = mock_agent
evidence = [
Evidence(
content="Metformin shows neuroprotective properties...",
citation=Citation(
source="pubmed",
title="Metformin in AD",
url="https://pubmed.ncbi.nlm.nih.gov/12345/",
date="2024-01-01",
),
)
]
result = await handler.assess("metformin alzheimer", evidence)
assert result.sufficient is True
assert result.recommendation == "synthesize"
assert result.confidence == 0.85
assert "Metformin" in result.details.drug_candidates
@pytest.mark.asyncio
async def test_assess_empty_evidence(self):
"""JudgeHandler should handle empty evidence gracefully."""
from src.agent_factory.judges import JudgeHandler
mock_assessment = JudgeAssessment(
details=AssessmentDetails(
mechanism_score=0,
mechanism_reasoning="No evidence to assess",
clinical_evidence_score=0,
clinical_reasoning="No evidence to assess",
drug_candidates=[],
key_findings=[],
),
sufficient=False,
confidence=0.0,
recommendation="continue",
next_search_queries=["metformin alzheimer mechanism"],
reasoning="No evidence found, need to search more",
)
mock_result = MagicMock()
mock_result.data = mock_assessment
with patch("src.agent_factory.judges.Agent") as mock_agent_class:
mock_agent = AsyncMock()
mock_agent.run = AsyncMock(return_value=mock_result)
mock_agent_class.return_value = mock_agent
handler = JudgeHandler()
handler.agent = mock_agent
result = await handler.assess("metformin alzheimer", [])
assert result.sufficient is False
assert result.recommendation == "continue"
assert len(result.next_search_queries) > 0
@pytest.mark.asyncio
async def test_assess_handles_llm_failure(self):
"""JudgeHandler should return fallback on LLM failure."""
from src.agent_factory.judges import JudgeHandler
with patch("src.agent_factory.judges.Agent") as mock_agent_class:
mock_agent = AsyncMock()
mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
mock_agent_class.return_value = mock_agent
handler = JudgeHandler()
handler.agent = mock_agent
evidence = [
Evidence(
content="Some content",
citation=Citation(
source="pubmed",
title="Title",
url="url",
date="2024",
),
)
]
result = await handler.assess("test question", evidence)
# Should return fallback, not raise
assert result.sufficient is False
assert result.recommendation == "continue"
assert "failed" in result.reasoning.lower()
class TestMockJudgeHandler:
"""Tests for MockJudgeHandler."""
@pytest.mark.asyncio
async def test_mock_handler_returns_default(self):
"""MockJudgeHandler should return default assessment."""
from src.agent_factory.judges import MockJudgeHandler
handler = MockJudgeHandler()
evidence = [
Evidence(
content="Content 1",
citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
),
Evidence(
content="Content 2",
citation=Citation(source="web", title="T2", url="u2", date="2024"),
),
]
result = await handler.assess("test", evidence)
assert handler.call_count == 1
assert handler.last_question == "test"
assert len(handler.last_evidence) == 2
assert result.details.mechanism_score == 7
@pytest.mark.asyncio
async def test_mock_handler_custom_response(self):
"""MockJudgeHandler should return custom response when provided."""
from src.agent_factory.judges import MockJudgeHandler
custom_assessment = JudgeAssessment(
details=AssessmentDetails(
mechanism_score=10,
mechanism_reasoning="Custom reasoning",
clinical_evidence_score=10,
clinical_reasoning="Custom clinical",
drug_candidates=["CustomDrug"],
key_findings=["Custom finding"],
),
sufficient=True,
confidence=1.0,
recommendation="synthesize",
next_search_queries=[],
reasoning="Custom assessment",
)
handler = MockJudgeHandler(mock_response=custom_assessment)
result = await handler.assess("test", [])
assert result.details.mechanism_score == 10
assert result.details.drug_candidates == ["CustomDrug"]
@pytest.mark.asyncio
async def test_mock_handler_insufficient_with_few_evidence(self):
"""MockJudgeHandler should recommend continue with < 3 evidence."""
from src.agent_factory.judges import MockJudgeHandler
handler = MockJudgeHandler()
# Only 2 pieces of evidence
evidence = [
Evidence(
content="Content",
citation=Citation(source="pubmed", title="T", url="u", date="2024"),
),
Evidence(
content="Content 2",
citation=Citation(source="web", title="T2", url="u2", date="2024"),
),
]
result = await handler.assess("test", evidence)
assert result.sufficient is False
assert result.recommendation == "continue"
assert len(result.next_search_queries) > 0
6. Dependencies
Add to pyproject.toml:
```toml
[project]
dependencies = [
    # ... existing deps ...
    "pydantic-ai>=0.0.16",
    "openai>=1.0.0",
    "anthropic>=0.18.0",
]
```
7. Configuration (src/utils/config.py)
Add LLM configuration:
"""Add to src/utils/config.py."""
from pydantic_settings import BaseSettings
from typing import Literal
class Settings(BaseSettings):
"""Application settings."""
# LLM Configuration
llm_provider: Literal["openai", "anthropic"] = "openai"
openai_model: str = "gpt-4o"
anthropic_model: str = "claude-3-5-sonnet-20241022"
# API Keys (loaded from environment)
openai_api_key: str | None = None
anthropic_api_key: str | None = None
ncbi_api_key: str | None = None
class Config:
env_file = ".env"
env_file_encoding = "utf-8"
settings = Settings()
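A quick way to confirm the wiring between configuration and model selection, assuming the keys and provider are supplied via `.env` or exported environment variables (this snippet is illustrative, not part of the spec files):

```python
# Illustrative check: settings load from .env and drive get_model()'s provider switch.
from src.utils.config import settings
from src.agent_factory.judges import get_model

print(settings.llm_provider)        # "openai" unless LLM_PROVIDER=anthropic is set
print(type(get_model()).__name__)   # OpenAIModel or AnthropicModel
```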
8. Implementation Checklist
- [ ] Add `AssessmentDetails` and `JudgeAssessment` models to `src/utils/models.py`
- [ ] Create `src/prompts/__init__.py` (empty, for package)
- [ ] Create `src/prompts/judge.py` with prompt templates
- [ ] Create `src/agent_factory/__init__.py` with exports
- [ ] Implement `src/agent_factory/judges.py` with JudgeHandler
- [ ] Update `src/utils/config.py` with LLM settings
- [ ] Create `tests/unit/agent_factory/__init__.py`
- [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
- [ ] Run `uv run pytest tests/unit/agent_factory/ -v` - ALL TESTS MUST PASS
- [ ] Commit: `git commit -m "feat: phase 3 judge slice complete"`
9. Definition of Done
Phase 3 is COMPLETE when:
- All unit tests pass: `uv run pytest tests/unit/agent_factory/ -v`
- `JudgeHandler` can assess evidence and return structured output
- Graceful degradation: if LLM fails, returns safe fallback
- MockJudgeHandler works for testing without API calls
- Can run this in Python REPL:
```python
import asyncio
import os

from src.utils.models import Evidence, Citation
from src.agent_factory.judges import JudgeHandler, MockJudgeHandler


# Test with mock (no API key needed)
async def test_mock():
    handler = MockJudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Recommendation: {result.recommendation}")
    print(f"Drug candidates: {result.details.drug_candidates}")

asyncio.run(test_mock())


# Test with real LLM (requires API key)
async def test_real():
    os.environ["OPENAI_API_KEY"] = "your-key-here"  # Or set in .env
    handler = JudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models...",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Confidence: {result.confidence}")
    print(f"Reasoning: {result.reasoning}")

# asyncio.run(test_real())  # Uncomment with valid API key
```
Proceed to Phase 4 ONLY after all checkboxes are complete.