# Phase 3 Implementation Spec: Judge Vertical Slice

**Goal**: Implement the "Brain" of the agent — evaluating evidence quality.

**Philosophy**: "Structured Output or Bust."

**Prerequisite**: Phase 2 complete (all search tests passing)

---

## 1. The Slice Definition

This slice covers (a minimal sketch of the call shape follows the list):

1. **Input**: A user question + a list of `Evidence` (from Phase 2).
2. **Process**:
   - Construct a prompt with the evidence.
   - Call the LLM (PydanticAI with OpenAI or Anthropic).
   - Force JSON structured output.
3. **Output**: A `JudgeAssessment` object.
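
End-to-end, the slice reduces to one async call. A minimal sketch of the intended call shape, assuming the `JudgeHandler` API specified in section 4:

```python
"""Sketch of the slice's call shape (assumes the JudgeHandler defined in section 4)."""
from src.agent_factory.judges import JudgeHandler
from src.utils.models import Evidence, JudgeAssessment


async def judge_slice(question: str, evidence: list[Evidence]) -> JudgeAssessment:
    # Evidence comes from Phase 2 search; the handler builds the prompt,
    # calls the LLM, and parses the structured JSON response.
    handler = JudgeHandler()
    return await handler.assess(question, evidence)
```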

**Files to Create**:

- `src/utils/models.py` - Add JudgeAssessment models (extend from Phase 2)
- `src/prompts/judge.py` - Judge prompt templates
- `src/agent_factory/judges.py` - JudgeHandler with PydanticAI
- `tests/unit/agent_factory/test_judges.py` - Unit tests

---

## 2. Models (Add to `src/utils/models.py`)

The output schema must be strict for reliable structured output.

```python
"""Add these models to src/utils/models.py (after Evidence models from Phase 2)."""
from pydantic import BaseModel, Field
from typing import List, Literal


class AssessmentDetails(BaseModel):
    """Detailed assessment of evidence quality."""

    mechanism_score: int = Field(
        ...,
        ge=0,
        le=10,
        description="How well does the evidence explain the mechanism? 0-10"
    )
    mechanism_reasoning: str = Field(
        ...,
        min_length=10,
        description="Explanation of mechanism score"
    )
    clinical_evidence_score: int = Field(
        ...,
        ge=0,
        le=10,
        description="Strength of clinical/preclinical evidence. 0-10"
    )
    clinical_reasoning: str = Field(
        ...,
        min_length=10,
        description="Explanation of clinical evidence score"
    )
    drug_candidates: List[str] = Field(
        default_factory=list,
        description="List of specific drug candidates mentioned"
    )
    key_findings: List[str] = Field(
        default_factory=list,
        description="Key findings from the evidence"
    )


class JudgeAssessment(BaseModel):
    """Complete assessment from the Judge."""

    details: AssessmentDetails
    sufficient: bool = Field(
        ...,
        description="Is evidence sufficient to provide a recommendation?"
    )
    confidence: float = Field(
        ...,
        ge=0.0,
        le=1.0,
        description="Confidence in the assessment (0-1)"
    )
    recommendation: Literal["continue", "synthesize"] = Field(
        ...,
        description="continue = need more evidence, synthesize = ready to answer"
    )
    next_search_queries: List[str] = Field(
        default_factory=list,
        description="If continue, what queries to search next"
    )
    reasoning: str = Field(
        ...,
        min_length=20,
        description="Overall reasoning for the recommendation"
    )
```
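
Because every field carries explicit constraints (`ge`/`le`, `min_length`, `Literal`), malformed LLM output fails fast instead of propagating. A quick sanity check once the models are in place (values below are illustrative, not real assessments):

```python
"""Sanity-check the schema constraints (illustrative values only)."""
from pydantic import ValidationError

from src.utils.models import AssessmentDetails, JudgeAssessment

ok = JudgeAssessment(
    details=AssessmentDetails(
        mechanism_score=8,
        mechanism_reasoning="Inhibits a kinase that drives the disease pathway.",
        clinical_evidence_score=6,
        clinical_reasoning="Two small trials plus several animal studies.",
        drug_candidates=["Metformin"],
        key_findings=["Reduced pathology in mouse models"],
    ),
    sufficient=True,
    confidence=0.8,
    recommendation="synthesize",
    reasoning="Mechanism and clinical signals are consistent and specific.",
)

# Out-of-range values are rejected rather than silently accepted.
try:
    JudgeAssessment.model_validate({**ok.model_dump(), "confidence": 1.7})
except ValidationError as exc:
    # e.g. ('confidence',) 'Input should be less than or equal to 1'
    print(exc.errors()[0]["loc"], exc.errors()[0]["msg"])
```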

---

## 3. Prompt Engineering (`src/prompts/judge.py`)

We treat prompts as code. They should be versioned and clean.

```python
"""Judge prompts for evidence assessment."""
from typing import List

from src.utils.models import Evidence

SYSTEM_PROMPT = """You are an expert drug repurposing research judge.

Your task is to evaluate evidence from biomedical literature and determine if it's sufficient to recommend drug candidates for a given condition.

## Evaluation Criteria

1. **Mechanism Score (0-10)**: How well does the evidence explain the biological mechanism?
   - 0-3: No clear mechanism, speculative
   - 4-6: Some mechanistic insight, but gaps exist
   - 7-10: Clear, well-supported mechanism of action

2. **Clinical Evidence Score (0-10)**: Strength of clinical/preclinical support?
   - 0-3: No clinical data, only theoretical
   - 4-6: Preclinical or early clinical data
   - 7-10: Strong clinical evidence (trials, meta-analyses)

3. **Sufficiency**: Evidence is sufficient when:
   - Combined scores >= 12 AND
   - At least one specific drug candidate identified AND
   - Clear mechanistic rationale exists

## Output Rules
- Always output valid JSON matching the schema
- Be conservative: only recommend "synthesize" when truly confident
- If continuing, suggest specific, actionable search queries
- Never hallucinate drug names or findings not in the evidence
"""


def format_user_prompt(question: str, evidence: List[Evidence]) -> str:
    """
    Format the user prompt with question and evidence.

    Args:
        question: The user's research question
        evidence: List of Evidence objects from search

    Returns:
        Formatted prompt string
    """
    def render(i: int, e: Evidence) -> str:
        # Truncate long content so the prompt stays a manageable size.
        content = e.content[:1500] + "..." if len(e.content) > 1500 else e.content
        return (
            f"### Evidence {i + 1}\n"
            f"**Source**: {e.citation.source.upper()} - {e.citation.title}\n"
            f"**URL**: {e.citation.url}\n"
            f"**Date**: {e.citation.date}\n"
            f"**Content**:\n{content}"
        )

    evidence_text = "\n\n".join(render(i, e) for i, e in enumerate(evidence))

    return f"""## Research Question
{question}

## Available Evidence ({len(evidence)} sources)

{evidence_text}

## Your Task
Evaluate this evidence and determine if it's sufficient to recommend drug repurposing candidates.
Respond with a JSON object matching the JudgeAssessment schema.
"""


def format_empty_evidence_prompt(question: str) -> str:
    """
    Format prompt when no evidence was found.

    Args:
        question: The user's research question

    Returns:
        Formatted prompt string
    """
    return f"""## Research Question
{question}

## Available Evidence
No evidence was found from the search.

## Your Task
Since no evidence was found, recommend search queries that might yield better results.
Set sufficient=False and recommendation="continue".
Suggest 3-5 specific search queries.
"""
```
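
To see what the judge actually receives, render the prompt for a single piece of evidence. The `Evidence`/`Citation` fields are assumed to match the Phase 2 models used in the tests in section 5:

```python
"""Render the user prompt for one evidence item (Evidence/Citation fields assumed from Phase 2)."""
from src.prompts.judge import format_user_prompt
from src.utils.models import Citation, Evidence

ev = Evidence(
    content="Metformin shows neuroprotective effects in AD models...",
    citation=Citation(
        source="pubmed",
        title="Metformin and Alzheimer's",
        url="https://pubmed.ncbi.nlm.nih.gov/12345/",
        date="2024-01-01",
    ),
)

print(format_user_prompt("Can metformin be repurposed for Alzheimer's?", [ev]))
# ## Research Question
# Can metformin be repurposed for Alzheimer's?
#
# ## Available Evidence (1 sources)
#
# ### Evidence 1
# **Source**: PUBMED - Metformin and Alzheimer's
# ...
```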

---

## 4. JudgeHandler Implementation (`src/agent_factory/judges.py`)

Using PydanticAI for structured output with retry logic.

```python
"""Judge handler for evidence assessment using PydanticAI."""
import os
from typing import List

import structlog
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.models.anthropic import AnthropicModel

from src.utils.models import Evidence, JudgeAssessment, AssessmentDetails
from src.utils.config import settings
from src.prompts.judge import SYSTEM_PROMPT, format_user_prompt, format_empty_evidence_prompt

logger = structlog.get_logger()


def get_model():
    """Get the LLM model based on configuration."""
    provider = getattr(settings, "llm_provider", "openai")
    if provider == "anthropic":
        return AnthropicModel(
            model_name=getattr(settings, "anthropic_model", "claude-3-5-sonnet-20241022"),
            api_key=os.getenv("ANTHROPIC_API_KEY"),
        )
    else:
        return OpenAIModel(
            model_name=getattr(settings, "openai_model", "gpt-4o"),
            api_key=os.getenv("OPENAI_API_KEY"),
        )


class JudgeHandler:
    """
    Handles evidence assessment using an LLM with structured output.

    Uses PydanticAI to ensure responses match the JudgeAssessment schema.
    """

    def __init__(self, model=None):
        """
        Initialize the JudgeHandler.

        Args:
            model: Optional PydanticAI model. If None, uses config default.
        """
        self.model = model or get_model()
        self.agent = Agent(
            model=self.model,
            result_type=JudgeAssessment,
            system_prompt=SYSTEM_PROMPT,
            retries=3,
        )

    async def assess(
        self,
        question: str,
        evidence: List[Evidence],
    ) -> JudgeAssessment:
        """
        Assess evidence and determine if it's sufficient.

        Args:
            question: The user's research question
            evidence: List of Evidence objects from search

        Returns:
            JudgeAssessment with evaluation results. If the LLM call fails
            after retries, a safe fallback assessment is returned instead
            of raising.
        """
        logger.info(
            "Starting evidence assessment",
            question=question[:100],
            evidence_count=len(evidence),
        )

        # Format the prompt based on whether we have evidence
        if evidence:
            user_prompt = format_user_prompt(question, evidence)
        else:
            user_prompt = format_empty_evidence_prompt(question)

        try:
            # Run the agent with structured output
            result = await self.agent.run(user_prompt)
            assessment = result.data
            logger.info(
                "Assessment complete",
                sufficient=assessment.sufficient,
                recommendation=assessment.recommendation,
                confidence=assessment.confidence,
            )
            return assessment
        except Exception as e:
            logger.error("Assessment failed", error=str(e))
            # Return a safe default assessment on failure
            return self._create_fallback_assessment(question, str(e))

    def _create_fallback_assessment(
        self,
        question: str,
        error: str,
    ) -> JudgeAssessment:
        """
        Create a fallback assessment when the LLM fails.

        Args:
            question: The original question
            error: The error message

        Returns:
            Safe fallback JudgeAssessment
        """
        return JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=0,
                mechanism_reasoning="Assessment failed due to LLM error",
                clinical_evidence_score=0,
                clinical_reasoning="Assessment failed due to LLM error",
                drug_candidates=[],
                key_findings=[],
            ),
            sufficient=False,
            confidence=0.0,
            recommendation="continue",
            next_search_queries=[
                f"{question} mechanism",
                f"{question} clinical trials",
                f"{question} drug candidates",
            ],
            reasoning=f"Assessment failed: {error}. Recommend retrying with refined queries.",
        )


class MockJudgeHandler:
    """
    Mock JudgeHandler for testing without LLM calls.

    Use this in unit tests to avoid API calls.
    """

    def __init__(self, mock_response: JudgeAssessment | None = None):
        """
        Initialize with optional mock response.

        Args:
            mock_response: The assessment to return. If None, uses default.
        """
        self.mock_response = mock_response
        self.call_count = 0
        self.last_question = None
        self.last_evidence = None

    async def assess(
        self,
        question: str,
        evidence: List[Evidence],
    ) -> JudgeAssessment:
        """Return the mock response."""
        self.call_count += 1
        self.last_question = question
        self.last_evidence = evidence

        if self.mock_response:
            return self.mock_response

        # Default mock response
        return JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=7,
                mechanism_reasoning="Mock assessment - good mechanism evidence",
                clinical_evidence_score=6,
                clinical_reasoning="Mock assessment - moderate clinical evidence",
                drug_candidates=["Drug A", "Drug B"],
                key_findings=["Finding 1", "Finding 2"],
            ),
            sufficient=len(evidence) >= 3,
            confidence=0.75,
            recommendation="synthesize" if len(evidence) >= 3 else "continue",
            next_search_queries=["query 1", "query 2"] if len(evidence) < 3 else [],
            reasoning="Mock assessment for testing purposes",
        )
```
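
Because the model is injected through the constructor, callers can pin a specific provider or a cheaper model without touching settings, e.g. in an eval script. A sketch, mirroring the `OpenAIModel` call used in `get_model` above (the key would normally come from the environment, not be hard-coded):

```python
"""Sketch: pinning a specific model instead of the config default."""
from pydantic_ai.models.openai import OpenAIModel

from src.agent_factory.judges import JudgeHandler

# Same constructor arguments as get_model() uses above.
cheap_judge = JudgeHandler(
    model=OpenAIModel(model_name="gpt-4o-mini", api_key="sk-...")
)
```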

---

## 5. TDD Workflow

### Test File: `tests/unit/agent_factory/test_judges.py`

```python
"""Unit tests for JudgeHandler."""
import pytest
from unittest.mock import AsyncMock, MagicMock, patch

from src.utils.models import (
    Evidence,
    Citation,
    JudgeAssessment,
    AssessmentDetails,
)


class TestJudgeHandler:
    """Tests for JudgeHandler."""

    @pytest.mark.asyncio
    async def test_assess_returns_assessment(self):
        """JudgeHandler should return JudgeAssessment from LLM."""
        from src.agent_factory.judges import JudgeHandler

        # Create mock assessment
        mock_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=8,
                mechanism_reasoning="Strong mechanistic evidence",
                clinical_evidence_score=7,
                clinical_reasoning="Good clinical support",
                drug_candidates=["Metformin"],
                key_findings=["Neuroprotective effects"],
            ),
            sufficient=True,
            confidence=0.85,
            recommendation="synthesize",
            next_search_queries=[],
            reasoning="Evidence is sufficient for synthesis",
        )

        # Mock the PydanticAI agent
        mock_result = MagicMock()
        mock_result.data = mock_assessment

        # Patch get_model as well so no API key is needed to construct the handler.
        with (
            patch("src.agent_factory.judges.get_model"),
            patch("src.agent_factory.judges.Agent") as mock_agent_class,
        ):
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(return_value=mock_result)
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler()
            # Replace the agent with our mock
            handler.agent = mock_agent

            evidence = [
                Evidence(
                    content="Metformin shows neuroprotective properties...",
                    citation=Citation(
                        source="pubmed",
                        title="Metformin in AD",
                        url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                        date="2024-01-01",
                    ),
                )
            ]

            result = await handler.assess("metformin alzheimer", evidence)

            assert result.sufficient is True
            assert result.recommendation == "synthesize"
            assert result.confidence == 0.85
            assert "Metformin" in result.details.drug_candidates

    @pytest.mark.asyncio
    async def test_assess_empty_evidence(self):
        """JudgeHandler should handle empty evidence gracefully."""
        from src.agent_factory.judges import JudgeHandler

        mock_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=0,
                mechanism_reasoning="No evidence to assess",
                clinical_evidence_score=0,
                clinical_reasoning="No evidence to assess",
                drug_candidates=[],
                key_findings=[],
            ),
            sufficient=False,
            confidence=0.0,
            recommendation="continue",
            next_search_queries=["metformin alzheimer mechanism"],
            reasoning="No evidence found, need to search more",
        )

        mock_result = MagicMock()
        mock_result.data = mock_assessment

        with (
            patch("src.agent_factory.judges.get_model"),
            patch("src.agent_factory.judges.Agent") as mock_agent_class,
        ):
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(return_value=mock_result)
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler()
            handler.agent = mock_agent

            result = await handler.assess("metformin alzheimer", [])

            assert result.sufficient is False
            assert result.recommendation == "continue"
            assert len(result.next_search_queries) > 0

    @pytest.mark.asyncio
    async def test_assess_handles_llm_failure(self):
        """JudgeHandler should return fallback on LLM failure."""
        from src.agent_factory.judges import JudgeHandler

        with (
            patch("src.agent_factory.judges.get_model"),
            patch("src.agent_factory.judges.Agent") as mock_agent_class,
        ):
            mock_agent = AsyncMock()
            mock_agent.run = AsyncMock(side_effect=Exception("API Error"))
            mock_agent_class.return_value = mock_agent

            handler = JudgeHandler()
            handler.agent = mock_agent

            evidence = [
                Evidence(
                    content="Some content",
                    citation=Citation(
                        source="pubmed",
                        title="Title",
                        url="url",
                        date="2024",
                    ),
                )
            ]

            result = await handler.assess("test question", evidence)

            # Should return fallback, not raise
            assert result.sufficient is False
            assert result.recommendation == "continue"
            assert "failed" in result.reasoning.lower()


class TestMockJudgeHandler:
    """Tests for MockJudgeHandler."""

    @pytest.mark.asyncio
    async def test_mock_handler_returns_default(self):
        """MockJudgeHandler should return default assessment."""
        from src.agent_factory.judges import MockJudgeHandler

        handler = MockJudgeHandler()
        evidence = [
            Evidence(
                content="Content 1",
                citation=Citation(source="pubmed", title="T1", url="u1", date="2024"),
            ),
            Evidence(
                content="Content 2",
                citation=Citation(source="web", title="T2", url="u2", date="2024"),
            ),
        ]

        result = await handler.assess("test", evidence)

        assert handler.call_count == 1
        assert handler.last_question == "test"
        assert len(handler.last_evidence) == 2
        assert result.details.mechanism_score == 7

    @pytest.mark.asyncio
    async def test_mock_handler_custom_response(self):
        """MockJudgeHandler should return custom response when provided."""
        from src.agent_factory.judges import MockJudgeHandler

        custom_assessment = JudgeAssessment(
            details=AssessmentDetails(
                mechanism_score=10,
                mechanism_reasoning="Custom reasoning",
                clinical_evidence_score=10,
                clinical_reasoning="Custom clinical",
                drug_candidates=["CustomDrug"],
                key_findings=["Custom finding"],
            ),
            sufficient=True,
            confidence=1.0,
            recommendation="synthesize",
            next_search_queries=[],
            # Must satisfy the schema's min_length=20 constraint on reasoning.
            reasoning="Custom assessment used for this test",
        )

        handler = MockJudgeHandler(mock_response=custom_assessment)
        result = await handler.assess("test", [])

        assert result.details.mechanism_score == 10
        assert result.details.drug_candidates == ["CustomDrug"]

    @pytest.mark.asyncio
    async def test_mock_handler_insufficient_with_few_evidence(self):
        """MockJudgeHandler should recommend continue with < 3 evidence."""
        from src.agent_factory.judges import MockJudgeHandler

        handler = MockJudgeHandler()
        # Only 2 pieces of evidence
        evidence = [
            Evidence(
                content="Content",
                citation=Citation(source="pubmed", title="T", url="u", date="2024"),
            ),
            Evidence(
                content="Content 2",
                citation=Citation(source="web", title="T2", url="u2", date="2024"),
            ),
        ]

        result = await handler.assess("test", evidence)

        assert result.sufficient is False
        assert result.recommendation == "continue"
        assert len(result.next_search_queries) > 0
```

---

## 6. Dependencies

Add to `pyproject.toml`:

```toml
[project]
dependencies = [
    # ... existing deps ...
    "pydantic-ai>=0.0.16",
    "openai>=1.0.0",
    "anthropic>=0.18.0",
]
```

---

## 7. Configuration (`src/utils/config.py`)

Add LLM configuration:

```python
"""Add to src/utils/config.py."""
from pydantic_settings import BaseSettings
from typing import Literal


class Settings(BaseSettings):
    """Application settings."""

    # LLM Configuration
    llm_provider: Literal["openai", "anthropic"] = "openai"
    openai_model: str = "gpt-4o"
    anthropic_model: str = "claude-3-5-sonnet-20241022"

    # API Keys (loaded from environment)
    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    ncbi_api_key: str | None = None

    class Config:
        env_file = ".env"
        env_file_encoding = "utf-8"


settings = Settings()
```
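
With this in place, provider and model selection is driven entirely by environment variables or a `.env` file (pydantic-settings matches variable names case-insensitively by default). A sketch of the expected usage, assuming a `.env` at the project root:

```python
"""Sketch: settings resolved from the environment / .env (assumed project-root .env).

# .env
# LLM_PROVIDER=anthropic
# ANTHROPIC_API_KEY=...
"""
from src.utils.config import settings

print(settings.llm_provider)     # "anthropic" if set in .env, else "openai"
print(settings.anthropic_model)  # "claude-3-5-sonnet-20241022" unless overridden
```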

---

## 8. Implementation Checklist

- [ ] Add `AssessmentDetails` and `JudgeAssessment` models to `src/utils/models.py`
- [ ] Create `src/prompts/__init__.py` (empty, for package)
- [ ] Create `src/prompts/judge.py` with prompt templates
- [ ] Create `src/agent_factory/__init__.py` with exports
- [ ] Implement `src/agent_factory/judges.py` with JudgeHandler
- [ ] Update `src/utils/config.py` with LLM settings
- [ ] Create `tests/unit/agent_factory/__init__.py`
- [ ] Write tests in `tests/unit/agent_factory/test_judges.py`
- [ ] Run `uv run pytest tests/unit/agent_factory/ -v` — **ALL TESTS MUST PASS**
- [ ] Commit: `git commit -m "feat: phase 3 judge slice complete"`

---

## 9. Definition of Done

Phase 3 is **COMPLETE** when:

1. All unit tests pass: `uv run pytest tests/unit/agent_factory/ -v`
2. `JudgeHandler` can assess evidence and return structured output
3. Graceful degradation: if the LLM fails, a safe fallback is returned
4. `MockJudgeHandler` works for testing without API calls
5. This can be run in a Python REPL:

```python
import asyncio
import os

from src.utils.models import Evidence, Citation
from src.agent_factory.judges import JudgeHandler, MockJudgeHandler


# Test with mock (no API key needed)
async def test_mock():
    handler = MockJudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Recommendation: {result.recommendation}")
    print(f"Drug candidates: {result.details.drug_candidates}")

asyncio.run(test_mock())


# Test with real LLM (requires API key)
async def test_real():
    os.environ["OPENAI_API_KEY"] = "your-key-here"  # Or set in .env
    handler = JudgeHandler()
    evidence = [
        Evidence(
            content="Metformin shows neuroprotective effects in AD models...",
            citation=Citation(
                source="pubmed",
                title="Metformin and Alzheimer's",
                url="https://pubmed.ncbi.nlm.nih.gov/12345/",
                date="2024-01-01",
            ),
        ),
    ]
    result = await handler.assess("metformin alzheimer", evidence)
    print(f"Sufficient: {result.sufficient}")
    print(f"Confidence: {result.confidence}")
    print(f"Reasoning: {result.reasoning}")

# asyncio.run(test_real())  # Uncomment with valid API key
```
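
For reference, the mock run prints deterministic values, since `MockJudgeHandler` only reports sufficiency at three or more evidence items:

```python
# Expected output of asyncio.run(test_mock()) with a single Evidence item:
#   Sufficient: False
#   Recommendation: continue
#   Drug candidates: ['Drug A', 'Drug B']
```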

**Proceed to Phase 4 ONLY after all checkboxes are complete.**