Commit · 3fcd8e7
1 Parent(s): 388cd05
feat(docs): update implementation roadmap and add specs for Phases 9-11
- Updated the implementation roadmap to reflect the completion of Phases 1-8.
- Added detailed specifications for Phase 9: Remove DuckDuckGo, Phase 10: ClinicalTrials.gov Integration, and Phase 11: bioRxiv Preprint Integration.
- Enhanced the status section to indicate the completion of Phases 1-8 and readiness for Phases 9-11.
docs/implementation/09_phase_source_cleanup.md
ADDED
@@ -0,0 +1,257 @@
# Phase 9 Implementation Spec: Remove DuckDuckGo

**Goal**: Remove unreliable web search and focus on credible scientific sources.
**Philosophy**: "Scientific credibility over source quantity."
**Prerequisite**: Phase 8 complete (all agents working)
**Estimated Time**: 30-45 minutes

---

## 1. Why Remove DuckDuckGo?

### Current Problems

| Issue | Impact |
|-------|--------|
| Rate-limited aggressively | Returns 0 results frequently |
| Not peer-reviewed | Random blogs, news, misinformation |
| Not citable | Cannot use in scientific reports |
| Adds noise | Dilutes quality evidence |

### After Removal

| Benefit | Impact |
|---------|--------|
| Cleaner codebase | -150 lines of dead code |
| No rate limit failures | 100% source reliability |
| Scientific credibility | All sources peer-reviewed/preprint |
| Simpler debugging | Fewer failure modes |

---

## 2. Files to Modify/Delete

### 2.1 DELETE: `src/tools/websearch.py`

```bash
# File to delete entirely
src/tools/websearch.py  # ~80 lines
```

### 2.2 MODIFY: SearchHandler Usage

Update all files that instantiate `SearchHandler` with `WebTool()`:

| File | Change |
|------|--------|
| `examples/search_demo/run_search.py` | Remove `WebTool()` from tools list |
| `examples/hypothesis_demo/run_hypothesis.py` | Remove `WebTool()` from tools list |
| `examples/full_stack_demo/run_full.py` | Remove `WebTool()` from tools list |
| `examples/orchestrator_demo/run_agent.py` | Remove `WebTool()` from tools list |
| `examples/orchestrator_demo/run_magentic.py` | Remove `WebTool()` from tools list |

### 2.3 MODIFY: Type Definitions

Update `src/utils/models.py`:

```python
# BEFORE
sources_searched: list[Literal["pubmed", "web"]]

# AFTER (Phase 9)
sources_searched: list[Literal["pubmed"]]

# AFTER (Phase 10-11)
sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
```
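
The `Literal` change above could also be expressed through a single shared alias, so the allowed source names live in one place as Phases 9-11 land. This is only a sketch: `SourceName` and `is_known_source` are hypothetical names that do not appear in the spec, which inlines the `Literal` on the field.

```python
from typing import Literal, get_args

# Hypothetical alias (not in the spec): the final Phase 10-11 set of sources.
SourceName = Literal["pubmed", "clinicaltrials", "biorxiv"]


def is_known_source(name: str) -> bool:
    """Return True if name is one of the values allowed by SourceName."""
    return name in get_args(SourceName)


print(is_known_source("pubmed"))
print(is_known_source("web"))
```

With an alias like this, a field could be annotated as `list[SourceName]`, and dropping `"web"` would be a one-line change.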

### 2.4 DELETE: Tests for WebTool

```bash
# File to delete
tests/unit/tools/test_websearch.py
```

---

## 3. TDD Implementation

### 3.1 Test: SearchHandler Works Without WebTool

```python
# tests/unit/tools/test_search_handler.py

import pytest


@pytest.mark.asyncio
async def test_search_handler_pubmed_only():
    """SearchHandler should work with only PubMed tool."""
    from src.tools.pubmed import PubMedTool
    from src.tools.search_handler import SearchHandler

    handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)

    # Should not raise
    result = await handler.execute("metformin diabetes", max_results_per_tool=3)

    assert result.sources_searched == ["pubmed"]
    assert "web" not in result.sources_searched
    assert len(result.errors) == 0  # No failures
```

### 3.2 Test: WebTool Import Fails (Deleted)

```python
# tests/unit/tools/test_websearch_removed.py

import pytest


def test_websearch_module_deleted():
    """WebTool should no longer exist."""
    with pytest.raises(ImportError):
        # The import itself is the assertion, so the unused name is expected.
        from src.tools.websearch import WebTool  # noqa: F401
```

### 3.3 Test: Examples Don't Reference WebTool

```python
# tests/unit/test_no_webtool_references.py

import ast
import pathlib

import pytest


def test_examples_no_webtool_imports():
    """No example files should import WebTool."""
    examples_dir = pathlib.Path("examples")

    for py_file in examples_dir.rglob("*.py"):
        content = py_file.read_text()
        tree = ast.parse(content)

        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom):
                if node.module and "websearch" in node.module:
                    pytest.fail(f"{py_file} imports websearch (should be removed)")
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if "websearch" in alias.name:
                        pytest.fail(f"{py_file} imports websearch (should be removed)")
```
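
The test in 3.3 only inspects import statements. A slightly broader check could also flag any remaining use of the `WebTool` name itself. The helper below is a sketch under that assumption: `find_webtool_refs` is a hypothetical function, not part of the spec, and it works on a source string so it can be tried without touching the repository.

```python
import ast


def find_webtool_refs(source: str) -> list[str]:
    """Describe every websearch import or WebTool name use found in source."""
    hits: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # Matches "from src.tools.websearch import ..."
            if node.module and "websearch" in node.module:
                hits.append(f"line {node.lineno}: from {node.module} import ...")
        elif isinstance(node, ast.Import):
            # Matches "import src.tools.websearch"
            for alias in node.names:
                if "websearch" in alias.name:
                    hits.append(f"line {node.lineno}: import {alias.name}")
        elif isinstance(node, ast.Name) and node.id == "WebTool":
            # Matches bare uses such as "WebTool()" in a tools list
            hits.append(f"line {node.lineno}: WebTool used")
    return hits


stale = "from src.tools.websearch import WebTool\ntools = [WebTool()]\n"
for hit in find_webtool_refs(stale):
    print(hit)
```

In a test, the same function could be applied to `py_file.read_text()` for each file under `examples/`, failing if the returned list is non-empty.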

---

## 4. Step-by-Step Implementation

### Step 1: Write Tests First (TDD)

```bash
# Create the test file
touch tests/unit/tools/test_websearch_removed.py
# Write the tests from section 3
```

### Step 2: Run Tests (Should Fail)

```bash
uv run pytest tests/unit/tools/test_websearch_removed.py -v
# Expected: FAIL (websearch still exists)
```

### Step 3: Delete WebTool

```bash
rm src/tools/websearch.py
rm tests/unit/tools/test_websearch.py
```

### Step 4: Update SearchHandler Usages

```python
# BEFORE (in each example file)
from src.tools.websearch import WebTool
search_handler = SearchHandler(tools=[PubMedTool(), WebTool()], timeout=30.0)

# AFTER
from src.tools.pubmed import PubMedTool
search_handler = SearchHandler(tools=[PubMedTool()], timeout=30.0)
```

### Step 5: Update Type Definitions

```python
# src/utils/models.py
# BEFORE
sources_searched: list[Literal["pubmed", "web"]]

# AFTER
sources_searched: list[Literal["pubmed"]]
```

### Step 6: Run All Tests

```bash
uv run pytest tests/unit/ -v
# Expected: ALL PASS
```

### Step 7: Run Lints

```bash
uv run ruff check src tests examples
uv run mypy src
# Expected: No errors
```

---

## 5. Definition of Done

Phase 9 is **COMPLETE** when:

- [ ] `src/tools/websearch.py` deleted
- [ ] `tests/unit/tools/test_websearch.py` deleted
- [ ] All example files updated (no WebTool imports)
- [ ] Type definitions updated in models.py
- [ ] New tests verify WebTool is removed
- [ ] All existing tests pass
- [ ] Lints pass
- [ ] Examples run successfully with PubMed only

---

## 6. Verification Commands

```bash
# 1. Verify websearch.py is gone
ls src/tools/websearch.py 2>&1 | grep "No such file"

# 2. Verify no WebTool imports remain
grep -r "WebTool" src/ examples/ && echo "FAIL: WebTool references found" || echo "PASS"
grep -r "websearch" src/ examples/ && echo "FAIL: websearch references found" || echo "PASS"

# 3. Run tests
uv run pytest tests/unit/ -v

# 4. Run example (should work)
source .env && uv run python examples/search_demo/run_search.py "metformin cancer"
```

---

## 7. Rollback Plan

If something breaks and the deletion has not been committed yet, restore the files from `HEAD`:

```bash
git checkout HEAD -- src/tools/websearch.py
git checkout HEAD -- tests/unit/tools/test_websearch.py
```

---

## 8. Value Delivered

| Before | After |
|--------|-------|
| 2 search sources (1 broken) | 1 reliable source |
| Rate limit failures | No failures |
| Web noise in results | Pure scientific sources |
| ~230 lines for websearch | 0 lines |

**Net effect**: Simpler, more reliable, more credible.
docs/implementation/10_phase_clinicaltrials.md
ADDED
@@ -0,0 +1,456 @@
# Phase 10 Implementation Spec: ClinicalTrials.gov Integration

**Goal**: Add clinical trial search for drug repurposing evidence.
**Philosophy**: "Clinical trials are the bridge from hypothesis to therapy."
**Prerequisite**: Phase 9 complete (DuckDuckGo removed)
**Estimated Time**: 2-3 hours

---

## 1. Why ClinicalTrials.gov?

### Scientific Value

| Feature | Value for Drug Repurposing |
|---------|---------------------------|
| **400,000+ studies** | Massive evidence base |
| **Trial phase data** | Phase I/II/III = evidence strength |
| **Intervention details** | Exact drug + dosing |
| **Outcome measures** | What was measured |
| **Status tracking** | Completed vs recruiting |
| **Free API** | No cost, no key required |

### Example Query Response

Query: "metformin Alzheimer's" (simplified; the real API v2 response nests these fields under `protocolSection`, as handled in Section 4)

```json
{
  "studies": [
    {
      "nctId": "NCT04098666",
      "briefTitle": "Metformin in Alzheimer's Dementia Prevention",
      "phase": "Phase 2",
      "status": "Recruiting",
      "conditions": ["Alzheimer Disease"],
      "interventions": ["Drug: Metformin"]
    }
  ]
}
```

**This is GOLD for drug repurposing**: actual trials testing the hypothesis!

---

## 2. API Specification

### Endpoint

```
Base URL: https://clinicaltrials.gov/api/v2/studies
```

### Key Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `query.cond` | Condition/disease | `Alzheimer` |
| `query.intr` | Intervention/drug | `Metformin` |
| `query.term` | General search | `metformin alzheimer` |
| `pageSize` | Results per page | `20` |
| `fields` | Fields to return | See below |

### Fields We Need

```
NCTId, BriefTitle, Phase, OverallStatus, Condition,
InterventionName, StartDate, CompletionDate, BriefSummary
```
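
To make the parameters above concrete, here is a minimal sketch of how a request URL could be assembled from them using only the standard library. `build_search_url` is a hypothetical helper for illustration; the spec's actual tool passes the same parameters to `httpx` instead. The 100-result cap mirrors the limit stated in this spec.

```python
from urllib.parse import urlencode

BASE_URL = "https://clinicaltrials.gov/api/v2/studies"
FIELDS = [
    "NCTId", "BriefTitle", "Phase", "OverallStatus", "Condition",
    "InterventionName", "StartDate", "CompletionDate", "BriefSummary",
]


def build_search_url(term: str, page_size: int = 20) -> str:
    """Build a general-search URL; page_size is capped at 100 as in this spec."""
    params = {
        "query.term": term,
        "pageSize": min(page_size, 100),
        "fields": "|".join(FIELDS),  # same pipe-joined form as the tool in Section 4
    }
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("metformin alzheimer"))
```

Swapping `query.term` for `query.cond` and `query.intr` would give a more targeted condition-plus-intervention search, at the cost of having to split the user's query first.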

### Rate Limits

- ~50 requests/minute per IP
- No authentication required
- Paginated (100 results max per call)

### Documentation

- [API v2 Docs](https://clinicaltrials.gov/data-api/api)
- [Migration Guide](https://www.nlm.nih.gov/pubs/techbull/ma24/ma24_clinicaltrials_api.html)

---

## 3. Data Model

### 3.1 Update Citation Source Type (`src/utils/models.py`)

```python
# BEFORE
source: Literal["pubmed", "web"]

# AFTER
source: Literal["pubmed", "clinicaltrials", "biorxiv"]
```

### 3.2 Evidence from Clinical Trials

Clinical trial data maps to our existing `Evidence` model:

```python
Evidence(
    content=f"{brief_summary}. Phase: {phase}. Status: {status}.",
    citation=Citation(
        source="clinicaltrials",
        title=brief_title,
        url=f"https://clinicaltrials.gov/study/{nct_id}",
        date=start_date or "Unknown",
        authors=[],  # Trials don't have authors in the same way
    ),
    relevance=0.85,  # Trials are highly relevant for repurposing (same value as Section 4)
)
```

---

## 4. Implementation

### 4.1 ClinicalTrials Tool (`src/tools/clinicaltrials.py`)

```python
"""ClinicalTrials.gov search tool using API v2."""

from typing import Any

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class ClinicalTrialsTool:
    """Search tool for ClinicalTrials.gov."""

    BASE_URL = "https://clinicaltrials.gov/api/v2/studies"
    FIELDS = [
        "NCTId",
        "BriefTitle",
        "Phase",
        "OverallStatus",
        "Condition",
        "InterventionName",
        "StartDate",
        "BriefSummary",
    ]

    @property
    def name(self) -> str:
        return "clinicaltrials"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search ClinicalTrials.gov for studies.

        Args:
            query: Search query (e.g., "metformin alzheimer")
            max_results: Maximum results to return

        Returns:
            List of Evidence objects from clinical trials
        """
        params = {
            "query.term": query,
            "pageSize": min(max_results, 100),
            "fields": "|".join(self.FIELDS),
        }

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(self.BASE_URL, params=params)
                response.raise_for_status()
            except httpx.HTTPError as e:
                # HTTPError covers both bad status codes and transport failures
                raise SearchError(f"ClinicalTrials.gov search failed: {e}") from e

            data = response.json()
            studies = data.get("studies", [])

            return [self._study_to_evidence(study) for study in studies[:max_results]]

    def _study_to_evidence(self, study: dict[str, Any]) -> Evidence:
        """Convert a clinical trial study to Evidence."""
        # Navigate nested structure
        protocol = study.get("protocolSection", {})
        id_module = protocol.get("identificationModule", {})
        status_module = protocol.get("statusModule", {})
        desc_module = protocol.get("descriptionModule", {})
        design_module = protocol.get("designModule", {})
        conditions_module = protocol.get("conditionsModule", {})
        arms_module = protocol.get("armsInterventionsModule", {})

        nct_id = id_module.get("nctId", "Unknown")
        title = id_module.get("briefTitle", "Untitled Study")
        status = status_module.get("overallStatus", "Unknown")
        start_date = status_module.get("startDateStruct", {}).get("date", "Unknown")

        # Get phase (might be a list)
        phases = design_module.get("phases", [])
        phase = phases[0] if phases else "Not Applicable"

        # Get conditions
        conditions = conditions_module.get("conditions", [])
        conditions_str = ", ".join(conditions[:3]) if conditions else "Unknown"

        # Get interventions
        interventions = arms_module.get("interventions", [])
        intervention_names = [i.get("name", "") for i in interventions[:3]]
        interventions_str = ", ".join(intervention_names) if intervention_names else "Unknown"

        # Get summary; add an ellipsis only when it is actually truncated
        summary = desc_module.get("briefSummary", "No summary available.")
        if len(summary) > 500:
            summary = summary[:500] + "..."

        # Build content with key trial info
        content = (
            f"{summary} "
            f"Trial Phase: {phase}. "
            f"Status: {status}. "
            f"Conditions: {conditions_str}. "
            f"Interventions: {interventions_str}."
        )

        return Evidence(
            content=content[:2000],
            citation=Citation(
                source="clinicaltrials",
                title=title[:500],
                url=f"https://clinicaltrials.gov/study/{nct_id}",
                date=start_date,
                authors=[],  # Trials don't have traditional authors
            ),
            relevance=0.85,  # Trials are highly relevant for repurposing
        )
```
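
The nested lookups in `_study_to_evidence` can be tried in isolation, without `httpx` or the project's `Evidence` model. The sketch below assumes the same `protocolSection` layout used in the tool and in the test fixture; `extract_trial_fields` is a hypothetical helper that returns a plain dict instead of an `Evidence` object.

```python
from typing import Any


def extract_trial_fields(study: dict[str, Any]) -> dict[str, str]:
    """Pull the fields used for Evidence out of one API v2 study record."""
    protocol = study.get("protocolSection", {})
    id_module = protocol.get("identificationModule", {})
    status_module = protocol.get("statusModule", {})
    phases = protocol.get("designModule", {}).get("phases", [])
    interventions = protocol.get("armsInterventionsModule", {}).get("interventions", [])

    nct_id = id_module.get("nctId", "Unknown")
    return {
        "nct_id": nct_id,
        "title": id_module.get("briefTitle", "Untitled Study"),
        "status": status_module.get("overallStatus", "Unknown"),
        # Every missing module falls back to a default instead of raising
        "phase": phases[0] if phases else "Not Applicable",
        "interventions": ", ".join(i.get("name", "") for i in interventions[:3]) or "Unknown",
        "url": f"https://clinicaltrials.gov/study/{nct_id}",
    }


sample = {
    "protocolSection": {
        "identificationModule": {
            "nctId": "NCT04098666",
            "briefTitle": "Metformin in Alzheimer's Dementia Prevention",
        },
        "statusModule": {"overallStatus": "Recruiting"},
        "designModule": {"phases": ["PHASE2"]},
        "armsInterventionsModule": {"interventions": [{"name": "Metformin", "type": "Drug"}]},
    }
}
fields = extract_trial_fields(sample)
print(fields["phase"], fields["status"], fields["url"])
```

Chaining `.get(..., {})` at each level is what lets a sparse record (for example, an observational study with no `phases`) pass through without a `KeyError`.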

---

## 5. TDD Test Suite

### 5.1 Unit Tests (`tests/unit/tools/test_clinicaltrials.py`)

```python
"""Unit tests for ClinicalTrials.gov tool."""

import pytest
import respx
from httpx import Response

from src.tools.clinicaltrials import ClinicalTrialsTool
from src.utils.models import Evidence


@pytest.fixture
def mock_clinicaltrials_response():
    """Mock ClinicalTrials.gov API response."""
    return {
        "studies": [
            {
                "protocolSection": {
                    "identificationModule": {
                        "nctId": "NCT04098666",
                        "briefTitle": "Metformin in Alzheimer's Dementia Prevention"
                    },
                    "statusModule": {
                        "overallStatus": "Recruiting",
                        "startDateStruct": {"date": "2020-01-15"}
                    },
                    "descriptionModule": {
                        "briefSummary": "This study evaluates metformin for Alzheimer's prevention."
                    },
                    "designModule": {
                        "phases": ["PHASE2"]
                    },
                    "conditionsModule": {
                        "conditions": ["Alzheimer Disease", "Dementia"]
                    },
                    "armsInterventionsModule": {
                        "interventions": [
                            {"name": "Metformin", "type": "Drug"}
                        ]
                    }
                }
            }
        ]
    }


class TestClinicalTrialsTool:
    """Tests for ClinicalTrialsTool."""

    def test_tool_name(self):
        """Tool should have correct name."""
        tool = ClinicalTrialsTool()
        assert tool.name == "clinicaltrials"

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_returns_evidence(self, mock_clinicaltrials_response):
        """Search should return Evidence objects."""
        respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
            return_value=Response(200, json=mock_clinicaltrials_response)
        )

        tool = ClinicalTrialsTool()
        results = await tool.search("metformin alzheimer", max_results=5)

        assert len(results) == 1
        assert isinstance(results[0], Evidence)
        assert results[0].citation.source == "clinicaltrials"
        assert "NCT04098666" in results[0].citation.url
        assert "Metformin" in results[0].citation.title

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_extracts_phase(self, mock_clinicaltrials_response):
        """Search should extract trial phase."""
        respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
            return_value=Response(200, json=mock_clinicaltrials_response)
        )

        tool = ClinicalTrialsTool()
        results = await tool.search("metformin alzheimer")

        assert "PHASE2" in results[0].content

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_extracts_status(self, mock_clinicaltrials_response):
        """Search should extract trial status."""
        respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
            return_value=Response(200, json=mock_clinicaltrials_response)
        )

        tool = ClinicalTrialsTool()
        results = await tool.search("metformin alzheimer")

        assert "Recruiting" in results[0].content

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_empty_results(self):
        """Search should handle empty results gracefully."""
        respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
            return_value=Response(200, json={"studies": []})
        )

        tool = ClinicalTrialsTool()
        results = await tool.search("nonexistent query xyz")

        assert results == []

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_api_error(self):
        """Search should raise SearchError on API failure."""
        from src.utils.exceptions import SearchError

        respx.get("https://clinicaltrials.gov/api/v2/studies").mock(
            return_value=Response(500, text="Internal Server Error")
        )

        tool = ClinicalTrialsTool()

        with pytest.raises(SearchError):
            await tool.search("metformin alzheimer")


class TestClinicalTrialsIntegration:
    """Integration tests (marked for separate run)."""

    @pytest.mark.integration
    @pytest.mark.asyncio
    async def test_real_api_call(self):
        """Test actual API call (requires network)."""
        tool = ClinicalTrialsTool()
        results = await tool.search("metformin diabetes", max_results=3)

        assert len(results) > 0
        assert all(isinstance(r, Evidence) for r in results)
        assert all(r.citation.source == "clinicaltrials" for r in results)
```
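
The suite above relies on a custom `integration` marker. If the project does not already register it, pytest will warn about an unknown mark. One possible way to register it is a `pytest_configure` hook; the `conftest.py` location below is an assumption, and the project may well declare its markers in its pytest configuration instead.

```python
# conftest.py (hypothetical location; the spec does not say where markers are registered)


def pytest_configure(config):
    """Register the custom marker so pytest does not warn about it."""
    config.addinivalue_line(
        "markers",
        "integration: tests that call real external APIs (need network)",
    )
```

With the marker registered, `uv run pytest tests/unit/ -m "not integration"` would keep the network-dependent test out of an offline run.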

---

## 6. Integration with SearchHandler

### 6.1 Update Example Files

```python
# examples/search_demo/run_search.py
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.pubmed import PubMedTool
from src.tools.search_handler import SearchHandler

search_handler = SearchHandler(
    tools=[PubMedTool(), ClinicalTrialsTool()],
    timeout=30.0
)
```
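
The spec does not show `SearchHandler` internals, so the following is only an illustration of how a handler with this constructor shape might fan a query out to several tools and record which sources answered. `StubTool` and `search_all` are hypothetical stand-ins, and the real `SearchHandler.execute` may behave differently.

```python
import asyncio


class StubTool:
    """Stand-in for PubMedTool / ClinicalTrialsTool (hypothetical)."""

    def __init__(self, name: str, results: list[str], fail: bool = False) -> None:
        self.name = name
        self._results = results
        self._fail = fail

    async def search(self, query: str, max_results: int = 10) -> list[str]:
        if self._fail:
            raise RuntimeError(f"{self.name} unavailable")
        return self._results[:max_results]


async def search_all(
    tools: list[StubTool],
    query: str,
    max_results_per_tool: int = 10,
    timeout: float = 30.0,
) -> tuple[list[str], list[str], list[str]]:
    """Query every tool concurrently; one failure does not sink the rest."""
    outcomes = await asyncio.gather(
        *(asyncio.wait_for(t.search(query, max_results_per_tool), timeout) for t in tools),
        return_exceptions=True,
    )
    evidence: list[str] = []
    searched: list[str] = []
    errors: list[str] = []
    for tool, outcome in zip(tools, outcomes):
        if isinstance(outcome, BaseException):
            errors.append(f"{tool.name}: {outcome}")
        else:
            searched.append(tool.name)
            evidence.extend(outcome)
    return evidence, searched, errors


tools = [
    StubTool("pubmed", ["paper A", "paper B"]),
    StubTool("clinicaltrials", [], fail=True),
]
evidence, searched, errors = asyncio.run(search_all(tools, "metformin alzheimer"))
print(searched, errors)
```

Collecting exceptions instead of raising is what would let a PubMed result set survive a ClinicalTrials.gov outage, which matches the `sources_searched` and `errors` fields asserted in the Phase 9 test.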

### 6.2 Update SearchResult Type

```python
# src/utils/models.py
sources_searched: list[Literal["pubmed", "clinicaltrials"]]
```

---

## 7. Definition of Done

Phase 10 is **COMPLETE** when:

- [ ] `src/tools/clinicaltrials.py` implemented
- [ ] Unit tests in `tests/unit/tools/test_clinicaltrials.py`
- [ ] Integration test marked with `@pytest.mark.integration`
- [ ] SearchHandler updated to include ClinicalTrialsTool
- [ ] Type definitions updated in models.py
- [ ] Example files updated
- [ ] All unit tests pass
- [ ] Lints pass
- [ ] Manual verification with real API

---

## 8. Verification Commands

```bash
# 1. Run unit tests
uv run pytest tests/unit/tools/test_clinicaltrials.py -v

# 2. Run integration test (requires network)
uv run pytest tests/unit/tools/test_clinicaltrials.py -v -m integration

# 3. Run full test suite
uv run pytest tests/unit/ -v

# 4. Run example
source .env && uv run python examples/search_demo/run_search.py "metformin alzheimer"
# Should show results from BOTH PubMed AND ClinicalTrials.gov
```

---

## 9. Value Delivered

| Before | After |
|--------|-------|
| Papers only | Papers + Clinical Trials |
| "Drug X might help" | "Drug X is in Phase II trial" |
| No trial status | Recruiting/Completed/Terminated |
| No phase info | Phase I/II/III evidence strength |

**Demo pitch addition**:
> "DeepCritical searches PubMed for peer-reviewed evidence AND ClinicalTrials.gov for 400,000+ clinical trials."
docs/implementation/11_phase_biorxiv.md
ADDED
@@ -0,0 +1,572 @@
|
| 1 |
+
# Phase 11 Implementation Spec: bioRxiv Preprint Integration
|
| 2 |
+
|
| 3 |
+
**Goal**: Add cutting-edge preprint search for the latest research.
|
| 4 |
+
**Philosophy**: "Preprints are where breakthroughs appear first."
|
| 5 |
+
**Prerequisite**: Phase 10 complete (ClinicalTrials.gov working)
|
| 6 |
+
**Estimated Time**: 2-3 hours
|
| 7 |
+
|
| 8 |
+
---
|
| 9 |
+
|
| 10 |
+
## 1. Why bioRxiv?
|
| 11 |
+
|
| 12 |
+
### Scientific Value
|
| 13 |
+
|
| 14 |
+
| Feature | Value for Drug Repurposing |
|
| 15 |
+
|---------|---------------------------|
|
| 16 |
+
| **Cutting-edge research** | 6-12 months ahead of PubMed |
|
| 17 |
+
| **Rapid publication** | Days, not months |
|
| 18 |
+
| **Free full-text** | Complete papers, not just abstracts |
|
| 19 |
+
| **medRxiv included** | Medical preprints via same API |
|
| 20 |
+
| **No API key required** | Free and open |
|
| 21 |
+
|
| 22 |
+
### The Preprint Advantage
|
| 23 |
+
|
| 24 |
+
```
Traditional Publication Timeline:
  Research → Submit → Review → Revise → Accept → Publish
  |___________________________ 6-18 months _______________|

Preprint Timeline:
  Research → Upload → Available
  |______ 1-3 days ______|
```
|
| 33 |
+
|
| 34 |
+
**For drug repurposing**: Preprints contain the newest hypotheses and evidence!
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## 2. API Specification
|
| 39 |
+
|
| 40 |
+
### Endpoint
|
| 41 |
+
|
| 42 |
+
```
|
| 43 |
+
URL template: https://api.biorxiv.org/details/[server]/[interval]/[cursor]/[format]
|
| 44 |
+
```
|
| 45 |
+
|
| 46 |
+
### Servers
|
| 47 |
+
|
| 48 |
+
| Server | Content |
|
| 49 |
+
|--------|---------|
|
| 50 |
+
| `biorxiv` | Biology preprints |
|
| 51 |
+
| `medrxiv` | Medical preprints (more relevant for us!) |
|
| 52 |
+
|
| 53 |
+
### Interval Formats
|
| 54 |
+
|
| 55 |
+
| Format | Example | Description |
|
| 56 |
+
|--------|---------|-------------|
|
| 57 |
+
| Date range | `2024-01-01/2024-12-31` | Papers between dates |
|
| 58 |
+
| Recent N | `50` | Most recent N papers |
|
| 59 |
+
| Recent N days | `30d` | Papers from last N days |
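
A minimal sketch of how the three interval forms slot into the endpoint template. The helper names `details_url` and `date_range_interval` are hypothetical (they are not part of the spec), and the fixed date is only there to make the example reproducible.

```python
from datetime import date, timedelta

# Assumed constant; mirrors the endpoint template in the Endpoint section.
DETAILS_URL = "https://api.biorxiv.org/details"


def date_range_interval(days: int, today: date) -> str:
    """Return a 'YYYY-MM-DD/YYYY-MM-DD' interval covering the last `days` days."""
    start = today - timedelta(days=days)
    return f"{start.isoformat()}/{today.isoformat()}"


def details_url(server: str, interval: str, cursor: int = 0, fmt: str = "json") -> str:
    """Fill in the [server]/[interval]/[cursor]/[format] template."""
    if server not in ("biorxiv", "medrxiv"):
        raise ValueError(f"unknown server: {server!r}")
    return f"{DETAILS_URL}/{server}/{interval}/{cursor}/{fmt}"


# One URL per interval form from the table.
by_dates = details_url("medrxiv", "2024-01-01/2024-12-31")
by_count = details_url("medrxiv", "50")
by_days = details_url("biorxiv", "30d")
last_90 = details_url("medrxiv", date_range_interval(90, date(2024, 3, 31)))
```

Passing `today` explicitly, rather than calling `date.today()` inside the helper, keeps the interval logic easy to unit test.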
|
| 60 |
+
|
| 61 |
+
### Response Format
|
| 62 |
+
|
| 63 |
+
```json
{
  "collection": [
    {
      "doi": "10.1101/2024.01.15.123456",
      "title": "Metformin repurposing for neurodegeneration",
      "authors": "Smith, J; Jones, A",
      "date": "2024-01-15",
      "category": "neuroscience",
      "abstract": "We investigated metformin's potential..."
    }
  ],
  "messages": [{"status": "ok", "count": 100}]
}
```
|
| 78 |
+
|
| 79 |
+
### Rate Limits
|
| 80 |
+
|
| 81 |
+
- No official limit, but be respectful
|
| 82 |
+
- Results paginated (100 per call)
|
| 83 |
+
- Use cursor for pagination
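
Cursor pagination could be sketched roughly as follows. This assumes the cursor is the zero-based offset of the first result on a page and that a page shorter than 100 items marks the end. `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the real HTTP call so the loop can be exercised offline.

```python
from typing import Callable

PAGE_SIZE = 100  # per the note above: 100 results per call


def fetch_all(
    fetch_page: Callable[[int], list[dict]], max_papers: int = 1000
) -> list[dict]:
    """Advance the cursor until a short page arrives or the cap is reached."""
    papers: list[dict] = []
    cursor = 0
    while len(papers) < max_papers:
        page = fetch_page(cursor)
        papers.extend(page)
        if len(page) < PAGE_SIZE:
            break  # a short (or empty) page means there is nothing left
        cursor += len(page)
    return papers[:max_papers]


def fake_fetch_page(cursor: int) -> list[dict]:
    """Stand-in for the HTTP request: serves 250 fake papers, 100 per page."""
    total = 250
    end = min(cursor + PAGE_SIZE, total)
    return [{"doi": f"10.1101/fake.{i}"} for i in range(cursor, end)]


all_papers = fetch_all(fake_fetch_page)
capped = fetch_all(fake_fetch_page, max_papers=150)
```

The `max_papers` cap is a design choice to keep the number of requests bounded, in keeping with the "be respectful" note. A real fetcher would also want a short pause between calls.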
|
| 84 |
+
|
| 85 |
+
### Documentation
|
| 86 |
+
|
| 87 |
+
- [bioRxiv API](https://api.biorxiv.org/)
|
| 88 |
+
- [medrxivr R package docs](https://docs.ropensci.org/medrxivr/)
|
| 89 |
+
|
| 90 |
+
---
|
| 91 |
+
|
| 92 |
+
## 3. Search Strategy
|
| 93 |
+
|
| 94 |
+
### Challenge: bioRxiv API Limitations
|
| 95 |
+
|
| 96 |
+
The bioRxiv API does NOT support keyword search directly. It returns papers by:
|
| 97 |
+
- Date range
|
| 98 |
+
- Recent count
|
| 99 |
+
|
| 100 |
+
### Solution: Client-Side Filtering
|
| 101 |
+
|
| 102 |
+
```python
|
| 103 |
+
# Strategy:
|
| 104 |
+
# 1. Fetch recent papers (e.g., last 90 days)
|
| 105 |
+
# 2. Filter by keyword matching in title/abstract
|
| 106 |
+
# 3. Use embeddings for semantic matching (leverage Phase 6!)
|
| 107 |
+
```
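
Step 2 of the strategy might look like the sketch below. It is a variant for illustration, not the spec's implementation: it matches whole words rather than substrings (so a short term such as "art" does not match "heart") and weights title hits above abstract hits. The names `score_paper` and `rank_papers` and the 2/1 weights are assumptions.

```python
import re


def score_paper(paper: dict, terms: list[str]) -> int:
    """Score 2 per term found in the title, else 1 per term found in the abstract."""
    title_words = set(re.findall(r"\w+", paper.get("title", "").lower()))
    abstract_words = set(re.findall(r"\w+", paper.get("abstract", "").lower()))
    score = 0
    for term in terms:
        if term in title_words:
            score += 2
        elif term in abstract_words:
            score += 1
    return score


def rank_papers(papers: list[dict], terms: list[str], max_results: int = 10) -> list[dict]:
    """Drop papers with no matching term and return the best-scoring ones first."""
    scored = [(score_paper(p, terms), p) for p in papers]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:max_results]]


papers = [
    {"title": "Heart failure outcomes", "abstract": "A cohort study of heart failure."},
    {"title": "Metformin and cognition", "abstract": "Metformin use in Alzheimer disease."},
    {"title": "Diabetes therapy review", "abstract": "We discuss metformin and insulin."},
]
ranked = rank_papers(papers, ["metformin", "art"])
```

Whole-word matching trades recall for precision: it would miss inflected forms such as "repurposed" for the term "repurposing", which is where the embedding step (step 3) could help.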
|
| 108 |
+
|
| 109 |
+
### Alternative: Content Search Endpoint
|
| 110 |
+
|
| 111 |
+
```
|
| 112 |
+
https://api.biorxiv.org/pubs/[server]/[doi_prefix]
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
For searching, we could also query this `pubs` endpoint and apply the same client-side filtering.
|
| 116 |
+
|
| 117 |
+
---
|
| 118 |
+
|
| 119 |
+
## 4. Data Model
|
| 120 |
+
|
| 121 |
+
### 4.1 Update Citation Source Type (`src/utils/models.py`)
|
| 122 |
+
|
| 123 |
+
```python
|
| 124 |
+
# After Phase 11
|
| 125 |
+
source: Literal["pubmed", "clinicaltrials", "biorxiv"]
|
| 126 |
+
```
|
| 127 |
+
|
| 128 |
+
### 4.2 Evidence from Preprints
|
| 129 |
+
|
| 130 |
+
```python
Evidence(
    content=abstract[:2000],
    citation=Citation(
        source="biorxiv",  # also used for medRxiv results; "medrxiv" is not in the Literal
        title=title,
        url=f"https://doi.org/{doi}",
        date=date,
        authors=authors.split("; ")[:5]
    ),
    relevance=0.75  # Preprints slightly lower than peer-reviewed
)
```
|
| 143 |
+
|
| 144 |
+
---
|
| 145 |
+
|
| 146 |
+
## 5. Implementation
|
| 147 |
+
|
| 148 |
+
### 5.1 bioRxiv Tool (`src/tools/biorxiv.py`)
|
| 149 |
+
|
| 150 |
+
```python
"""bioRxiv/medRxiv preprint search tool."""

import re
from datetime import datetime, timedelta

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

from src.utils.exceptions import SearchError
from src.utils.models import Citation, Evidence


class BioRxivTool:
    """Search tool for bioRxiv and medRxiv preprints."""

    BASE_URL = "https://api.biorxiv.org/details"
    # Use medRxiv for medical/clinical content (more relevant for drug repurposing)
    DEFAULT_SERVER = "medrxiv"
    # Fetch papers from last N days
    DEFAULT_DAYS = 90

    def __init__(self, server: str = DEFAULT_SERVER, days: int = DEFAULT_DAYS):
        """
        Initialize bioRxiv tool.

        Args:
            server: "biorxiv" or "medrxiv"
            days: How many days back to search
        """
        self.server = server
        self.days = days

    @property
    def name(self) -> str:
        return "biorxiv"

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=1, max=10),
        reraise=True,
    )
    async def search(self, query: str, max_results: int = 10) -> list[Evidence]:
        """
        Search bioRxiv/medRxiv for preprints matching query.

        Note: bioRxiv API doesn't support keyword search directly.
        We fetch recent papers and filter client-side.

        Args:
            query: Search query (keywords)
            max_results: Maximum results to return

        Returns:
            List of Evidence objects from preprints
        """
        # Build date range for last N days
        now = datetime.now()
        end_date = now.strftime("%Y-%m-%d")
        start_date = (now - timedelta(days=self.days)).strftime("%Y-%m-%d")
        interval = f"{start_date}/{end_date}"

        # Fetch recent papers (cursor 0: first page only, up to 100 results)
        url = f"{self.BASE_URL}/{self.server}/{interval}/0/json"

        async with httpx.AsyncClient(timeout=30.0) as client:
            try:
                response = await client.get(url)
                response.raise_for_status()
            except httpx.HTTPError as e:
                # HTTPError covers bad status codes and transport failures
                raise SearchError(f"bioRxiv search failed: {e}") from e

        data = response.json()
        papers = data.get("collection", [])

        # Filter papers by query keywords
        query_terms = self._extract_terms(query)
        matching = self._filter_by_keywords(papers, query_terms, max_results)

        return [self._paper_to_evidence(paper) for paper in matching]

    def _extract_terms(self, query: str) -> list[str]:
        """Extract search terms from query."""
        # Simple tokenization, lowercase
        terms = re.findall(r'\b\w+\b', query.lower())
        # Filter out common stop words
        stop_words = {'the', 'a', 'an', 'in', 'on', 'for', 'and', 'or', 'of', 'to'}
        return [t for t in terms if t not in stop_words and len(t) > 2]

    def _filter_by_keywords(
        self, papers: list[dict], terms: list[str], max_results: int
    ) -> list[dict]:
        """Filter papers that contain query terms in title or abstract."""
        scored_papers = []

        for paper in papers:
            title = paper.get("title", "").lower()
            abstract = paper.get("abstract", "").lower()
            text = f"{title} {abstract}"

            # Count matching terms (substring match)
            matches = sum(1 for term in terms if term in text)

            if matches > 0:
                scored_papers.append((matches, paper))

        # Sort by match count (descending)
        scored_papers.sort(key=lambda x: x[0], reverse=True)

        return [paper for _, paper in scored_papers[:max_results]]

    def _paper_to_evidence(self, paper: dict) -> Evidence:
        """Convert a preprint paper to Evidence."""
        doi = paper.get("doi", "")
        title = paper.get("title", "Untitled")
        authors_str = paper.get("authors", "Unknown")
        date = paper.get("date", "Unknown")
        abstract = paper.get("abstract", "No abstract available.")
        category = paper.get("category", "")

        # Parse authors (format: "Smith, J; Jones, A")
        authors = [a.strip() for a in authors_str.split(";")][:5]

        # Note this is a preprint in the content; add "..." only when truncating
        snippet = abstract[:1800] + ("..." if len(abstract) > 1800 else "")
        content = (
            "[PREPRINT - Not peer-reviewed] "
            f"{snippet} "
            f"Category: {category}."
        )

        return Evidence(
            content=content[:2000],
            citation=Citation(
                source="biorxiv",
                title=title[:500],
                # Fall back to the configured server's home page when no DOI is present
                url=f"https://doi.org/{doi}" if doi else f"https://www.{self.server}.org/",
                date=date,
                authors=authors,
            ),
            relevance=0.75,  # Slightly lower than peer-reviewed
        )
```
|
| 291 |
+
|
| 292 |
+
---
|
| 293 |
+
|
| 294 |
+
## 6. TDD Test Suite
|
| 295 |
+
|
| 296 |
+
### 6.1 Unit Tests (`tests/unit/tools/test_biorxiv.py`)
|
| 297 |
+
|
| 298 |
+
```python
"""Unit tests for bioRxiv tool."""

import pytest
import respx
from httpx import Response

from src.tools.biorxiv import BioRxivTool
from src.utils.models import Evidence


@pytest.fixture
def mock_biorxiv_response():
    """Mock bioRxiv API response."""
    return {
        "collection": [
            {
                "doi": "10.1101/2024.01.15.24301234",
                "title": "Metformin repurposing for Alzheimer's disease: a systematic review",
                "authors": "Smith, John; Jones, Alice; Brown, Bob",
                "date": "2024-01-15",
                "category": "neurology",
                "abstract": "Background: Metformin has shown neuroprotective effects. "
                            "We conducted a systematic review of metformin's potential "
                            "for Alzheimer's disease treatment."
            },
            {
                "doi": "10.1101/2024.01.10.24301111",
                "title": "COVID-19 vaccine efficacy study",
                "authors": "Wilson, C",
                "date": "2024-01-10",
                "category": "infectious diseases",
                "abstract": "This study evaluates COVID-19 vaccine efficacy."
            }
        ],
        "messages": [{"status": "ok", "count": 2}]
    }


class TestBioRxivTool:
    """Tests for BioRxivTool."""

    def test_tool_name(self):
        """Tool should have correct name."""
        tool = BioRxivTool()
        assert tool.name == "biorxiv"

    def test_default_server_is_medrxiv(self):
        """Default server should be medRxiv for medical relevance."""
        tool = BioRxivTool()
        assert tool.server == "medrxiv"

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_returns_evidence(self, mock_biorxiv_response):
        """Search should return Evidence objects."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()
        results = await tool.search("metformin alzheimer", max_results=5)

        assert len(results) == 1  # Only the matching paper
        assert isinstance(results[0], Evidence)
        assert results[0].citation.source == "biorxiv"
        assert "metformin" in results[0].citation.title.lower()

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_filters_by_keywords(self, mock_biorxiv_response):
        """Search should filter papers by query keywords."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()

        # Search for metformin - should match first paper
        results = await tool.search("metformin")
        assert len(results) == 1
        assert "metformin" in results[0].citation.title.lower()

        # Search for COVID - should match second paper
        results = await tool.search("covid vaccine")
        assert len(results) == 1
        assert "covid" in results[0].citation.title.lower()

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_marks_as_preprint(self, mock_biorxiv_response):
        """Evidence content should note it's a preprint."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json=mock_biorxiv_response)
        )

        tool = BioRxivTool()
        results = await tool.search("metformin")

        assert "PREPRINT" in results[0].content
        assert "Not peer-reviewed" in results[0].content

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_empty_results(self):
        """Search should handle empty results gracefully."""
        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(200, json={"collection": [], "messages": []})
        )

        tool = BioRxivTool()
        results = await tool.search("xyznonexistent")

        assert results == []

    @pytest.mark.asyncio
    @respx.mock
    async def test_search_api_error(self):
        """Search should raise SearchError on API failure."""
        from src.utils.exceptions import SearchError

        respx.get(url__startswith="https://api.biorxiv.org/details").mock(
            return_value=Response(500, text="Internal Server Error")
        )

        tool = BioRxivTool()

        with pytest.raises(SearchError):
            await tool.search("metformin")

    def test_extract_terms(self):
        """Should extract meaningful search terms."""
        tool = BioRxivTool()

        terms = tool._extract_terms("metformin for Alzheimer's disease")

        assert "metformin" in terms
        assert "alzheimer" in terms
        assert "disease" in terms
        assert "for" not in terms  # Stop word
        assert "the" not in terms  # Stop word


class TestBioRxivIntegration:
    """Integration tests (marked for separate run)."""

    @pytest.mark.integration
    @pytest.mark.asyncio
    async def test_real_api_call(self):
        """Test actual API call (requires network)."""
        tool = BioRxivTool(days=30)  # Last 30 days
        results = await tool.search("diabetes", max_results=3)

        # May or may not find results depending on recent papers
        assert isinstance(results, list)
        for r in results:
            assert isinstance(r, Evidence)
            assert r.citation.source == "biorxiv"
```
|
| 457 |
+
|
| 458 |
+
---
|
| 459 |
+
|
| 460 |
+
## 7. Integration with SearchHandler
|
| 461 |
+
|
| 462 |
+
### 7.1 Final SearchHandler Configuration
|
| 463 |
+
|
| 464 |
+
```python
# examples/search_demo/run_search.py
from src.tools.biorxiv import BioRxivTool
from src.tools.clinicaltrials import ClinicalTrialsTool
from src.tools.pubmed import PubMedTool
from src.tools.search_handler import SearchHandler

search_handler = SearchHandler(
    tools=[
        PubMedTool(),           # Peer-reviewed papers
        ClinicalTrialsTool(),   # Clinical trials
        BioRxivTool(),          # Preprints (cutting edge)
    ],
    timeout=30.0
)
```
|
| 480 |
+
|
| 481 |
+
### 7.2 Final Type Definition
|
| 482 |
+
|
| 483 |
+
```python
|
| 484 |
+
# src/utils/models.py
|
| 485 |
+
sources_searched: list[Literal["pubmed", "clinicaltrials", "biorxiv"]]
|
| 486 |
+
```
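
As a rough illustration of what the `Literal` constraint implies, the sketch below checks a source name against the allowed values at runtime. The names `SourceName` and `validate_source` are hypothetical, and the real models may rely on their own validation instead. Note that `"medrxiv"` is rejected, so medRxiv results have to be labelled `"biorxiv"` unless the `Literal` is extended.

```python
from typing import Literal, get_args

# Hypothetical alias mirroring the Literal used in the type definition above.
SourceName = Literal["pubmed", "clinicaltrials", "biorxiv"]


def validate_source(name: str) -> str:
    """Return `name` unchanged if it is an allowed source, else raise ValueError."""
    allowed = get_args(SourceName)
    if name not in allowed:
        raise ValueError(f"unknown source {name!r}; expected one of {allowed}")
    return name


accepted = validate_source("biorxiv")

try:
    validate_source("medrxiv")
    medrxiv_rejected = False
except ValueError:
    medrxiv_rejected = True
```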
|
| 487 |
+
|
| 488 |
+
---
|
| 489 |
+
|
| 490 |
+
## 8. Definition of Done
|
| 491 |
+
|
| 492 |
+
Phase 11 is **COMPLETE** when:
|
| 493 |
+
|
| 494 |
+
- [ ] `src/tools/biorxiv.py` implemented
|
| 495 |
+
- [ ] Unit tests in `tests/unit/tools/test_biorxiv.py`
|
| 496 |
+
- [ ] Integration test marked with `@pytest.mark.integration`
|
| 497 |
+
- [ ] SearchHandler updated to include BioRxivTool
|
| 498 |
+
- [ ] Type definitions updated in models.py
|
| 499 |
+
- [ ] Example files updated
|
| 500 |
+
- [ ] All unit tests pass
|
| 501 |
+
- [ ] Lints pass
|
| 502 |
+
- [ ] Manual verification with real API
|
| 503 |
+
|
| 504 |
+
---
|
| 505 |
+
|
| 506 |
+
## 9. Verification Commands
|
| 507 |
+
|
| 508 |
+
```bash
|
| 509 |
+
# 1. Run unit tests
|
| 510 |
+
uv run pytest tests/unit/tools/test_biorxiv.py -v
|
| 511 |
+
|
| 512 |
+
# 2. Run integration test (requires network)
|
| 513 |
+
uv run pytest tests/unit/tools/test_biorxiv.py -v -m integration
|
| 514 |
+
|
| 515 |
+
# 3. Run full test suite
|
| 516 |
+
uv run pytest tests/unit/ -v
|
| 517 |
+
|
| 518 |
+
# 4. Run example with all three sources
|
| 519 |
+
source .env && uv run python examples/search_demo/run_search.py "metformin diabetes"
|
| 520 |
+
# Should show results from PubMed, ClinicalTrials.gov, AND bioRxiv/medRxiv
|
| 521 |
+
```
|
| 522 |
+
|
| 523 |
+
---
|
| 524 |
+
|
| 525 |
+
## 10. Value Delivered
|
| 526 |
+
|
| 527 |
+
| Before | After |
|
| 528 |
+
|--------|-------|
|
| 529 |
+
| Only published papers | Published + Preprints |
|
| 530 |
+
| 6-18 month lag | Near real-time research |
|
| 531 |
+
| Miss cutting-edge | Catch breakthroughs early |
|
| 532 |
+
|
| 533 |
+
**Demo pitch (final)**:
|
| 534 |
+
> "DeepCritical searches PubMed for peer-reviewed evidence, ClinicalTrials.gov for 400,000+ clinical trials, and bioRxiv/medRxiv for cutting-edge preprints - then uses LLMs to generate mechanistic hypotheses and synthesize findings into publication-quality reports."
|
| 535 |
+
|
| 536 |
+
---
|
| 537 |
+
|
| 538 |
+
## 11. Complete Source Architecture (After Phase 11)
|
| 539 |
+
|
| 540 |
+
```
User Query: "Can metformin treat Alzheimer's?"
                    |
                    v
              SearchHandler
                    |
    ┌───────────────┼───────────────┐
    |               |               |
    v               v               v
PubMedTool    ClinicalTrials   BioRxivTool
    |             Tool              |
    |               |               |
    v               v               v
"15 peer-     "3 Phase II     "2 preprints
 reviewed      trials          from last
 papers"       recruiting"     90 days"
    |               |               |
    └───────────────┼───────────────┘
                    |
                    v
              Evidence Pool
                    |
                    v
     EmbeddingService.deduplicate()
                    |
                    v
 HypothesisAgent → JudgeAgent → ReportAgent
                    |
                    v
       Structured Research Report
```
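
The fan-out step in the diagram could be sketched as below. This is an assumption-laden illustration, not the project's `SearchHandler`: `StubTool` and `search_all` are hypothetical stand-ins, evidence is represented by plain strings, and the choice to tolerate a failing source instead of aborting the whole search is a design assumption.

```python
import asyncio


class StubTool:
    """Hypothetical stand-in for PubMedTool, ClinicalTrialsTool, or BioRxivTool."""

    def __init__(self, name: str, results: list[str], fail: bool = False):
        self.name = name
        self._results = results
        self._fail = fail

    async def search(self, query: str, max_results: int = 10) -> list[str]:
        await asyncio.sleep(0)  # yield control, as a real HTTP call would
        if self._fail:
            raise RuntimeError(f"{self.name} unavailable")
        return self._results[:max_results]


async def search_all(tools, query: str, timeout: float = 30.0):
    """Query every tool concurrently; return (evidence pool, names of failed tools)."""
    tasks = [asyncio.wait_for(tool.search(query), timeout=timeout) for tool in tools]
    # return_exceptions=True keeps one failing source from cancelling the others
    outcomes = await asyncio.gather(*tasks, return_exceptions=True)
    pool, failed = [], []
    for tool, outcome in zip(tools, outcomes):
        if isinstance(outcome, Exception):
            failed.append(tool.name)
        else:
            pool.extend(outcome)
    return pool, failed


tools = [
    StubTool("pubmed", ["paper A", "paper B"]),
    StubTool("clinicaltrials", ["trial C"]),
    StubTool("biorxiv", [], fail=True),
]
pool, failed = asyncio.run(search_all(tools, "metformin alzheimer"))
```

Collecting failures separately lets the caller report which sources were actually searched, which lines up with the `sources_searched` field in the models.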
|
| 571 |
+
|
| 572 |
+
**This is the Gucci Banger stack.**
|
docs/implementation/roadmap.md
CHANGED
|
@@ -188,9 +188,12 @@ Structured Research Report
|
|
| 188 |
3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)** ✅
|
| 189 |
4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)** ✅
|
| 190 |
5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** ✅
|
| 191 |
-
6. **[Phase 6 Spec: Embeddings & Semantic Search](06_phase_embeddings.md)**
|
| 192 |
-
7. **[Phase 7 Spec: Hypothesis Agent](07_phase_hypothesis.md)**
|
| 193 |
-
8. **[Phase 8 Spec: Report Agent](08_phase_report.md)**
|
| 194 |
|
| 195 |
---
|
| 196 |
|
|
@@ -203,8 +206,11 @@ Structured Research Report
|
|
| 203 |
| Phase 3: Judge | ✅ COMPLETE | LLM evidence assessment |
|
| 204 |
| Phase 4: UI & Loop | ✅ COMPLETE | Working Gradio app |
|
| 205 |
| Phase 5: Magentic | ✅ COMPLETE | Multi-agent orchestration |
|
| 206 |
-
| Phase 6: Embeddings |
|
| 207 |
-
| Phase 7: Hypothesis |
|
| 208 |
-
| Phase 8: Report |
|
| 209 |
-
|
| 210 |
-
|
| 188 |
3. **[Phase 3 Spec: Judge Slice](03_phase_judge.md)** ✅
|
| 189 |
4. **[Phase 4 Spec: UI & Loop](04_phase_ui.md)** ✅
|
| 190 |
5. **[Phase 5 Spec: Magentic Integration](05_phase_magentic.md)** ✅
|
| 191 |
+
6. **[Phase 6 Spec: Embeddings & Semantic Search](06_phase_embeddings.md)** ✅
|
| 192 |
+
7. **[Phase 7 Spec: Hypothesis Agent](07_phase_hypothesis.md)** ✅
|
| 193 |
+
8. **[Phase 8 Spec: Report Agent](08_phase_report.md)** ✅
|
| 194 |
+
9. **[Phase 9 Spec: Remove DuckDuckGo](09_phase_source_cleanup.md)**
|
| 195 |
+
10. **[Phase 10 Spec: ClinicalTrials.gov](10_phase_clinicaltrials.md)**
|
| 196 |
+
11. **[Phase 11 Spec: bioRxiv Preprints](11_phase_biorxiv.md)**
|
| 197 |
|
| 198 |
---
|
| 199 |
|
|
|
|
| 206 |
| Phase 3: Judge | ✅ COMPLETE | LLM evidence assessment |
|
| 207 |
| Phase 4: UI & Loop | ✅ COMPLETE | Working Gradio app |
|
| 208 |
| Phase 5: Magentic | ✅ COMPLETE | Multi-agent orchestration |
|
| 209 |
+
| Phase 6: Embeddings | ✅ COMPLETE | Semantic search + ChromaDB |
|
| 210 |
+
| Phase 7: Hypothesis | ✅ COMPLETE | Mechanistic reasoning chains |
|
| 211 |
+
| Phase 8: Report | ✅ COMPLETE | Structured scientific reports |
|
| 212 |
+
| Phase 9: Source Cleanup | SPEC READY | Remove DuckDuckGo |
|
| 213 |
+
| Phase 10: ClinicalTrials | SPEC READY | ClinicalTrials.gov API |
|
| 214 |
+
| Phase 11: bioRxiv | SPEC READY | Preprint search |
|
| 215 |
+
|
| 216 |
+
*Phases 1-8 COMPLETE. Phases 9-11 will add multi-source credibility.*
|
docs/index.md
CHANGED
|
@@ -14,10 +14,17 @@ AI-powered deep research system for accelerating drug repurposing discovery.
|
|
| 14 |
|
| 15 |
### Implementation (Start Here!)
|
| 16 |
- **[Roadmap](implementation/roadmap.md)** - Phased execution plan with TDD
|
| 17 |
-
- **[Phase 1: Foundation](implementation/01_phase_foundation.md)** - Tooling, config, first tests
|
| 18 |
-
- **[Phase 2: Search](implementation/02_phase_search.md)** - PubMed
|
| 19 |
-
- **[Phase 3: Judge](implementation/03_phase_judge.md)** - LLM evidence assessment
|
| 20 |
-
- **[Phase 4: UI](implementation/04_phase_ui.md)** - Orchestrator + Gradio
|
| 21 |
|
| 22 |
### Guides
|
| 23 |
- [Setup Guide](guides/setup.md) (coming soon)
|
|
@@ -76,6 +83,13 @@ User Question → Research Agent (Orchestrator)
|
|
| 76 |
|
| 77 |
## Status
|
| 78 |
|
| 79 |
**Architecture Review**: PASSED (98-99/100)
|
| 80 |
-
**
|
| 81 |
-
**Next**:
|
|
|
|
| 14 |
|
| 15 |
### Implementation (Start Here!)
|
| 16 |
- **[Roadmap](implementation/roadmap.md)** - Phased execution plan with TDD
|
| 17 |
+
- **[Phase 1: Foundation](implementation/01_phase_foundation.md)** ✅ - Tooling, config, first tests
|
| 18 |
+
- **[Phase 2: Search](implementation/02_phase_search.md)** ✅ - PubMed search
|
| 19 |
+
- **[Phase 3: Judge](implementation/03_phase_judge.md)** ✅ - LLM evidence assessment
|
| 20 |
+
- **[Phase 4: UI](implementation/04_phase_ui.md)** ✅ - Orchestrator + Gradio
|
| 21 |
+
- **[Phase 5: Magentic](implementation/05_phase_magentic.md)** ✅ - Multi-agent orchestration
|
| 22 |
+
- **[Phase 6: Embeddings](implementation/06_phase_embeddings.md)** ✅ - Semantic search + dedup
|
| 23 |
+
- **[Phase 7: Hypothesis](implementation/07_phase_hypothesis.md)** ✅ - Mechanistic reasoning
|
| 24 |
+
- **[Phase 8: Report](implementation/08_phase_report.md)** ✅ - Structured scientific reports
|
| 25 |
+
- **[Phase 9: Source Cleanup](implementation/09_phase_source_cleanup.md)** - Remove DuckDuckGo
|
| 26 |
+
- **[Phase 10: ClinicalTrials](implementation/10_phase_clinicaltrials.md)** - Clinical trials API
|
| 27 |
+
- **[Phase 11: bioRxiv](implementation/11_phase_biorxiv.md)** - Preprint search
|
| 28 |
|
| 29 |
### Guides
|
| 30 |
- [Setup Guide](guides/setup.md) (coming soon)
|
|
|
|
| 83 |
|
| 84 |
## Status
|
| 85 |
|
| 86 |
+
| Phase | Status |
|
| 87 |
+
|-------|--------|
|
| 88 |
+
| Phases 1-8 | ✅ COMPLETE |
|
| 89 |
+
| Phase 9: Remove DuckDuckGo | SPEC READY |
|
| 90 |
+
| Phase 10: ClinicalTrials.gov | SPEC READY |
|
| 91 |
+
| Phase 11: bioRxiv | SPEC READY |
|
| 92 |
+
|
| 93 |
**Architecture Review**: PASSED (98-99/100)
|
| 94 |
+
**Phases 1-8**: COMPLETE
|
| 95 |
+
**Next**: Phases 9-11 (Multi-Source Enhancement)
|