Bellok committed
Commit 58c8726 · 1 Parent(s): 6abf740

refactor: reduce arxiv dataset limit to 50k for deployment efficiency


- Lowered arxiv_limit from 100,000 to 50,000 to balance paper coverage with faster deployment times
- Updated comments to reflect the new balanced capacity: a sufficient knowledge base without excessive ingestion time

Files changed (1)
  1. app.py +2 -2
app.py CHANGED
@@ -99,8 +99,8 @@ if len(documents) == 0:
     try:
         print(f"📦 Downloading {dataset} (timeout: 3 minutes)...")

-        # Scale up now that CPU embeddings are solid - 100k papers for massive knowledge base
-        arxiv_limit = 100000 if dataset == "arxiv" else None  # Maximum capacity now!
+        # Balance between coverage and deployment time - 50k arxiv papers plus all other packs
+        arxiv_limit = 50000 if dataset == "arxiv" else None  # Balanced capacity

        success = ingestor.ingest_dataset(dataset, arxiv_limit=arxiv_limit)
        if success:
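
For readers skimming the diff, a minimal sketch of how an ingestor can honor such a cap (a hypothetical helper, not the repository's actual ingest_dataset implementation):

import itertools
from typing import Iterable, Optional

def ingest_capped(records: Iterable[dict], limit: Optional[int] = None) -> int:
    """Illustrative only: consume records, stopping at `limit` when one is set."""
    count = 0
    # islice(records, None) streams everything; islice(records, 50000) caps at 50k.
    for record in itertools.islice(records, limit):
        # ... embed and index `record` here (stubbed in this sketch) ...
        count += 1
    return count

Called with limit=50000 this mirrors the new arxiv setting; limit=None leaves the other, smaller packs uncapped, matching the `else None` branch above.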