Bellok committed
Commit 58c8726 · 1 Parent(s): 6abf740

refactor: reduce arxiv dataset limit to 50k for deployment efficiency


- Lowered arxiv_limit from 100,000 to 50,000 to balance paper coverage with faster deployment times
- Updated comments to reflect the new balanced capacity: a sufficient knowledge base without excessive ingestion time

Files changed (1)
  1. app.py +2 -2
app.py CHANGED
@@ -99,8 +99,8 @@ if len(documents) == 0:
     try:
         print(f"📦 Downloading {dataset} (timeout: 3 minutes)...")

-        # Scale up now that CPU embeddings are solid - 100k papers for massive knowledge base
-        arxiv_limit = 100000 if dataset == "arxiv" else None  # Maximum capacity now!
+        # Balance between coverage and deployment time - 50k arxiv papers plus all other packs
+        arxiv_limit = 50000 if dataset == "arxiv" else None  # Balanced capacity

        success = ingestor.ingest_dataset(dataset, arxiv_limit=arxiv_limit)
        if success:
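
For readers skimming the diff, a minimal sketch of how an ingestor can honor such a cap (a hypothetical helper, not the repository's actual ingest_dataset implementation):

import itertools
from typing import Iterable, Optional

def ingest_capped(records: Iterable[dict], limit: Optional[int] = None) -> int:
    """Illustrative only: consume records, stopping at `limit` when one is set."""
    count = 0
    # islice(records, None) streams everything; islice(records, 50000) caps at 50k.
    for record in itertools.islice(records, limit):
        # ... embed and index `record` here (stubbed in this sketch) ...
        count += 1
    return count

Called with limit=50000 this mirrors the new arxiv setting; limit=None leaves the other, smaller packs uncapped, matching the `else None` branch above.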