What it does
Instead of running your own crawlers, you search the archive, filter what you need (by time range, domain, URL patterns, language, blocking signals), and export ready-to-use datasets as HTML files + metadata.

Common use cases
- LLM training and RAG pipelines: Build or refresh training corpora from targeted web segments
- Search and indexing: Backfill indexes with historical content across large domain sets
- Search product augmentation: Improve coverage for sites with advanced blocking, supporting reliable page retrieval at scale
How it works
1. Run a search: filter by time range, domains, URL patterns, language, or signals (CAPTCHA, robots blocks, etc.)
2. Review the estimate: see matched file count, snapshot size, expected duration, and cost
3. Create and deliver a dump: export the snapshot as HTML files + metadata (URL, timestamp, collection attributes) to Amazon S3 or Azure Blob Storage, or via webhook
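The search → estimate → export workflow above can be sketched in code. This is a minimal illustrative sketch, not the product's actual API: every function name, field name, and parameter below is a hypothetical assumption chosen to mirror the three steps.

```python
# Hypothetical sketch of the search -> estimate -> export workflow.
# Function names, payload fields, and destination types are assumed
# for illustration; they are not the service's real API.

def build_search_filter(time_range, domains, url_pattern=None,
                        language=None, signals=None):
    """Step 1: assemble a search filter; only set fields are included."""
    f = {"time_range": time_range, "domains": domains}
    if url_pattern:
        f["url_pattern"] = url_pattern
    if language:
        f["language"] = language
    if signals:
        f["signals"] = signals  # e.g. ["captcha", "robots_block"]
    return f

def estimate(search_filter):
    """Step 2: stand-in for the estimate step. A real client would call
    the service; here we return the fields the estimate exposes
    (matched file count, snapshot size, duration, cost) as dummy values."""
    return {"matched_files": 0, "snapshot_bytes": 0,
            "expected_duration_s": 0, "cost_usd": 0.0}

def build_export_request(search_filter, destination):
    """Step 3: assemble a dump request delivering HTML files + metadata
    (URL, timestamp, collection attributes) to S3, Azure Blob Storage,
    or a webhook."""
    if destination["type"] not in {"s3", "azure_blob", "webhook"}:
        raise ValueError("unsupported destination")
    return {"filter": search_filter,
            "destination": destination,
            "format": "html+metadata"}

# Usage: German-language pages from two domains in 2023 that hit CAPTCHAs,
# exported to a (hypothetical) S3 bucket.
flt = build_search_filter(
    time_range=("2023-01-01", "2023-12-31"),
    domains=["example.com", "example.org"],
    language="de",
    signals=["captcha"],
)
est = estimate(flt)
req = build_export_request(flt, {"type": "s3", "bucket": "my-dumps"})
```

The point of the sketch is the shape of the flow: the same filter object feeds both the estimate (so you can check size and cost before paying) and the export request.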