What it does

Instead of running your own crawlers, you search the archive, filter for what you need (by time range, domain, URL pattern, language, or blocking signals), and export the results as ready-to-use datasets: HTML files + metadata.

Common use cases

  • LLM training and RAG pipelines: Build or refresh training corpora from targeted web segments
  • Search and indexing: Backfill indexes with historical content across large domain sets
  • Search product augmentation: Improve coverage for sites with advanced blocking, supporting reliable page retrieval at scale

How it works

Run a search

Filter by time range, domains, URL patterns, language, or signals (CAPTCHA, robots blocks, etc.)
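A filter set like this is easy to assemble programmatically. The sketch below is a minimal, hypothetical example in Python; the field names (`time_range`, `domains`, `signals`, and so on) are illustrative assumptions, not the product's actual request schema.

```python
from datetime import date

def build_search_filters(start, end, domains=None, url_pattern=None,
                         language=None, signals=None):
    """Assemble a search filter spec (hypothetical schema, for illustration)."""
    filters = {"time_range": {"from": start.isoformat(), "to": end.isoformat()}}
    if domains:
        filters["domains"] = sorted(domains)        # e.g. {"example.com"}
    if url_pattern:
        filters["url_pattern"] = url_pattern        # e.g. "/blog/*"
    if language:
        filters["language"] = language              # e.g. "en"
    if signals:
        filters["signals"] = sorted(signals)        # e.g. {"captcha", "robots_block"}
    return filters

spec = build_search_filters(
    date(2024, 1, 1), date(2024, 6, 30),
    domains={"example.com"}, language="en", signals={"captcha"},
)
```

Keeping the filter spec as plain data like this makes it easy to version alongside the rest of a pipeline.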

Review the estimate

See matched file count, snapshot size, expected duration, and cost
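Because the estimate arrives before any cost is incurred, it is a natural gate in an automated pipeline. The following is a hypothetical sketch: the estimate field names (`matched_files`, `snapshot_bytes`, `expected_duration_s`, `cost_usd`) are assumptions standing in for whatever the real response contains.

```python
def summarize_estimate(estimate, budget_usd):
    """Summarize a pre-run estimate and decide whether to proceed.

    `estimate` is a dict with illustrative (assumed) field names.
    """
    size_gb = estimate["snapshot_bytes"] / 1e9
    return {
        "files": estimate["matched_files"],
        "size_gb": round(size_gb, 2),
        "eta_minutes": estimate["expected_duration_s"] // 60,
        "proceed": estimate["cost_usd"] <= budget_usd,  # simple budget gate
    }

summary = summarize_estimate(
    {"matched_files": 120_000, "snapshot_bytes": 48_000_000_000,
     "expected_duration_s": 5400, "cost_usd": 95.0},
    budget_usd=100.0,
)
```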

Create and deliver a dump

Export the snapshot as HTML files + metadata (URL, timestamp, collection attributes) to Amazon S3, Azure Blob Storage, or via webhook
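Supporting three delivery targets usually means normalizing them behind one config shape. This sketch shows one way to do that; the destination names and option keys (`bucket`, `container`, `url`) are hypothetical, chosen only to mirror the three options above.

```python
def delivery_config(destination, **opts):
    """Build a delivery target for a finished dump (hypothetical schema).

    Supports the three delivery options described above:
    Amazon S3, Azure Blob Storage, or a webhook.
    """
    if destination == "s3":
        return {"type": "s3", "bucket": opts["bucket"],
                "prefix": opts.get("prefix", "")}
    if destination == "azure_blob":
        return {"type": "azure_blob", "container": opts["container"]}
    if destination == "webhook":
        return {"type": "webhook", "url": opts["url"]}
    raise ValueError(f"unsupported destination: {destination}")

cfg = delivery_config("s3", bucket="my-dumps", prefix="crawl-2024/")
```

Validating the destination up front, rather than at delivery time, fails fast before a large export is queued.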