What it does

Instead of running your own crawlers, you search the archive, filter for what you need (by time range, domain, URL pattern, language, or blocking signals), and export the results as ready-to-use datasets: HTML files + metadata.

Common use cases

  • LLM training and RAG pipelines: Build or refresh training corpora from targeted web segments
  • Search and indexing: Backfill indexes with historical content across large domain sets
  • Search product augmentation: Improve coverage for sites with advanced blocking, supporting reliable page retrieval at scale

How it works

Run a search

Filter by time range, domains, URL patterns, language, or signals (CAPTCHA, robots blocks, etc.)
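A filter set like this is easy to assemble programmatically. The sketch below is a minimal, hypothetical example in Python; the field names (`time_range`, `domains`, `signals`, and so on) are illustrative assumptions, not the product's actual request schema.

```python
from datetime import date

def build_search_filters(start, end, domains=None, url_pattern=None,
                         language=None, signals=None):
    """Assemble a search filter spec (hypothetical schema, for illustration)."""
    filters = {"time_range": {"from": start.isoformat(), "to": end.isoformat()}}
    if domains:
        filters["domains"] = sorted(domains)        # e.g. {"example.com"}
    if url_pattern:
        filters["url_pattern"] = url_pattern        # e.g. "/blog/*"
    if language:
        filters["language"] = language              # e.g. "en"
    if signals:
        filters["signals"] = sorted(signals)        # e.g. {"captcha", "robots_block"}
    return filters

spec = build_search_filters(
    date(2024, 1, 1), date(2024, 6, 30),
    domains={"example.com"}, language="en", signals={"captcha"},
)
```

Keeping the filter spec as plain data like this makes it easy to version alongside the rest of a pipeline.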

Review the estimate

See matched file count, snapshot size, expected duration, and cost
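Because the estimate arrives before any cost is incurred, it is a natural gate in an automated pipeline. The following is a hypothetical sketch: the estimate field names (`matched_files`, `snapshot_bytes`, `expected_duration_s`, `cost_usd`) are assumptions standing in for whatever the real response contains.

```python
def summarize_estimate(estimate, budget_usd):
    """Summarize a pre-run estimate and decide whether to proceed.

    `estimate` is a dict with illustrative (assumed) field names.
    """
    size_gb = estimate["snapshot_bytes"] / 1e9
    return {
        "files": estimate["matched_files"],
        "size_gb": round(size_gb, 2),
        "eta_minutes": estimate["expected_duration_s"] // 60,
        "proceed": estimate["cost_usd"] <= budget_usd,  # simple budget gate
    }

summary = summarize_estimate(
    {"matched_files": 120_000, "snapshot_bytes": 48_000_000_000,
     "expected_duration_s": 5400, "cost_usd": 95.0},
    budget_usd=100.0,
)
```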

Create and deliver a dump

Export the snapshot as HTML files + metadata (URL, timestamp, collection attributes) to Amazon S3, Azure Blob Storage, or via webhook
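Supporting three delivery targets usually means normalizing them behind one config shape. This sketch shows one way to do that; the destination names and option keys (`bucket`, `container`, `url`) are hypothetical, chosen only to mirror the three options above.

```python
def delivery_config(destination, **opts):
    """Build a delivery target for a finished dump (hypothetical schema).

    Supports the three delivery options described above:
    Amazon S3, Azure Blob Storage, or a webhook.
    """
    if destination == "s3":
        return {"type": "s3", "bucket": opts["bucket"],
                "prefix": opts.get("prefix", "")}
    if destination == "azure_blob":
        return {"type": "azure_blob", "container": opts["container"]}
    if destination == "webhook":
        return {"type": "webhook", "url": opts["url"]}
    raise ValueError(f"unsupported destination: {destination}")

cfg = delivery_config("s3", bucket="my-dumps", prefix="crawl-2024/")
```

Validating the destination up front, rather than at delivery time, fails fast before a large export is queued.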