Live pricing · verified 2026-04
AI / ML · training egress · Updated 2026-04

AI / ML training-data egress cost

Public training datasets live on HuggingFace, Common Crawl (S3 us-east-1), Azure Open Datasets, GCS (some buckets requester-pays), and AWS Open Data. Pulling them into your training cluster involves source-side egress (usually free to you for these public datasets, except where the bucket is requester-pays) and your cloud's ingress (always free). The real cost comes when you move them across regions or clouds, or re-egress them across regions or accounts afterwards.

Major dataset sources

| Dataset | Approx size | Host | Cost to pull |
|---|---|---|---|
| Common Crawl (one snapshot) | ~400 TB raw / ~90 TB WET | AWS S3 us-east-1 (Open Data) | Free from S3 to EC2 in us-east-1; cross-region within AWS: $0.02/GB; to another cloud: $0.09/GB. |
| LAION-5B (image URL index) | ~80 GB metadata; images fetched from URLs | HuggingFace / academic mirrors | Metadata download free. Image fetches cost you request time and bandwidth but no cloud egress fee (you GET each image from its original host URL). |
| ImageNet (ILSVRC2012) | ~150 GB | image-net.org / Kaggle / academic | Free download. Re-uploading to GCS / S3 in your training region counts as ingress (free). |
| The Pile (uncopyrighted text) | ~825 GB compressed | academic + HuggingFace | Free academic mirrors; HF mirror is also a free download. |
| HuggingFace datasets (general) | varies | HF (Cloudflare R2 backed) | HF egress is free at source (R2). Ingress to your cloud is free. Effectively $0. |
| WMT translation corpora | ~50 GB compressed | statmt.org + academic mirrors | Free. |
| AWS Open Data Program (general) | petabytes total | S3 us-east-1 / us-west-2 | Same-region S3 to EC2 free; cross-region or out-of-AWS transfers cost standard egress. |

Worked example: train on Common Crawl from outside us-east-1

You want to train a 7B LLM on the WET subset of one Common Crawl snapshot (~90 TB). The data is in S3 us-east-1.

| Pull pattern | 90 TB cost |
|---|---|
| Spin up EC2 in us-east-1, S3 same-region read | $0 (egress free intra-region) |
| Spin up EC2 in us-west-2, cross-region S3 read | $1,843 (90 TB x $0.02/GB) |
| Pull from S3 us-east-1 to GCP us-central1 over internet | $7,602 (AWS internet egress, mostly $0.085 tier) |
| Pull from S3 us-east-1 to Azure East US over internet | $7,602 (same AWS egress) |
| Snowball + ship to your data centre | $320 + freight per 80 TB Snowball (90 TB exceeds one device, so plan for a second device or a trimmed subset) |
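
The flat-rate rows above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not a billing tool: the function name `transfer_cost` and the dictionary labels are hypothetical, and it assumes 1 TB = 1024 GB, which is the convention that reproduces the $1,843 cross-region figure. The per-GB rates are the ones quoted in the dataset table.

```python
# Assumption: binary terabytes (1 TB = 1024 GB), chosen because it
# reproduces the $1,843 cross-region figure in the worked example.
GB_PER_TB = 1024

# Per-GB rates quoted in the dataset table; the labels are illustrative.
RATE_PER_GB = {
    "same_region": 0.00,          # S3 to EC2 inside us-east-1
    "cross_region": 0.02,         # S3 us-east-1 to another AWS region
    "internet_first_tier": 0.09,  # top internet egress rate, before volume tiers
}


def transfer_cost(size_tb: float, rate_per_gb: float) -> float:
    """Return the flat-rate transfer cost in USD for size_tb terabytes."""
    return round(size_tb * GB_PER_TB * rate_per_gb, 2)


if __name__ == "__main__":
    for path, rate in RATE_PER_GB.items():
        print(f"{path}: ${transfer_cost(90, rate):,.2f}")
```

For 90 TB this gives $1,843.20 for the cross-region read. A flat $0.09/GB gives $8,294.40, which is only an upper bound for the internet rows: the $7,602 figure is lower because internet egress is billed in cheaper volume tiers, which this sketch does not model.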

Lesson: train where the data lives. Common Crawl, Pile, and most ML benchmarks have AWS us-east-1 mirrors. Putting your GPU cluster there saves roughly $1,800 to $7,600 per 90 TB pull in the example above, and the saving repeats every time you pull again.

HuggingFace is the unsung egress hero

HuggingFace switched its model and dataset storage to Cloudflare R2 in 2023. R2 has zero egress fees, so pulling a 1 TB dataset or model checkpoint from HuggingFace into your cloud incurs no egress charge for HuggingFace or for you. Compare that with pulling a model snapshot from your own S3 bucket to a machine outside AWS: at $0.09/GB, 1 TB costs $90+ in egress every time you re-pull onto a fresh instance. (Same-region reads from S3 to EC2 stay free, as noted above.)

Practical implication: when you publish a model or dataset for internal team use, hosting it on HF (in a private repo if needed) adds no per-download egress charge. Hosting it on S3 costs you every time someone re-pulls it from outside the bucket's region. For an ML team with weekly fresh-instance training runs, that can add up to hundreds of dollars per month in saved egress.
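
To put a number on the "hundreds per month" claim, here is a rough sketch of the recurring cost of re-pulling an artifact from S3 over the internet. The function name `monthly_repull_cost` is hypothetical, the default rate is the $0.09/GB figure used above, and 1 TB = 1024 GB is an assumption. Four pulls per month stands in for "weekly".

```python
# Assumption: binary terabytes (1 TB = 1024 GB).
GB_PER_TB = 1024


def monthly_repull_cost(size_tb: float, pulls_per_month: int,
                        rate_per_gb: float = 0.09) -> float:
    """Return the monthly egress cost in USD of re-pulling one artifact.

    rate_per_gb defaults to the $0.09/GB internet egress rate quoted above;
    pass 0.0 for a zero-egress host.
    """
    return round(size_tb * GB_PER_TB * rate_per_gb * pulls_per_month, 2)


if __name__ == "__main__":
    s3_cost = monthly_repull_cost(size_tb=1, pulls_per_month=4)
    zero_egress_cost = monthly_repull_cost(size_tb=1, pulls_per_month=4,
                                           rate_per_gb=0.0)
    print(f"S3 over internet: ${s3_cost:,.2f} per month")
    print(f"Zero-egress host: ${zero_egress_cost:,.2f} per month")
```

A single 1 TB checkpoint pulled weekly comes to $368.64 per month at $0.09/GB, and the figure scales linearly with artifact size and pull frequency.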

See also: LLM inference egress, Cloudflare R2 zero egress.

Updated 2 May 2026