AI / ML training-data egress cost
Public training datasets live on HuggingFace, Common Crawl (S3 us-east-1), Azure Open Datasets, Google Cloud Storage (including requester-pays buckets), and the AWS Open Data program. Pulling them into your training cluster involves source-side egress (usually free for these public datasets) and your own cloud's ingress (always free). The real cost appears when you re-egress the data across regions or accounts.
Major dataset sources
| Dataset | Approx size | Host | Cost to pull |
|---|---|---|---|
| Common Crawl (one snapshot) | ~400 TB raw / ~90 TB WET | AWS S3 us-east-1 (Open Data) | Free from S3 to EC2 in us-east-1; $0.02/GB to another AWS region; $0.09/GB out to another cloud. |
| LAION-5B (image URL index) | ~80 GB metadata; images fetched from URLs | HuggingFace / academic mirrors | Metadata download is free. Images are fetched over HTTP from their original hosts, so you pay in request time and your own bandwidth, but no cloud egress fee. |
| ImageNet (ILSVRC2012) | ~150 GB | image-net.org / Kaggle / academic | Free download. Re-uploading to GCS or S3 in your training region counts as ingress, which is free. |
| The Pile (uncopyrighted text) | ~825 GB compressed | academic + HuggingFace | Free academic mirrors; HF mirror is also free download. |
| HuggingFace datasets (general) | varies | HF (Cloudflare R2 backed) | HF egress is free at source (R2). Ingress to your cloud is free. Effectively $0. |
| WMT translation corpora | ~50 GB compressed | statmt.org + academic mirrors | Free. |
| AWS Open Data Program (general) | petabytes total | S3 us-east-1 / us-west-2 | Same-region S3 to EC2 is free; cross-region or out-of-AWS transfers are billed at standard egress rates. |
Worked example: train on Common Crawl from outside us-east-1
You want to train a 7B LLM on the WET subset of one Common Crawl snapshot (~90 TB). The data is in S3 us-east-1.
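The arithmetic for this scenario can be sketched in a few lines. This is a rough estimate, not a billing calculator: it uses the flat per-GB rates from the table above, assumes decimal units (1 TB = 1,000 GB), and ignores any volume tiers or discounts. The names `RATES_USD_PER_GB` and `egress_cost_usd` are made up for illustration.

```python
# Flat per-GB rates taken from the table above (USD).
# Assumption: no volume tiers or negotiated discounts are applied.
RATES_USD_PER_GB = {
    "same_region": 0.00,   # S3 us-east-1 -> EC2 us-east-1
    "cross_region": 0.02,  # S3 us-east-1 -> another AWS region
    "other_cloud": 0.09,   # S3 us-east-1 -> outside AWS
}


def egress_cost_usd(size_tb: float, destination: str) -> float:
    """Estimate the one-time cost of pulling size_tb terabytes to a destination."""
    size_gb = size_tb * 1000  # assumption: decimal terabytes
    return size_gb * RATES_USD_PER_GB[destination]


WET_SUBSET_TB = 90  # WET subset of one Common Crawl snapshot

for destination in RATES_USD_PER_GB:
    cost = egress_cost_usd(WET_SUBSET_TB, destination)
    print(f"{destination:>12}: ${cost:,.0f}")
```

Under these assumptions the 90 TB pull costs nothing in us-east-1, about $1,800 to another AWS region, and about $8,100 to another cloud. The bill repeats each time the full dataset is pulled again.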
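Lesson: train where the data lives. Common Crawl, Pile, and most ML benchmarks have AWS us-east-1 mirrors. Putting your GPU cluster there avoids the transfer bill entirely. At the rates in the table above, that is roughly $1,800 (cross-region) or $8,100 (to another cloud) for the 90 TB WET subset, and five figures if you pull the ~400 TB raw snapshot out of AWS.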
HuggingFace is the unsung egress hero
HuggingFace switched its model and dataset storage to Cloudflare R2 in 2023. R2 charges no egress fees, so pulling a 1 TB dataset or model checkpoint from HuggingFace into your cloud costs neither HuggingFace nor you anything in egress. Compare that with pulling a model snapshot from your own S3 bucket to an instance outside AWS: at $0.09/GB, 1 TB costs $90 or more every time you re-pull on a fresh instance.
Practical implication: when you publish a model or dataset for internal team use, hosting it on HF (in a private repo if needed) costs nothing for downloads. Hosting it on S3 costs you every time someone re-pulls it from outside AWS. For an ML team with weekly fresh-instance training runs, this can add up to hundreds of dollars per month in saved egress.
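To put a number on the recurring cost, here is a minimal sketch of the monthly bill for re-pulling an artifact on fresh instances. It assumes a flat $0.09/GB rate for S3 pulls that leave AWS, $0/GB for HuggingFace downloads as described above, decimal terabytes, and an average of 52/12 weeks per month. The function name `monthly_repull_cost_usd` is hypothetical.

```python
WEEKS_PER_MONTH = 52 / 12  # average number of weeks in a month


def monthly_repull_cost_usd(size_tb: float, pulls_per_week: float,
                            rate_usd_per_gb: float) -> float:
    """Estimate monthly egress spend for repeatedly pulling one artifact."""
    size_gb = size_tb * 1000  # assumption: decimal terabytes
    return size_gb * pulls_per_week * WEEKS_PER_MONTH * rate_usd_per_gb


# One 1 TB checkpoint, pulled once a week onto a fresh instance.
s3_monthly = monthly_repull_cost_usd(1, 1, 0.09)  # S3 to outside AWS (assumed flat rate)
hf_monthly = monthly_repull_cost_usd(1, 1, 0.00)  # HuggingFace download

print(f"S3: ${s3_monthly:,.0f}/month")
print(f"HF: ${hf_monthly:,.0f}/month")
```

With one weekly pull of 1 TB, the S3 figure comes to about $390 per month. It scales linearly with artifact size and pull frequency, so several team members or larger checkpoints multiply it accordingly.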
See also: LLM inference egress, Cloudflare R2 zero egress.