Buying a Data Lake
Buyer's guide.
Overview
A data lake stores raw and semi-structured data cheaply, ahead of the schema decisions that downstream pipelines will eventually require. Choosing one comes down mostly to the table format (Iceberg, Delta, Hudi), the object-storage backend, and which compute engines can read the lake without an ETL hop.
- Open table formats. Iceberg, Delta Lake, and Hudi all add ACID semantics and time travel to object storage; pick one that your downstream engines support.
- Storage backend. S3, GCS, ADLS, or on-prem MinIO. Each prices storage, requests, and transfer differently; egress is the silent killer of multi-cloud lake plans.
- Compute interop. Spark, Trino, Presto, Athena, and warehouse engines should all read the same tables; tools that require ETL into a vendor format become lock-in.
- Per-team decision and exit cost. Open table formats keep the door open; proprietary formats turn the lake into a vendor moat.
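The "pick one your downstream engines support" check can be sketched as a set intersection. This is a minimal illustration, not a vendor compatibility list: the engine and format names in the example matrix are hypothetical placeholders, and the real matrix has to come from each engine's current documentation.

```python
def formats_supported_by_all(required_engines, support_matrix):
    """Return the table formats that every required engine can read.

    support_matrix maps an engine name to the formats it supports.
    An engine missing from the matrix is treated as supporting nothing.
    """
    candidates = None
    for engine in required_engines:
        supported = set(support_matrix.get(engine, ()))
        candidates = supported if candidates is None else candidates & supported
    return sorted(candidates or [])


# Hypothetical placeholder matrix; fill it in from your own engine docs.
example_matrix = {
    "engine_a": ["format_x", "format_y"],
    "engine_b": ["format_x", "format_z"],
    "engine_c": ["format_x", "format_y", "format_z"],
}

shortlist = formats_supported_by_all(
    ["engine_a", "engine_b", "engine_c"], example_matrix
)
```

An empty shortlist means at least one engine would need an ETL hop into another format, which is the lock-in signal described above.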
The approach
Match the choice to the cloud gravity, the dominant compute engine, and the openness budget. Lakes outlive most of the analytics tools that read from them; pick deliberately.
- Table-format selection. Iceberg for broad engine support, Delta for tight Databricks integration, Hudi for streaming-heavy workloads.
- Storage and compute alignment. S3 plus Athena, GCS plus BigQuery external tables, ADLS plus Synapse. Cross-cloud setups are workable but incur egress charges whenever data leaves the provider that stores it.
- Total cost of ownership model. Project storage, query, metadata, and egress costs over 12 months at expected volume.
- Document the choice and the exit ramp. Capture rationale and how data would migrate if the dominant compute engine changed.
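The 12-month projection can be sketched as a small model. This is one possible shape under stated assumptions: every rate and volume is a hypothetical input to be replaced with figures from your provider's price sheet, growth is assumed linear, and query and egress volumes are assumed flat from month to month.

```python
from dataclasses import dataclass


@dataclass
class LakeCostAssumptions:
    # All rates are hypothetical inputs, not real provider prices.
    storage_usd_per_tb_month: float
    query_usd_per_tb_scanned: float
    metadata_usd_per_month: float
    egress_usd_per_tb: float


def project_tco(costs, start_tb, monthly_growth_tb,
                scanned_tb_per_month, egress_tb_per_month, months=12):
    """Return one cost row per month for the projection window."""
    rows = []
    stored_tb = start_tb
    for month in range(1, months + 1):
        storage = stored_tb * costs.storage_usd_per_tb_month
        query = scanned_tb_per_month * costs.query_usd_per_tb_scanned
        egress = egress_tb_per_month * costs.egress_usd_per_tb
        metadata = costs.metadata_usd_per_month
        rows.append({
            "month": month,
            "storage": storage,
            "query": query,
            "metadata": metadata,
            "egress": egress,
            "total": storage + query + metadata + egress,
        })
        stored_tb += monthly_growth_tb  # assumed linear growth
    return rows


def total_tco(rows):
    """Sum the monthly totals into a single projected figure."""
    return sum(row["total"] for row in rows)
```

Running the same model once per candidate stack, with only the rates changed, gives comparable totals and makes the egress line visible before it shows up on an invoice.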
Why this compounds
The right data lake keeps paying back: ingestion stays cheap, downstream pipelines stop fighting schema, and analytics decisions stop waiting on infrastructure.
- Operational fit. Open formats keep new engines interoperable; closed formats compound technical debt.
- Cost efficiency. Cheap object storage plus pay-per-query compute keeps spend roughly proportional to the data stored and scanned.
- Engineering culture. One lake serves analytics, ML, and operational data; teams stop maintaining parallel copies.
- Decision trail for the next renewal. The documented rationale and cost figures become the renewal scorecard, not a cold start.