Buying a Data Lake

Buyer's guide.

Overview

A data lake stores raw and semi-structured data cheaply, ahead of the schema decisions that downstream pipelines will eventually require. Choosing one comes down mostly to three things: the table format (Iceberg, Delta Lake, or Hudi), the object-storage backend, and which compute engines can read the lake without an ETL hop.

Open table formats. Iceberg, Delta Lake, and Hudi all add ACID semantics and time travel to object storage; pick one that your downstream engines support.
Storage backend. S3, GCS, ADLS, or on-prem MinIO. Each prices storage, requests, and egress differently, and egress is the silent killer of multi-cloud lake plans.
Compute interop. Spark, Trino, Presto, Athena, and warehouse engines should all read the same tables; tools that require ETL into a vendor format become lock-in.
Per-team decision and exit cost. Open table formats keep the door open; proprietary formats turn the lake into a vendor moat.
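
The engine-support check described above can be sketched as a simple set comparison. This is a minimal illustration, not a statement about any real product: the function name formats_readable_by_all is hypothetical, and the format and engine names in the matrix are deliberate placeholders that a buyer would replace with entries verified against each engine's current documentation.

```python
from typing import Dict, List, Set


def formats_readable_by_all(
    support: Dict[str, Set[str]], required_engines: Set[str]
) -> List[str]:
    """Return the table formats that every required engine can read.

    `support` maps a table format to the set of engines verified to read it
    without an ETL hop. The caller is responsible for filling it in.
    """
    return sorted(
        fmt for fmt, engines in support.items() if required_engines <= engines
    )


# Placeholder matrix (hypothetical names, not real support claims).
support_matrix = {
    "format_a": {"engine_1", "engine_2", "engine_3"},
    "format_b": {"engine_1", "engine_2"},
    "format_c": {"engine_1"},
}

# Engines the team needs to keep pointing at the same tables.
required = {"engine_1", "engine_2"}

print(formats_readable_by_all(support_matrix, required))
# prints ['format_a', 'format_b']
```

An empty result is the useful signal here: it means no single format satisfies every engine on the list, so either the engine list or the no-ETL requirement has to give.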

The approach

Match the choice to your cloud gravity (where most of your data and workloads already live), your dominant compute engine, and how much lock-in you are willing to accept. Lakes outlive most of the analytics tools that read from them; pick deliberately.

Table-format selection. Iceberg for broad engine support, Delta Lake for tight Databricks integration, Hudi for streaming-heavy workloads.
Storage and compute alignment. S3 plus Athena, GCS plus BigQuery external tables, ADLS plus Synapse. Cross-cloud setups work, but egress makes them expensive.
Total cost of ownership model. Project storage, query, metadata, and egress costs over 12 months at expected volume.
Document the choice and the exit ramp. Capture rationale and how data would migrate if the dominant compute engine changed.
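
The total cost of ownership projection can be sketched in a few lines. This is a minimal model under stated assumptions, not a pricing calculator: the class and function names are hypothetical, every price below is a made-up placeholder rather than any vendor's rate, and the model assumes that storage, scanned data, and egress all grow at the same monthly rate while metadata cost stays flat.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class LakeCostAssumptions:
    """Inputs for the projection. All values are supplied by the buyer."""
    storage_tb: float                # data at rest in month 1
    storage_usd_per_tb_month: float  # placeholder price
    scanned_tb_per_month: float      # data scanned by queries in month 1
    query_usd_per_tb_scanned: float  # placeholder price
    metadata_usd_per_month: float    # catalog and metadata services, flat
    egress_tb_per_month: float       # data leaving the cloud in month 1
    egress_usd_per_tb: float         # placeholder price
    monthly_growth: float = 0.0      # e.g. 0.05 for 5% growth per month


def project_tco(a: LakeCostAssumptions, months: int = 12) -> List[Dict[str, float]]:
    """Return one cost row per month, split by the four cost components."""
    rows = []
    for month in range(1, months + 1):
        # Assumption: all volume-driven costs grow at the same compound rate.
        factor = (1.0 + a.monthly_growth) ** (month - 1)
        row = {
            "month": float(month),
            "storage": a.storage_tb * factor * a.storage_usd_per_tb_month,
            "query": a.scanned_tb_per_month * factor * a.query_usd_per_tb_scanned,
            "metadata": a.metadata_usd_per_month,
            "egress": a.egress_tb_per_month * factor * a.egress_usd_per_tb,
        }
        row["total"] = row["storage"] + row["query"] + row["metadata"] + row["egress"]
        rows.append(row)
    return rows


def total_cost(rows: List[Dict[str, float]]) -> float:
    """Sum the monthly totals over the projection window."""
    return sum(row["total"] for row in rows)


# Illustrative inputs only; replace with quoted prices and measured volumes.
assumptions = LakeCostAssumptions(
    storage_tb=100.0,
    storage_usd_per_tb_month=20.0,
    scanned_tb_per_month=50.0,
    query_usd_per_tb_scanned=5.0,
    metadata_usd_per_month=100.0,
    egress_tb_per_month=10.0,
    egress_usd_per_tb=90.0,
    monthly_growth=0.05,
)

projection = project_tco(assumptions)
print(f"12-month projected cost: ${total_cost(projection):,.0f}")
```

Keeping the four components separate in each row makes it easy to see which one dominates. Running the same projection with egress_tb_per_month set to zero gives a quick read on how much of the total is attributable to cross-cloud traffic.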

Why this compounds

The right data lake keeps paying back: ingestion stays cheap, downstream pipelines stop fighting schema, and analytics decisions stop waiting on infrastructure.

Operational fit. Open formats keep new engines interoperable; closed formats compound technical debt.
Cost efficiency. Cheap storage plus pay-per-query compute keeps cost roughly proportional to actual usage rather than to provisioned capacity.
Engineering culture. One lake serves analytics, ML, and operational data; teams stop maintaining parallel copies.
Decision trail for the next renewal. The documented rationale and cost model become the renewal scorecard, so the next review does not start cold.