Report № DA-04 · Data Engineering · Sub-pattern · Data lake to lakehouse · 04 / 12

Lakehouse Migration for an Organization Whose Data Lake Had Become a Data Swamp

Years of files, no governance, no schema registry, no idea what was where.

§ Client

A medium-sized enterprise that had adopted a data lake architecture several years earlier with the intent of consolidating data across the organization for analytics. The implementation had grown organically — teams dropped files into the lake when convenient, schemas evolved without coordination, and discovery had degraded to the point where analysts spent significant time finding data before they could analyze it.

§ Problem

The data lake had become unsearchable. Multiple copies of the same data existed under different paths because teams that needed a dataset could not find the copy that was already there. Analytics queries against the lake were slow and unpredictable because the underlying file layouts had been optimized for nothing in particular. Governance (who owned what, who could access what, what counted as the authoritative version) had been deferred, and the deferral had become structural.

Leadership had concluded that the lake architecture was not fundamentally wrong, but that the implementation needed to be rebuilt as a managed lakehouse rather than continue as an unstructured file dump.

§ Engagement

A staged migration to a Delta Lake-based lakehouse architecture.

A discovery exercise mapped what was actually in the lake. Datasets were identified, owners were assigned where ownership had been ambiguous, and the authoritative version of each dataset was designated where multiple copies existed. The non-authoritative copies were tagged for eventual deletion.
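One way to picture the duplicate-resolution step is a content-fingerprint pass over the lake. The sketch below is illustrative only, not the tooling used in the engagement: the `LakeFile` record, the paths, and the tie-break rule (prefer a copy with a known owner, then the shorter path) are all assumptions. It also only catches byte-identical copies; copies that drifted after being duplicated would need schema-level or row-level comparison.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LakeFile:
    """Hypothetical inventory record for one file found during discovery."""
    path: str
    content: bytes          # in practice, streamed from object storage
    owner: Optional[str]    # None where ownership was ambiguous


def content_fingerprint(data: bytes) -> str:
    """SHA-256 of the file bytes; identical bytes give identical fingerprints."""
    return hashlib.sha256(data).hexdigest()


def group_copies(files):
    """Group files by fingerprint so byte-identical copies land together."""
    groups = defaultdict(list)
    for f in files:
        groups[content_fingerprint(f.content)].append(f)
    return dict(groups)


def designate(copies):
    """Pick one authoritative copy and return (keep, tagged_for_deletion).

    Assumed rule: owned copies beat unowned ones, then shorter paths win,
    then alphabetical order breaks any remaining tie.
    """
    ranked = sorted(copies, key=lambda f: (f.owner is None, len(f.path), f.path))
    return ranked[0], ranked[1:]


if __name__ == "__main__":
    inventory = [
        LakeFile("raw/sales/2021/orders.csv", b"id,amt\n1,10\n", "sales-eng"),
        LakeFile("tmp/bob/orders_copy.csv", b"id,amt\n1,10\n", None),
        LakeFile("raw/hr/people.csv", b"id,name\n", "hr"),
    ]
    for copies in group_copies(inventory).values():
        keep, retire = designate(copies)
        print(keep.path, "<- authoritative;", [f.path for f in retire], "tagged")
```

Tagging rather than deleting the non-authoritative copies matches the report's approach: the retirement decision stays reversible until downstream consumers have moved over.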

Data was migrated to Delta tables with appropriate partitioning. The partitioning strategy was chosen per table against the actual query patterns the team was running, not against generic recommendations. OPTIMIZE (file compaction) and Z-order (clustering related values into the same files) operations were scheduled for the tables where they materially helped.
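A minimal sketch of how a per-table layout might be derived from observed query filters follows. The heuristic is an assumption for illustration, not the engagement's actual method: partition on the most frequently filtered column whose cardinality stays under a threshold, then Z-order on the next most frequently filtered columns. The table name, column names, cardinalities, and the threshold of 1000 are hypothetical. The emitted statement uses Delta Lake's `OPTIMIZE ... ZORDER BY` SQL syntax.

```python
from collections import Counter


def rank_filter_columns(query_filters):
    """Rank columns by how many queries filter on them (ties broken by name)."""
    counts = Counter(col for cols in query_filters for col in set(cols))
    return [col for col, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def choose_layout(query_filters, cardinality,
                  max_partition_cardinality=1000, max_zorder_cols=2):
    """Return (partition_column, zorder_columns) for one table.

    Assumed heuristic: high-cardinality columns are poor partition keys
    (too many small files), so they are left for Z-ordering instead.
    """
    ranked = rank_filter_columns(query_filters)
    partition = next(
        (c for c in ranked if cardinality[c] <= max_partition_cardinality), None
    )
    # A partition column is excluded from the Z-order list.
    zorder = [c for c in ranked if c != partition][:max_zorder_cols]
    return partition, zorder


def optimize_statement(table, zorder_cols):
    """Build the Delta Lake maintenance statement for a table."""
    stmt = f"OPTIMIZE {table}"
    if zorder_cols:
        stmt += f" ZORDER BY ({', '.join(zorder_cols)})"
    return stmt


if __name__ == "__main__":
    # Hypothetical filter columns extracted from a query log, one list per query.
    filters = [
        ["event_date", "customer_id"],
        ["event_date"],
        ["event_date", "region"],
        ["customer_id"],
    ]
    distinct_values = {"event_date": 730, "customer_id": 2_000_000, "region": 12}
    partition, zorder = choose_layout(filters, distinct_values)
    print("partition by:", partition)
    print(optimize_statement("sales.events", zorder))
```

The point of the sketch is the input, not the threshold: the layout is driven by what the query log shows, which is what distinguishes a per-table decision from a generic recommendation.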

A catalog layer — Unity Catalog — was implemented across the lakehouse. Datasets became discoverable through the catalog rather than through tribal knowledge. Access control was unified at the catalog layer rather than scattered across the underlying storage layer.
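To illustrate what catalog-level access control can look like, the sketch below turns a declarative access map into Unity Catalog `GRANT` statements over the three-level `catalog.schema.table` namespace. The catalog, schema, table, and group names are hypothetical, and read-only access is assumed for every group; the report does not describe the actual privilege model used.

```python
def grant_statements(access):
    """Expand {"catalog.schema.table": [groups]} into GRANT statements.

    Reading a table in Unity Catalog also requires USE CATALOG and
    USE SCHEMA on the parents, so those grants are emitted once per group.
    """
    statements = []
    seen = set()
    for full_name in sorted(access):
        catalog, schema, _table = full_name.split(".")
        for group in sorted(access[full_name]):
            candidates = (
                f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
                f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{group}`",
                f"GRANT SELECT ON TABLE {full_name} TO `{group}`",
            )
            for stmt in candidates:
                if stmt not in seen:  # skip parent grants already emitted
                    seen.add(stmt)
                    statements.append(stmt)
    return statements


if __name__ == "__main__":
    # Hypothetical access map: which groups may read which tables.
    access_map = {
        "main.sales.orders": ["analysts"],
        "main.sales.customers": ["analysts", "finance"],
    }
    for statement in grant_statements(access_map):
        print(statement + ";")
```

Keeping the access map as data means it can be reviewed and versioned like any other governance artifact, rather than living as ad hoc permissions on storage paths.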

Data quality monitoring was instrumented at table boundaries. Tables that mattered had quality expectations defined and validated continuously; quality regressions surfaced through the catalog rather than being discovered downstream by analysts producing wrong reports.
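A quality expectation at a table boundary can be as simple as a named row-level predicate plus a tolerated failure rate. The sketch below is a plain-Python illustration of that idea, not the monitoring stack used in the engagement; the `Expectation` type, the column names, and the thresholds are assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expectation:
    """Hypothetical expectation: a row predicate and a tolerated failure rate."""
    name: str
    check: Callable[[dict], bool]
    max_failure_rate: float = 0.0


def validate(rows, expectations):
    """Evaluate every expectation against a batch of rows."""
    total = len(rows)
    report = {}
    for exp in expectations:
        failures = sum(1 for row in rows if not exp.check(row))
        rate = failures / total if total else 0.0
        report[exp.name] = {
            "failures": failures,
            "rate": rate,
            "passed": rate <= exp.max_failure_rate,
        }
    return report


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "amount": 10.0},
        {"order_id": 2, "amount": -5.0},
        {"order_id": None, "amount": 7.5},
        {"order_id": 4, "amount": 3.0},
    ]
    expectations = [
        Expectation("order_id_not_null", lambda r: r.get("order_id") is not None),
        Expectation(
            "amount_non_negative",
            lambda r: r.get("amount") is not None and r["amount"] >= 0,
            max_failure_rate=0.5,
        ),
    ]
    for name, result in validate(batch, expectations).items():
        print(name, result)
```

Running a check like this on every write, and publishing the result alongside the table, is what lets a regression surface at the boundary instead of in an analyst's report.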

Governance practices were documented and adopted. Dataset ownership, versioning, retirement, and access requests had defined paths rather than being negotiated case by case.

§ Outcome

Discovery time for data dropped substantially. Query performance against the most-used tables improved noticeably. Multiple-copy duplication declined as teams started finding what already existed before creating new versions. Governance became operational rather than aspirational. The lake stopped being a swamp.

End of Report
