Report № DA-01 · Data Engineering · Sub-pattern · ETL orchestration · 01 / 12

ETL Pipeline Rebuild for a Reporting Layer That Was Always Stale

The overnight job ran for six hours. The data was stale by lunch.

§ Client

A mid-stage operating company with a reporting layer used by leadership and customer-facing teams. Data was sourced from multiple internal systems and consolidated into a warehouse via a long-running overnight ETL job that had been written years earlier and modified iteratively since.

§ Problem

The overnight job had grown to a six-hour run. By the time leadership reviewed reports in the morning, the underlying data was already several hours old. By midday it had aged enough that decisions based on it carried a real risk of being wrong. Customer-facing teams who needed near-current data routinely bypassed the reporting layer, querying source systems directly with whatever ad-hoc tooling they could assemble. The reporting infrastructure existed but wasn't being used for the decisions it had been built to support.

§ Engagement

A rebuilt ETL pipeline structured for incremental processing rather than nightly bulk reload.

The pipeline was redesigned around Apache Airflow, with DAGs decomposed by source system and by data domain. Where the previous job had been a monolithic sequence, the new pipeline consisted of many parallel and pipelined stages with explicit dependencies. Stages that could run independently did so. Stages that depended on each other ran as soon as their dependencies completed rather than waiting for an end-of-pipeline marker.
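The scheduling idea described above can be sketched without Airflow at all. The example below is an illustration using only the Python standard library, not the engagement's DAG code and not Airflow's API; the stage names and the `run_stage` placeholder are hypothetical. It shows each stage starting as soon as its own dependencies finish, with independent stages running in parallel.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the set of stages it depends on.
STAGES = {
    "ingest_crm": set(),
    "ingest_billing": set(),
    "transform_customers": {"ingest_crm"},
    "transform_invoices": {"ingest_billing"},
    "report_revenue": {"transform_customers", "transform_invoices"},
}


def run_stage(name):
    # Placeholder for real work (extract, load, transform).
    return name


def run_pipeline(stages, max_workers=4):
    """Run every stage, starting each one as soon as its dependencies are done."""
    sorter = TopologicalSorter(stages)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    completed = []    # stage names in the order they finished
    running = {}      # future -> stage name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit everything whose dependencies are already satisfied.
            for name in sorter.get_ready():
                running[pool.submit(run_stage, name)] = name
            # Block until at least one running stage finishes.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()    # re-raise any failure from the stage
                sorter.done(name)  # unlocks stages that were waiting on it
                completed.append(name)
    return completed


order = run_pipeline(STAGES)
```

In this sketch `transform_customers` does not wait for `ingest_billing`; it becomes ready the moment `ingest_crm` is marked done. An orchestrator such as Airflow adds retries, scheduling, and observability on top of the same dependency model.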

Source ingestion moved from full-table snapshots to change-data-capture where the source systems supported it and to incremental queries with high-watermark tracking where they didn't. Data volume moving through the pipeline per run dropped substantially because the pipeline was processing changes rather than full datasets.
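A minimal sketch of the high-watermark approach, assuming a source table with an `updated_at` column stored as ISO 8601 text. The table name `orders`, its columns, and the in-memory `state` dictionary are all hypothetical; a real pipeline would persist the watermark somewhere durable.

```python
import sqlite3


def extract_incremental(conn, table, state):
    """Return rows changed since the last run and advance the stored watermark.

    `table` is interpolated into the SQL, so it must come from trusted
    configuration, never from user input.
    """
    last_seen = state.get(table, "")  # empty string sorts before any timestamp
    rows = conn.execute(
        f"SELECT id, payload, updated_at FROM {table} "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    if rows:
        # The rows are ordered, so the last one carries the new watermark.
        state[table] = rows[-1][2]
    return rows


# Hypothetical source system, simulated with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", "2024-01-01T00:00:00"), (2, "b", "2024-01-01T01:00:00")],
)

state = {}
first = extract_incremental(conn, "orders", state)   # initial run: both rows

conn.execute(
    "UPDATE orders SET payload = 'a2', updated_at = '2024-01-01T02:00:00' "
    "WHERE id = 1"
)
second = extract_incremental(conn, "orders", state)  # only the changed row
```

One design caveat: the strict `>` comparison can miss a row committed later with the same timestamp as the watermark. Pipelines that cannot tolerate that usually re-read a small overlap window and deduplicate by key.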

The transformation layer was rewritten in dbt against the warehouse, with models structured for incremental materialization. Each model declared what it depended on and how it should be rebuilt; dbt handled the dependency resolution and incremental logic. The team gained a transformation layer they could reason about as code rather than as a sequence of stored procedures.
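The incremental logic that dbt takes care of can be illustrated by hand. The sketch below is plain Python and SQLite, not dbt code and not the engagement's models; the `stg_orders` and `order_facts` tables and their columns are hypothetical. It shows the general pattern: look at what the target already holds, transform only newer staging rows, and upsert them by key.

```python
import sqlite3


def build_incremental(conn):
    """Refresh order_facts from stg_orders, touching only new or changed rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS order_facts ("
        "order_id INTEGER PRIMARY KEY, amount_cents INTEGER, updated_at TEXT)"
    )
    # The newest timestamp already in the target marks where this run starts.
    (high_water,) = conn.execute(
        "SELECT COALESCE(MAX(updated_at), '') FROM order_facts"
    ).fetchone()
    changed = conn.execute(
        "SELECT order_id, amount, updated_at FROM stg_orders WHERE updated_at > ?",
        (high_water,),
    ).fetchall()
    # Transformation step: convert the amount to integer cents.
    transformed = [(oid, round(amount * 100), ts) for oid, amount, ts in changed]
    # Upsert by primary key so a changed order replaces its earlier version.
    conn.executemany("INSERT OR REPLACE INTO order_facts VALUES (?, ?, ?)", transformed)
    return len(transformed)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stg_orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"), (2, 25.5, "2024-01-01T01:00:00")],
)
first_run = build_incremental(conn)   # full build: target was empty

conn.execute(
    "UPDATE stg_orders SET amount = 12.0, updated_at = '2024-01-01T02:00:00' "
    "WHERE order_id = 1"
)
conn.execute("INSERT INTO stg_orders VALUES (3, 5.0, '2024-01-01T03:00:00')")
second_run = build_incremental(conn)  # incremental: one update, one new row

facts = conn.execute(
    "SELECT order_id, amount_cents FROM order_facts ORDER BY order_id"
).fetchall()
```

The second run leaves order 2 untouched, which is where the saving over a full rebuild comes from.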

Data quality checks were instrumented at each pipeline boundary. Issues surfaced at the boundary they were introduced rather than propagating downstream into reports nobody could trust.
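As an illustration of a boundary check, the sketch below validates a batch before handing it to the next stage and fails loudly, naming the stage where the problem appeared. The specific rules (non-empty batch, required fields present, unique key) and all names here are assumptions chosen for the example, not the checks used in the engagement.

```python
class DataQualityError(Exception):
    """Raised when a batch fails validation at a pipeline boundary."""


def check_boundary(stage, rows, key, required):
    """Validate rows leaving `stage`; return them unchanged if they pass."""
    problems = []
    if not rows:
        problems.append("no rows received")
    seen = set()
    for index, row in enumerate(rows):
        # Every required field must be present and non-null.
        for field in required:
            if row.get(field) is None:
                problems.append(f"row {index}: {field} is null")
        # The key must be unique within the batch.
        value = row.get(key)
        if value in seen:
            problems.append(f"row {index}: duplicate {key}={value!r}")
        seen.add(value)
    if problems:
        # Fail here, at the boundary, instead of letting bad rows flow on.
        raise DataQualityError(f"{stage}: " + "; ".join(problems))
    return rows


good_batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
]
bad_batch = [
    {"id": 1, "email": "a@example.com"},
    {"id": 1, "email": None},
]

passed = check_boundary("ingest_crm", good_batch, key="id", required=["id", "email"])
```

Calling `check_boundary("ingest_crm", bad_batch, key="id", required=["id", "email"])` raises `DataQualityError` with a message that names the stage and lists both the null `email` and the duplicate `id`.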

The pipeline's run frequency moved from nightly to hourly for most domains and to near-real-time for the small number of domains where freshness genuinely mattered.

§ Outcome

Reporting freshness moved from "yesterday plus a few hours of staleness" to "current within the hour" for most data and "current within minutes" for the data that needed it. The customer-facing teams who had been working around the reporting layer started using it again. Leadership decisions based on morning reports stopped carrying the staleness risk that had been quietly affecting them.

The architectural change, incremental processing over bulk reload, was the leverage point. The tooling change followed from it.

End of Report
