Data lakehouses are the talk in many organizations. Few, however, are running them confidently in production. We asked our Databricks Champion - Jacob Nurup - what actually works: how to hit parity first, raise productivity next, and keep resilience always, without blowing up governance or cost.
What's your advice for teams starting their data lakehouse journey?
Don't try to boil the ocean. Start with one high-value domain where success is measurable, maybe a critical operational dashboard or a key data product feeding downstream systems. Establish your three non-negotiables: platform guard rails, governed reuse, and CI/CD. Prove parity with your existing outputs. Then ship something real in weeks, not quarters.
Build momentum through quick wins but never compromise on governance or testing. The teams that succeed are the ones who resist the temptation to cut corners early on. Those corners become technical debt that slows you down later. Set the right foundation, one operating model, shared patterns, automated quality checks, and you can scale confidently. That's how you go from pilot to platform.
When does a data lakehouse stop being architecture and start being run-ready?
The moment parity - compared to the data warehouse - is real: same numbers, same or better SLAs, same controls. For instance, during a client project at a pension services provider, we spent the first phase proving equivalence, reconciling data between the old warehouse and new lakehouse, validating performance, and replicating access patterns and lineage. We put everything - pipelines, service endpoints, governance - on one operating model so policies, secrets, identity, and change control behave the same way everywhere. That's your stable floor. Only after we proved parity did we start optimizing for speed and new capabilities. Without that foundation, you're building on sand.
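A reconciliation check like the one described can be as simple as comparing row counts and a summed measure within a tolerance. This is a minimal in-memory sketch (the field names and tolerance are illustrative, not from the project); in practice you'd run the same idea as queries against both systems:

```python
from decimal import Decimal

def reconcile(warehouse_rows, lakehouse_rows, amount_key, tolerance=Decimal("0.01")):
    """Compare row counts and a summed measure between two systems.

    'parity' is True only when the counts match and the summed
    difference stays within the tolerance.
    """
    wh_count, lh_count = len(warehouse_rows), len(lakehouse_rows)
    wh_sum = sum(Decimal(str(r[amount_key])) for r in warehouse_rows)
    lh_sum = sum(Decimal(str(r[amount_key])) for r in lakehouse_rows)
    diff = abs(wh_sum - lh_sum)
    return {
        "count_match": wh_count == lh_count,
        "sum_diff": diff,
        "parity": wh_count == lh_count and diff <= tolerance,
    }

# Example: two extracts of the same payments feed.
old = [{"id": 1, "amount": "100.00"}, {"id": 2, "amount": "250.50"}]
new = [{"id": 1, "amount": "100.00"}, {"id": 2, "amount": "250.50"}]
print(reconcile(old, new, "amount")["parity"])  # True when the outputs agree
```

Using `Decimal` instead of floats matters here: parity claims on financial data shouldn't depend on floating-point rounding.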
What are the three non-negotiables you standardize first?
First, platform guard rails:
- Delta Lake patterns for bronze-silver-gold layers.
- Azure Data Factory and Databricks Workflows for orchestration.
- Unity Catalog as the single source of truth for access and lineage.
Everything flows through these rails, no custom pipelines that bypass governance.
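To make the layer responsibilities concrete, here is a deliberately tiny in-memory sketch of the bronze-silver-gold flow. In a real lakehouse each function would read and write Delta tables via Spark; the field names and rules below are illustrative assumptions:

```python
def to_bronze(raw_records):
    # Bronze: land data as-is, tagging each record with its source.
    return [{"source": "feed_a", **r} for r in raw_records]

def to_silver(bronze):
    # Silver: enforce types and drop records failing basic quality rules.
    silver = []
    for r in bronze:
        if r.get("customer_id") is None:
            continue  # a real pipeline would quarantine, not silently drop
        silver.append({**r, "amount": float(r["amount"])})
    return silver

def to_gold(silver):
    # Gold: aggregate into a consumption-ready metric per customer.
    totals = {}
    for r in silver:
        totals[r["customer_id"]] = totals.get(r["customer_id"], 0.0) + r["amount"]
    return totals

raw = [{"customer_id": "c1", "amount": "10.0"},
       {"customer_id": None, "amount": "99.0"},
       {"customer_id": "c1", "amount": "5.0"}]
print(to_gold(to_silver(to_bronze(raw))))  # {'c1': 15.0}
```

The point of the pattern: every quality decision has one designated layer, so a bad record can never bypass governance on its way to a dashboard.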
Second, governed reuse: data contracts between layers, shared feature tables, and curated components in a central repository. When we're building AI agents or an MDM solution, teams pull approved patterns instead of starting from scratch. This cuts delivery time dramatically.
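A data contract between layers can be enforced with a small schema check at the boundary. This is a hedged sketch, not the team's actual tooling; the contract fields are hypothetical:

```python
CONTRACT = {  # hypothetical contract for a silver-layer feed
    "required": {"customer_id": str, "amount": float, "as_of": str},
}

def violates_contract(record, contract=CONTRACT):
    """Return a list of violations; an empty list means the record conforms."""
    problems = []
    for field, expected in contract["required"].items():
        if field not in record:
            problems.append(f"missing {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} is not {expected.__name__}")
    return problems

ok = {"customer_id": "c1", "amount": 10.0, "as_of": "2024-01-31"}
print(violates_contract(ok))  # []
```

Because the contract lives in a shared repository, a downstream team consuming the feed runs the same check the producing team does.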
Third, CI/CD as the only path to production for both pipelines and services. Automated testing, approval gates, and non-regression checks by default. No manual deployments, no "we'll add tests later." This discipline is what lets you scale without breaking things.
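A non-regression gate of the kind described can be sketched as a comparison of candidate pipeline output against a golden baseline; if anything drifts, the deployment fails. The metric names are illustrative:

```python
def non_regression(baseline, candidate, rel_tolerance=0.0):
    """Return failures if any baseline metric is missing or drifts beyond tolerance."""
    failures = []
    for metric, expected in baseline.items():
        actual = candidate.get(metric)
        if actual is None:
            failures.append(f"{metric}: missing")
        elif abs(actual - expected) > abs(expected) * rel_tolerance:
            failures.append(f"{metric}: {actual} vs baseline {expected}")
    return failures

baseline = {"row_count": 350_000, "total_amount": 1_250_000.00}
candidate = {"row_count": 350_000, "total_amount": 1_250_000.00}
failures = non_regression(baseline, candidate)
assert not failures, f"Blocking deployment: {failures}"  # the CI gate itself
```

Wired into the pipeline, this assertion is what makes "no manual deployments" safe: a human never has to remember to run the comparison.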
How do you get from legacy data warehouse to first outcomes fast?
Pick one domain and prove the pattern end-to-end. When we migrated to a new platform at the pension services provider, we didn't try to lift-and-shift everything at once. We selected a few critical data feeds, established the bronze-silver-gold flow, reconciled outputs to the required tolerance levels, and delivered working dashboards and APIs. The key was automated testing; we validated transformations at every step to catch issues early.
Once the first domain is live and proven, the second and third move much faster because the patterns, pipelines, and governance are already in place. We've demonstrated migrations of roughly 350 feeds in about six months with under 2% major UAT anomalies using this approach. The secret is starting small, proving value quickly, and then replicating what works.
What keeps the platform safe and affordable as you scale?
Treat resilience and costs like safety properties, not nice-to-haves. We define SLOs for every critical job and implement unified observability, with dashboards monitoring job health, data quality, and platform performance. Disaster recovery isn't an afterthought; it's tested regularly.
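An SLO for a critical job boils down to a simple calculation over run history: how often did the job breach its target, and is that within the error budget? A minimal sketch, with illustrative numbers:

```python
def slo_report(run_minutes, slo_minutes, error_budget=0.05):
    """Share of runs breaching the latency SLO, compared to the error budget."""
    breaches = sum(1 for m in run_minutes if m > slo_minutes)
    breach_rate = breaches / len(run_minutes)
    return {"breach_rate": breach_rate, "within_budget": breach_rate <= error_budget}

# One slow run out of four breaches a 30-minute SLO.
print(slo_report([12, 14, 31, 13], slo_minutes=30))
# breach_rate 0.25 -> outside a 5% error budget, so this would page someone
```

The same shape works for data-quality SLOs (share of records passing checks) and freshness SLOs (minutes since last successful load).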
On the cost side, FinOps is built in from day one: right-sizing clusters, auto-termination policies, spending quotas, and unit-cost budgets per workload. Unity Catalog enforces access policies as code, so governance scales without manual overhead.
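A unit-cost budget per workload is just spend divided by work done, checked against a threshold. The figures below are hypothetical, purely to show the shape of the check:

```python
def unit_cost(total_spend, units_processed):
    """Cost per unit of work, e.g. euros per thousand records processed."""
    return total_spend / units_processed

def over_budget(total_spend, units_processed, budget_per_unit):
    """Flag a workload whose unit cost exceeds its agreed budget."""
    return unit_cost(total_spend, units_processed) > budget_per_unit

# Hypothetical workload: 450 euros to process 900 thousand-record batches,
# against a budget of 0.60 euros per batch.
print(over_budget(450.0, 900, 0.60))  # False: 0.50 per batch is within budget
```

The value of framing cost this way is that budgets survive growth: total spend may rise with volume while unit cost stays flat, which is exactly what "predictable spend" means at scale.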
The result? Fewer false positives in risk and compliance checks, thousands of operational hours saved, and predictable monthly spend. These guardrails were essential as we scaled from initial deployment to hundreds of production workloads.
Where do Databricks components fit without creating lock-in?
Databricks provides a cohesion layer, not handcuffs:
- Delta Lake standardizes our storage format: open, performant, and compatible with any compute engine.
- Delta Live Tables (now Lakeflow Declarative Pipelines) and Workflows give us reliable orchestration with automatic retries and quality checks.
- Unity Catalog proves who accessed what data and why, which is critical for audit and compliance.
- MLflow makes model deployment auditable and repeatable, from experimentation through production monitoring.
Because it's an open Lakehouse, you're not locked in. We regularly connect Power BI, external APIs, and other tools to the same Delta tables. You can reuse what works from your existing warehouse (star schemas, materialized views, BI reports) and refactor only when there's clear value. At the same time, you can layer in domain-specific IP for financial crime detection, document processing, pricing optimization, or SRE operations. The platform adapts to your needs rather than forcing you into rigid patterns.