Databricks: how a research lab built a platform data companies depend on

Executive overview

Most enterprise data is unstructured and scattered across incompatible sources. Before a company can run any analysis, it typically spends 80–90% of its effort just cleaning and unifying that data. Databricks was built to solve exactly that problem, at cloud scale.

Seven Berkeley researchers, convinced that cloud, data, and open source would all be big, founded Databricks in 2013. They had built Apache Spark, which began as a Berkeley research project in 2009, then commercialised it with a proprietary higher-performance layer and kept expanding outward into adjacent problems their customers already had.

The core insight: open source wins adoption; the only way to monetise it is to build a genuinely better paid product — not just add enterprise features, but make the core thing faster and more capable.

The founding bets and why they paid off

Seven co-founders from Berkeley's AMPLab, working alongside early cloud computing researchers
Three simultaneous bets: cloud will be big, data will be big, open source is a viable go-to-market
One co-founder created Apache Spark; Databricks was built to commercialise it
Named Databricks — not Spark — to signal the company would always be more than one technology
Deliberately chose not to build an on-prem version, even before cloud adoption was certain

Commercialising open source: hitting two home runs

The first home run: build open source technology that gets mainstream adoption
The second home run: build a business on top of it — far harder, and most companies fail here
The typical trap: monetise enterprise add-ons (SSO, governance) rather than the core product
Databricks' answer: build a proprietary implementation of Spark that was meaningfully faster and more reliable — and charge for that
Closest analogy: paying for the smarter LLM, not for ancillary features

Product expansion: from Spark to platform

After Spark, extended naturally to the data scientist's toolchain with MLflow (open source, model lifecycle management)
Introduced Delta, a first step toward transactional (ACID) database workloads
Marketed Delta with "Spark on ACID" T-shirts — an early sign of commercial and marketing instinct
Introduced a SQL data warehouse product that competes directly with Snowflake; now on pace for $1B in revenue
Coined the term lakehouse (combining data lake and data warehouse); initially ridiculed, it is now an accepted industry category
Evolution path: single product → multi-product → multi-persona (data engineers, data scientists, data analysts)

Databricks vs Snowflake: co-existence and competition

Many enterprises use both: Databricks processes raw data, Snowflake warehouses the structured output
Snowflake has tried to move upstream into data engineering; Databricks moved downstream into warehousing
Databricks' warehouse product has scaled faster than Snowflake's equivalent engineering product
Key strategic move: embraced open storage formats and did not charge for storage, forcing Snowflake to defend its proprietary storage lock-in
Customers can run Databricks on top of data where it already sits — no forced migration

Business model and economics

Usage-based pricing: customers pay for compute consumed across all workloads
Net dollar expansion rate above 140% — high retention and embedded growth
Stickiness drivers: mission-critical production use cases (fraud detection, recommendations), data gravity once catalogued, reuse of processed data across multiple models
Capital-light model: most workloads are CPU-based, not GPU-intensive
Running free cash flow positive at $4B+ ARR
R&D and people are the primary cost; no significant infrastructure capex
Large fundraising rounds primarily used to cover employee equity tax obligations as RSUs vest — not pure secondary liquidity
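The net dollar expansion figure above is simple arithmetic, and a small worked example may help. The sketch below assumes the common definition: recurring revenue today from the customers who were already paying a year ago, divided by what those same customers paid a year ago. The function name, customer labels, and dollar amounts are all hypothetical and are not Databricks data.

```python
def net_dollar_expansion(start_arr, end_arr):
    """Return end-of-period ARR from the starting cohort divided by its starting ARR.

    start_arr and end_arr map customer name -> annual recurring revenue.
    Customers missing from end_arr count as churned (0). Customers that
    appear only in end_arr are new and are deliberately excluded.
    """
    start_total = sum(start_arr.values())
    if start_total <= 0:
        raise ValueError("starting cohort must have positive ARR")
    end_total = sum(end_arr.get(name, 0) for name in start_arr)
    return end_total / start_total


# Hypothetical cohort: two customers expand their usage, one churns.
cohort_start = {"retailer": 100_000, "bank": 250_000, "startup": 50_000}
cohort_end = {"retailer": 200_000, "bank": 400_000, "new_logo": 75_000}

rate = net_dollar_expansion(cohort_start, cohort_end)
print(f"Net dollar expansion: {rate:.0%}")  # prints "Net dollar expansion: 150%"
```

In this illustration the cohort grows from $400,000 to $600,000 despite losing one customer, so a rate above 100% means expansion among retained customers outweighs churn. Under usage-based pricing that expansion comes from customers consuming more compute rather than from renegotiated contracts.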

AI tailwinds and new product bets

AI has already driven ~$1B of Databricks' ARR
Enterprises now widely accept: no AI strategy without a data strategy — a direct tailwind to Databricks' core business
AI-native companies (including major labs) use Databricks internally
The new product stack, Agent Bricks and Lakebase, targets enterprise agentic application development
Key capabilities being built: RAG pipelines, vector databases, embeddings, model evaluation and monitoring
Model serving (hosting inference endpoints for customers) is an emerging GPU-cost business, but still small relative to the core

Hyperscaler relationships: co-opetition

First major partnership was Azure Databricks — gave Databricks enterprise distribution at launch
Hyperscalers (AWS, Azure, GCP) all have competing data products, but customers using Databricks still consume hyperscaler compute and storage
Databricks has deliberately never positioned itself as an existential threat to hyperscaler revenue
The co-opetition dynamic has remained synergistic: enough alignment to avoid hyperscalers going all-in to kill the product

Risks and what to watch

Execution at scale across a rapidly expanding product surface area is not guaranteed
The AI product portfolio (Agent Bricks, Lakebase) still needs to achieve the same category-defining success as the lakehouse
Maintaining long-term, first-principles decision-making as the company grows is a cultural risk
Staying private has insulated them from short-term pressure; going public would change that calculus
Every short-term compromise (monetise sooner, build on-prem, name the company after the product) opens downstream vulnerability — the pattern holds going forward too

Long-termism as competitive advantage

Named the company Databricks, not Spark, to preserve optionality
Never built on-prem, even when cloud adoption was uncertain
Gave away storage rather than monetise it, to undercut Snowflake's architecture
Open-sourced strategically to drive adoption, then charged only for genuine performance differentiation
Treated each category bet (lakehouse, agentic) as a conviction-based long-term call, not a TAM expansion exercise
