Databricks: how a research lab built a platform data companies depend on

Executive overview

Most enterprise data is unstructured and scattered across incompatible sources. Before a company can run any analysis, it must spend 80–90% of its effort just cleaning and unifying that data. Databricks was built to solve exactly that problem, at cloud scale.

Seven Berkeley researchers, convinced that cloud, data, and open source would all be big, founded Databricks in 2013. They had built Apache Spark at Berkeley starting in 2009, commercialised it with a proprietary higher-performance layer, and kept expanding outward into adjacent problems their customers already had.

The core insight: open source wins adoption; the only way to monetise it is to build a genuinely better paid product — not just add enterprise features, but make the core thing faster and more capable.

The founding bets and why they paid off

  • Seven co-founders from Berkeley's AMPLab, working alongside early cloud computing researchers
  • Three simultaneous bets: cloud will be big, data will be big, open source is a viable go-to-market
  • One co-founder created Apache Spark; Databricks was built to commercialise it
  • Named Databricks — not Spark — to signal the company would always be more than one technology
  • Deliberately chose not to build an on-prem version, even before cloud adoption was certain

Commercialising open source: hitting two home runs

  • The first home run: build open source technology that gets mainstream adoption
  • The second home run: build a business on top of it — far harder, and most companies fail here
  • The typical trap: monetise enterprise add-ons (SSO, governance) rather than the core product
  • Databricks' answer: build a proprietary implementation of Spark that was meaningfully faster and more reliable — and charge for that
  • Closest analogy: paying for the smarter LLM, not for ancillary features

Product expansion: from Spark to platform

  • After Spark, extended naturally to the data scientist's toolchain with MLflow (open source, model lifecycle management)
  • Introduced Delta, a first step toward transactional (ACID) database workloads
  • Marketed Delta with "Spark on ACID" T-shirts — an early sign of commercial and marketing instinct
  • Introduced a SQL data warehouse product that competes directly with Snowflake; now on pace for $1B in revenue
  • Coined the term lakehouse (combining data lake and data warehouse), initially ridiculed and now an accepted industry category
  • Evolution path: single product → multi-product → multi-persona (data engineers, data scientists, data analysts)

Databricks vs Snowflake: co-existence and competition

  • Many enterprises use both: Databricks processes raw data, Snowflake warehouses the structured output
  • Snowflake has tried to move upstream into data engineering; Databricks moved downstream into warehousing
  • Databricks' warehouse product has scaled faster than Snowflake's competing data engineering product
  • Key strategic move: embraced open storage formats and did not charge for storage, forcing Snowflake to defend its proprietary storage lock-in
  • Customers can run Databricks on top of data where it already sits — no forced migration

Business model and economics

  • Usage-based pricing: customers pay for compute consumed across all workloads
  • Net dollar expansion rate above 140%: existing customers as a group spend over 40% more than they did a year earlier, which signals high retention and embedded growth
  • Stickiness drivers: mission-critical production use cases (fraud detection, recommendations), data gravity once catalogued, reuse of processed data across multiple models
  • Capital-light model: most workloads are CPU-based, not GPU-intensive
  • Running free cash flow positive at $4B+ ARR
  • R&D and people are the primary cost; no significant infrastructure capex
  • Large fundraising rounds primarily used to cover employee equity tax obligations as RSUs vest — not pure secondary liquidity
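
The net dollar expansion figure above is simple arithmetic, and a minimal sketch may make it concrete. The Customer type, the function name, and the four-customer cohort below are hypothetical and chosen only so the numbers come out near the 140% mentioned; they are not Databricks data, and companies differ in exactly how they define the cohort and the period.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    arr_year_ago: float  # annualised revenue 12 months ago
    arr_today: float     # annualised revenue now (0.0 if the customer churned)


def net_dollar_expansion(cohort: list[Customer]) -> float:
    """Revenue today from customers who existed a year ago, divided by
    what those same customers paid a year ago. Customers acquired since
    then are not in the cohort, so new sales do not inflate the rate."""
    start = sum(c.arr_year_ago for c in cohort)
    if start <= 0:
        raise ValueError("cohort must have positive starting revenue")
    return sum(c.arr_today for c in cohort) / start


# Illustrative cohort only: three customers grow usage, one churns.
cohort = [
    Customer("a", 100.0, 180.0),
    Customer("b", 200.0, 260.0),
    Customer("c", 100.0, 0.0),
    Customer("d", 100.0, 270.0),
]
rate = net_dollar_expansion(cohort)
print(f"{rate:.0%}")  # prints 142%
```

Under usage-based pricing the rate can exceed 100% without any new contract being signed: customers that run more workloads simply consume more compute, and that growth can outweigh the revenue lost to churn, as customer "c" shows here.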

AI tailwinds and new product bets

  • AI has already driven ~$1B of Databricks' ARR
  • Enterprises now widely accept: no AI strategy without a data strategy — a direct tailwind to Databricks' core business
  • AI-native companies (including major labs) use Databricks internally
  • New product stack (Agent Bricks and Lakebase) targets enterprise agentic application development
  • Key capabilities being built: RAG pipelines, vector databases, embeddings, model evaluation and monitoring
  • Model serving (hosting inference endpoints for customers) is an emerging GPU-cost business, but still small relative to the core

Hyperscaler relationships: co-opetition

  • First major partnership was Azure Databricks — gave Databricks enterprise distribution at launch
  • Hyperscalers (AWS, Azure, GCP) all have competing data products, but customers using Databricks still consume hyperscaler compute and storage
  • Databricks has deliberately never positioned itself as an existential threat to hyperscaler revenue
  • The co-opetition dynamic has remained synergistic: enough alignment to avoid hyperscalers going all-in to kill the product

Risks and what to watch

  • Execution at scale across a rapidly expanding product surface area is not guaranteed
  • The AI product portfolio (Agent Bricks, Lakebase) still needs to achieve the same category-defining success as the lakehouse
  • Maintaining long-term, first-principles decision-making as the company grows is a cultural risk
  • Staying private has insulated them from short-term pressure; going public would change that calculus
  • Every short-term compromise (monetise sooner, build on-prem, name the company after the product) opens downstream vulnerability — the pattern holds going forward too

Long-termism as competitive advantage

  • Named the company Databricks not Spark to preserve optionality
  • Never built on-prem, even when cloud adoption was uncertain
  • Gave away storage rather than monetise it, to undercut Snowflake's architecture
  • Open-sourced strategically to drive adoption, then charged only for genuine performance differentiation
  • Treated each category bet (lakehouse, agentic) as a conviction-based long-term call, not a TAM expansion exercise
