Databricks: how a research lab built a platform data companies depend on
Executive overview
Most enterprise data is unstructured and scattered across incompatible sources. Before a company can run any analysis, it must spend 80–90% of its effort just cleaning and unifying that data. Databricks was built to solve exactly that problem, at cloud scale.
Seven Berkeley researchers, convinced that cloud, data, and open source would all be big, founded Databricks in 2013. Their group had started Apache Spark at Berkeley in 2009; the company commercialised it with a proprietary higher-performance layer and kept expanding outward into adjacent problems its customers already had.
The core insight: open source wins adoption; the only way to monetise it is to build a genuinely better paid product — not just add enterprise features, but make the core thing faster and more capable.
The founding bets and why they paid off
- Seven co-founders from Berkeley's AMPLab, working alongside early cloud computing researchers
- Three simultaneous bets: cloud will be big, data will be big, open source is a viable go-to-market
- One co-founder created Apache Spark; Databricks was built to commercialise it
- Named Databricks — not Spark — to signal the company would always be more than one technology
- Deliberately chose not to build an on-prem version, even before cloud adoption was certain
Commercialising open source: hitting two home runs
- The first home run: build open source technology that gets mainstream adoption
- The second home run: build a business on top of it — far harder, and most companies fail here
- The typical trap: monetise enterprise add-ons (SSO, governance) rather than the core product
- Databricks' answer: build a proprietary implementation of Spark that was meaningfully faster and more reliable — and charge for that
- Closest analogy: paying for the smarter LLM, not for ancillary features
Product expansion: from Spark to platform
- After Spark, extended naturally to the data scientist's toolchain with MLflow (open source, model lifecycle management)
- Introduced Delta, a first step toward transactional (ACID) database workloads
- Marketed Delta with "Spark on ACID" T-shirts — an early sign of commercial and marketing instinct
- Introduced a SQL data warehouse product that competes directly with Snowflake; now on pace for $1B in revenue
- Coined the term lakehouse (combining data lake and data warehouse), which was initially ridiculed and is now an accepted industry category
- Evolution path: single product → multi-product → multi-persona (data engineers, data scientists, data analysts)
Databricks vs Snowflake: co-existence and competition
- Many enterprises use both: Databricks processes raw data, Snowflake warehouses the structured output
- Snowflake has tried to move upstream into data engineering; Databricks moved downstream into warehousing
- Databricks' warehouse product has scaled faster than Snowflake's equivalent engineering product
- Key strategic move: embraced open storage formats and did not charge for storage, forcing Snowflake to defend its proprietary storage lock-in
- Customers can run Databricks on top of data where it already sits — no forced migration
Business model and economics
- Usage-based pricing: customers pay for compute consumed across all workloads
- Net dollar expansion rate above 140% — high retention and embedded growth
- Stickiness drivers: mission-critical production use cases (fraud detection, recommendations), data gravity once catalogued, reuse of processed data across multiple models
- Capital-light model: most workloads are CPU-based, not GPU-intensive
- Running free cash flow positive at $4B+ ARR
- R&D and people are the primary cost; no significant infrastructure capex
- Large fundraising rounds primarily used to cover employee equity tax obligations as RSUs vest — not pure secondary liquidity
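The net dollar expansion figure above can be made concrete with a small calculation. The sketch below uses one common definition (recurring revenue a year later from the customers who were already paying at the start, divided by their starting recurring revenue); the source does not say which definition Databricks uses, and the function name, customer names, and dollar figures are all hypothetical.

```python
def net_dollar_expansion(start_arr, end_arr):
    """Return net dollar expansion for a starting cohort of customers.

    start_arr: mapping of customer -> annual recurring revenue at period start
    end_arr:   mapping of customer -> annual recurring revenue at period end

    Assumption: only customers present at the start count. Churned customers
    contribute 0 at the end, and customers acquired during the period are
    ignored, so the ratio reflects growth inside the existing base.
    """
    start_total = sum(start_arr.values())
    if start_total <= 0:
        raise ValueError("starting cohort must have positive revenue")
    end_total = sum(end_arr.get(customer, 0.0) for customer in start_arr)
    return end_total / start_total


# Hypothetical cohort: two customers expand their usage, one churns,
# and one new customer signs up (excluded from the calculation).
start = {"acme": 100.0, "globex": 200.0, "initech": 100.0}
end = {"acme": 200.0, "globex": 360.0, "newco": 500.0}

rate = net_dollar_expansion(start, end)
print(f"Net dollar expansion: {rate:.0%}")
```

Under usage-based pricing, a ratio above 100% means existing customers consumed more compute than the revenue lost to churn and contraction, which is why the metric is read as a sign of both retention and embedded growth.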
AI tailwinds and new product bets
- AI has already driven ~$1B of Databricks' ARR
- Enterprises now widely accept: no AI strategy without a data strategy — a direct tailwind to Databricks' core business
- AI-native companies (including major labs) use Databricks internally
- New product stack, Agent Bricks and Lakebase, targets enterprise agentic application development
- Key capabilities being built: RAG pipelines, vector databases, embeddings, model evaluation and monitoring
- Model serving (hosting inference endpoints for customers) is an emerging GPU-cost business, but still small relative to the core
Hyperscaler relationships: co-opetition
- First major partnership was Azure Databricks — gave Databricks enterprise distribution at launch
- Hyperscalers (AWS, Azure, GCP) all have competing data products, but customers using Databricks still consume hyperscaler compute and storage
- Databricks has deliberately never positioned itself as an existential threat to hyperscaler revenue
- The co-opetition dynamic has remained synergistic: enough alignment to avoid hyperscalers going all-in to kill the product
Risks and what to watch
- Execution at scale across a rapidly expanding product surface area is not guaranteed
- The AI product portfolio (Agent Bricks, Lakebase) still needs to achieve the same category-defining success as the lakehouse
- Maintaining long-term, first-principles decision-making as the company grows is a cultural risk
- Staying private has insulated the company from short-term pressure; going public would change that calculus
- Every short-term compromise (monetise sooner, build on-prem, name the company after the product) opens downstream vulnerability — the pattern holds going forward too
Long-termism as competitive advantage
- Named the company Databricks not Spark to preserve optionality
- Never built on-prem, even when cloud adoption was uncertain
- Gave away storage rather than monetise it, to undercut Snowflake's architecture
- Open-sourced strategically to drive adoption, then charged only for genuine performance differentiation
- Treated each category bet (lakehouse, agentic) as a conviction-based long-term call, not a TAM expansion exercise