Seven UC Berkeley researchers built the tool they wished existed for handling massive datasets, then realized every Fortune 500 company wished it existed too. Databricks went from an open-source research project to a $62 billion data juggernaut in a decade — and they did it by giving away the core technology for free and charging for the premium version. It's the "give them the razor, sell them the blades" playbook, except the razor processes petabytes.
Founded
2013
HQ
San Francisco, California
Total Raised
About $14 billion
Founders
Ali Ghodsi, Andy Konwinski, Arsalan Tavakoli-Shiraji, Ion Stoica, Matei Zaharia, Patrick Wendell, Reynold Xin
Status
Private ($62B valuation)
Website
www.databricks.com
THE ORIGIN STORY
Databricks started as a research project at UC Berkeley's AMPLab around 2009. Matei Zaharia, a PhD student, was frustrated with how slow Hadoop MapReduce was for iterative machine learning workloads.
His answer was Apache Spark — an open-source engine that could process data up to 100x faster than MapReduce by keeping data in memory instead of writing to disk after every step.
Spark took off fast in the open-source community. By 2013, it was among the most active open-source projects in big data.
Zaharia and six Berkeley colleagues — Ali Ghodsi, Andy Konwinski, Arsalan Tavakoli-Shiraji, Ion Stoica, Patrick Wendell, and Reynold Xin — decided to build a company around it. They incorporated Databricks in 2013 with the idea that Spark was powerful but brutally hard to set up and manage.
The company would offer a managed cloud platform that made Spark accessible to data teams who weren't distributed systems engineers.
Their first product was essentially "Spark as a service" — a collaborative notebook environment where data scientists and engineers could write Spark jobs without managing clusters. The bet was that enterprises had massive data problems but not enough PhDs to solve them.
They were right.
WHAT THEY ACTUALLY DO
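Databricks runs on a consumption-based pricing model. Companies pay for the compute they actually use on the Databricks platform, metered in "Databricks Units" (DBUs), a normalized measure of processing capability. The underlying cloud infrastructure and storage are typically billed by the cloud provider.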
The more data you process, the more you pay. This is brilliant because it means revenue grows automatically as customers' data volumes grow — which in the age of AI, they always do.
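The consumption model above comes down to simple arithmetic. Here is a minimal sketch of it; the workload names, DBU burn rates, and per-DBU prices are all hypothetical placeholders rather than actual Databricks list prices, and the separate cloud-provider bill is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    dbus_per_hour: float  # DBUs the cluster consumes per hour (hypothetical)
    hours: float          # hours the cluster ran during the billing period
    usd_per_dbu: float    # price per DBU for this workload type (hypothetical)


def dbu_cost(w: Workload) -> float:
    """Platform charge for one workload: DBUs consumed times the per-DBU rate."""
    return w.dbus_per_hour * w.hours * w.usd_per_dbu


def monthly_bill(workloads: list[Workload]) -> float:
    """Sum the DBU charges across all workloads in the period."""
    return sum(dbu_cost(w) for w in workloads)


workloads = [
    Workload("nightly-etl", dbus_per_hour=8.0, hours=60.0, usd_per_dbu=0.25),
    Workload("sql-dashboards", dbus_per_hour=4.0, hours=200.0, usd_per_dbu=0.50),
]

print(f"${monthly_bill(workloads):,.2f}")
```

Because every term is multiplicative, doubling the hours a cluster runs (for example, because data volume doubled) doubles that workload's charge with no change to the contract.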
The platform runs on top of the major cloud providers — AWS, Azure, and Google Cloud. Databricks doesn't own servers.
They're a software layer that makes those clouds dramatically more useful for data work. They take a margin on top of the underlying cloud compute costs, essentially acting as a "toll booth" between companies and their data.
They also pioneered the "lakehouse" architecture, a mashup of data warehouses (structured, fast querying) and data lakes (cheap, handles any data format). Before the lakehouse, most companies had to maintain both and copy data between them.
The lakehouse collapses them into one system. This isn't just clever marketing — it genuinely saves enterprises millions in duplicate infrastructure.
THE PRODUCTS
Unity Catalog: a unified governance layer that lets companies manage permissions, lineage, and access control across all their data and AI assets in one place.
Delta Lake: an open-source storage layer that brings reliability to data lakes with ACID transactions, schema enforcement, and time travel (you can query a table as it existed at an earlier version or timestamp, as long as that history has been retained).
Databricks SQL: a serverless SQL analytics product that competes directly with Snowflake on its home turf.
Mosaic AI: their machine learning and generative AI platform, supercharged after acquiring MosaicML in 2023 for $1.3 billion.
Databricks Notebooks: collaborative workspaces where data teams write code, visualize results, and build pipelines together in real time.
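The time travel idea mentioned for Delta Lake is easier to see in miniature. The sketch below is a toy model, not Delta Lake's implementation or API: the `VersionedTable` class and its methods are hypothetical. It assumes only the general idea that every commit is appended to an ordered log and that a read replays the log up to the requested version. In Delta Lake itself, the equivalent read is exposed through options such as `versionAsOf` and `timestampAsOf`.

```python
class VersionedTable:
    """Toy key-value table: commits append to a log, reads replay the log."""

    def __init__(self):
        self._log = []  # one (upserts, deletes) entry per committed version

    def commit(self, upserts=None, deletes=()):
        """Record one change set as a single unit and return its version number."""
        self._log.append((dict(upserts or {}), tuple(deletes)))
        return len(self._log) - 1

    def read(self, version=None):
        """Return the table contents as of `version` (latest if omitted)."""
        if version is None:
            version = len(self._log) - 1
        elif not 0 <= version < len(self._log):
            raise ValueError(f"unknown version: {version}")
        rows = {}
        # Replay every commit up to and including the requested version.
        for upserts, deletes in self._log[: version + 1]:
            rows.update(upserts)
            for key in deletes:
                rows.pop(key, None)
        return rows


table = VersionedTable()
v0 = table.commit({"acct-1": 100, "acct-2": 250})
v1 = table.commit({"acct-1": 175})
v2 = table.commit(deletes=["acct-2"])

print(table.read(v0))  # the table as first written
print(table.read())    # the latest state
```

Nothing is overwritten in place, which is why older versions stay readable. The trade-off is that history takes space, so real systems eventually prune it.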
HOW THEY GREW
Databricks grew by being genuinely useful before being profitable. They contributed massively to Apache Spark's open-source ecosystem, which meant thousands of companies were already using Spark when Databricks offered to manage it for them.
The open-source-to-enterprise pipeline is the most powerful go-to-market motion in software.
They also bet big on partnerships. The Microsoft partnership was transformational — Azure Databricks became a first-party service on Azure, meaning Microsoft's sales force was effectively selling Databricks to every enterprise customer.
That single deal probably added billions in annual recurring revenue.
Acquisitions were strategic and well-timed. MosaicML in 2023 for $1.3 billion gave them proprietary AI training capabilities right when every enterprise wanted to build custom AI models.
Tabular in 2024 brought the creators of Apache Iceberg, another critical open-source table format. They bought the talent and the technology simultaneously.
THE HARD PART
The elephant in the room is Snowflake. Both companies want to be the single platform where enterprises do all their data work, and the overlap is growing fast.
Snowflake started in SQL analytics and is pushing into data engineering and ML. Databricks started in data engineering and ML and is pushing into SQL analytics.
The collision is inevitable and expensive — both are spending billions on sales and R&D.
There's also the cloud provider threat. AWS, Azure, and Google Cloud all have their own data analytics services and could theoretically squeeze Databricks by making their native tools better or cheaper.
Databricks runs ON these clouds, which means their biggest partners are also their biggest potential competitors. It's the classic platform risk problem.
So far, Databricks has stayed ahead by innovating faster than the cloud providers' internal teams, but it's a race that never ends.
MONEY TRAIL
Series A
2013 · Led by Andreessen Horowitz
$14M raised
Series B
2014 · Led by Andreessen Horowitz
$33M raised
Series C
2016 · Led by NEA
$60M raised
Series D
2017 · Led by Andreessen Horowitz
$140M raised
Series E
2019 · Led by Andreessen Horowitz
$250M raised
$2.75B valuation
Series F
2019 · Led by Andreessen Horowitz
$400M raised
$6.2B valuation
Series G
2021 · Led by Franklin Templeton
$1.0B raised
$28.0B valuation
Series H
2021 · Led by Morgan Stanley
$1.6B raised
$38.0B valuation
Series I
2023 · Led by T. Rowe Price
$500M raised
$43.0B valuation
Series J
2024 · Led by Thrive Capital
$10.0B raised
$62.0B valuation
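As a quick check on the rounds listed above, the short sketch below adds up the amounts raised and shows how much of the total came from the Series J alone. The figures are copied from the list; the variable names are only illustrative.

```python
# Amount raised per round, in millions of USD, as listed above.
rounds_musd = {
    "Series A": 14,
    "Series B": 33,
    "Series C": 60,
    "Series D": 140,
    "Series E": 250,
    "Series F": 400,
    "Series G": 1_000,
    "Series H": 1_600,
    "Series I": 500,
    "Series J": 10_000,
}

total_musd = sum(rounds_musd.values())
series_j_share = rounds_musd["Series J"] / total_musd

print(f"Total raised: ${total_musd / 1000:.1f}B")
print(f"Series J share of total: {series_j_share:.0%}")
```

The listed rounds sum to just under $14 billion, and roughly seven of every ten dollars arrived in the final round.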
WHO BACKED THEM
Andreessen Horowitz led multiple early rounds and has been the longest-standing institutional backer. Microsoft made a massive strategic investment alongside the Azure Databricks partnership.
T. Rowe Price, Tiger Global, and Franklin Templeton participated in later growth rounds.
NEA was an early investor. The $10 billion Series J in 2024 valued the company at $62 billion and was led by Thrive Capital with participation from Andreessen Horowitz, DST Global, GIC, Insight Partners, and WCM Investment Management.