DATABRICKS

Netfigo Verdict on Databricks

Seven UC Berkeley researchers built the tool they wished existed for handling massive datasets, then realized every Fortune 500 company wished it existed too. Databricks went from an open-source research project to a $62 billion data juggernaut in a decade — and they did it by giving away the core technology for free and charging for the premium version. It's the "give them the razor, sell them the blades" playbook, except the razor processes petabytes.

Founded

2013

San Francisco, California

Total Raised

About $14 billion (sum of the rounds listed under Money Trail)

Founders

Ali Ghodsi, Andy Konwinski, Arsalan Tavakoli-Shiraji, Ion Stoica, Matei Zaharia, Patrick Wendell, Reynold Xin

Status

Private ($62B valuation)

Website

www.databricks.com

THE ORIGIN STORY

Databricks started as a research project at UC Berkeley's AMPLab around 2009. Matei Zaharia, a PhD student, was frustrated with how slow Hadoop MapReduce was for iterative machine learning workloads.

His answer was Apache Spark — an open-source engine that could process data up to 100x faster than MapReduce by keeping data in memory instead of writing to disk after every step.
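
To make the in-memory idea concrete, here is a toy Python sketch. It is not Spark or MapReduce code: the update rule, function names, and file name are all made up for illustration. It runs the same iterative computation twice, once persisting every intermediate result to disk and reloading it, and once keeping the working set in memory, then counts the disk round trips each approach needs.

```python
import json
import os
import tempfile


def step(values):
    """One iteration of a made-up update rule (illustrative only)."""
    return [0.5 * v + 1.0 for v in values]


def run_with_disk(values, n_steps, workdir):
    """Write each intermediate result to disk and read it back before the next step."""
    path = os.path.join(workdir, "intermediate.json")  # hypothetical file name
    round_trips = 0
    for _ in range(n_steps):
        with open(path, "w") as f:
            json.dump(step(values), f)
        with open(path) as f:
            values = json.load(f)
        round_trips += 1
    return values, round_trips


def run_in_memory(values, n_steps):
    """Keep the working set in memory between steps; no disk round trips."""
    for _ in range(n_steps):
        values = step(values)
    return values, 0


data = [float(i) for i in range(5)]
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_trips = run_with_disk(data, 10, workdir)
mem_result, mem_trips = run_in_memory(data, 10)

print("same result:", disk_result == mem_result)
print("disk round trips:", disk_trips, "vs", mem_trips)
```

Both paths compute the same answer; the difference is the per-iteration I/O, which is what dominates when an algorithm loops over the same data many times.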

Spark took off fast in the open-source community. By 2013, it was the most active open-source project in big data.

Zaharia and six Berkeley colleagues — Ali Ghodsi, Andy Konwinski, Arsalan Tavakoli-Shiraji, Ion Stoica, Patrick Wendell, and Reynold Xin — decided to build a company around it. They incorporated Databricks in 2013 with the idea that Spark was powerful but brutally hard to set up and manage.

The company would offer a managed cloud platform that made Spark accessible to data teams who weren't distributed systems engineers.

Their first product was essentially "Spark as a service" — a collaborative notebook environment where data scientists and engineers could write Spark jobs without managing clusters. The bet was that enterprises had massive data problems but not enough PhDs to solve them.

They were right.

WHAT THEY ACTUALLY DO

Databricks runs on a consumption-based pricing model. Companies pay for the compute and storage they actually use on the Databricks platform, measured in "Databricks Units" (DBUs).

The more data you process, the more you pay. That ties revenue directly to customers' data volumes, which, in the age of AI, almost always grow.
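
The consumption model can be sketched as simple arithmetic. The snippet below is a minimal illustration, not Databricks' actual billing logic: the workload names, the DBU-per-hour figures, and the price per DBU are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    dbu_per_hour: float  # DBUs consumed per hour (hypothetical figure)
    hours: float         # hours the workload ran in the billing period


def dbu_cost(workloads, price_per_dbu):
    """Return total DBUs consumed and the resulting charge."""
    total_dbus = sum(w.dbu_per_hour * w.hours for w in workloads)
    return total_dbus, total_dbus * price_per_dbu


workloads = [
    Workload("nightly_etl", dbu_per_hour=8.0, hours=30.0),
    Workload("ad_hoc_sql", dbu_per_hour=4.0, hours=50.0),
]

# price_per_dbu is an assumed rate, chosen only to keep the numbers round
total_dbus, charge = dbu_cost(workloads, price_per_dbu=0.50)
print(f"{total_dbus:.0f} DBUs -> ${charge:.2f}")  # 440 DBUs -> $220.00
```

Doubling the hours a workload runs doubles its DBUs and its charge, which is why usage growth flows straight through to revenue.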

The platform runs on top of the major cloud providers — AWS, Azure, and Google Cloud. Databricks doesn't own servers.

They're a software layer that makes those clouds dramatically more useful for data work. They take a margin on top of the underlying cloud compute costs, essentially acting as a "toll booth" between companies and their data.

They also pioneered the "lakehouse" architecture, a mashup of data warehouses (structured, fast querying) and data lakes (cheap, handles any data format). Before the lakehouse, companies typically had to maintain both.

The lakehouse collapses them into one system. This isn't just clever marketing — it genuinely saves enterprises millions in duplicate infrastructure.

THE PRODUCTS

Unity Catalog: a universal governance layer that lets companies manage permissions, lineage, and access control across all their data and AI assets in one place.

Delta Lake: an open-source storage layer that brings reliability to data lakes with ACID transactions, schema enforcement, and time travel (yes, you can query your data as it existed at any point in the past).

Databricks SQL: a serverless SQL analytics product that competes directly with Snowflake on its home turf.

Mosaic AI: their machine learning and generative AI platform, supercharged after acquiring MosaicML in 2023 for $1.3 billion.

Databricks Notebooks: collaborative workspaces where data teams write code, visualize results, and build pipelines together in real time.
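
The Delta Lake ideas mentioned above (all-or-nothing commits, schema enforcement, and time travel) can be illustrated with a toy in-memory table. This is a conceptual sketch only: the `VersionedTable` class and its methods are hypothetical and do not reflect Delta Lake's actual API or storage format.

```python
class VersionedTable:
    """Toy append-only commit log. NOT Delta Lake's real implementation."""

    def __init__(self, schema):
        self.schema = dict(schema)  # column name -> expected Python type
        self._commits = []          # each commit is a list of validated rows

    def append(self, rows):
        """Validate every row first, then commit all of them or none."""
        rows = [dict(r) for r in rows]
        for row in rows:
            if set(row) != set(self.schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for col, typ in self.schema.items():
                if not isinstance(row[col], typ):
                    raise TypeError(f"{col!r} expects {typ.__name__}")
        self._commits.append(rows)
        return len(self._commits)   # the new version number

    def read(self, version=None):
        """Return the rows as of `version` (default: the latest version)."""
        if version is None:
            version = len(self._commits)
        if not 0 <= version <= len(self._commits):
            raise ValueError("unknown version")
        return [row for commit in self._commits[:version] for row in commit]


table = VersionedTable({"user": str, "amount": int})
v1 = table.append([{"user": "ana", "amount": 10}])
v2 = table.append([{"user": "bo", "amount": 25}])

try:
    # Schema enforcement: a string in an int column rejects the whole commit.
    table.append([{"user": "cy", "amount": "oops"}])
except TypeError as err:
    print("rejected:", err)

print("latest rows:", len(table.read()))            # 2
print("rows at v1:", len(table.read(version=v1)))   # 1
```

Because old commits are never modified, reading an earlier version is just a matter of replaying fewer commits, which is the intuition behind querying data "as it existed" in the past.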

HOW THEY GREW

Databricks grew by being genuinely useful before being profitable. They contributed massively to Apache Spark's open-source ecosystem, which meant thousands of companies were already using Spark when Databricks offered to manage it for them.

The open-source-to-enterprise pipeline is one of the most powerful go-to-market motions in software.

They also bet big on partnerships. The Microsoft partnership was transformational — Azure Databricks became a first-party service on Azure, meaning Microsoft's sales force was effectively selling Databricks to every enterprise customer.

That single deal probably added billions in annual recurring revenue.

Acquisitions were strategic and well-timed. MosaicML in 2023 for $1.3 billion gave them proprietary AI training capabilities right when every enterprise wanted to build custom AI models.

Tabular in 2024 brought the creators of Apache Iceberg, another critical open-source data format. They bought the talent and the technology simultaneously.

THE HARD PART

The elephant in the room is Snowflake. Both companies want to be the single platform where enterprises do all their data work, and the overlap is growing fast.

Snowflake started in SQL analytics and is pushing into data engineering and ML. Databricks started in data engineering and ML and is pushing into SQL analytics.

The collision is inevitable and expensive — both are spending billions on sales and R&D.

There's also the cloud provider threat. AWS, Azure, and Google Cloud all have their own data analytics services and could theoretically squeeze Databricks by making their native tools better or cheaper.

Databricks runs ON these clouds, which means their biggest partners are also their biggest potential competitors. It's the classic platform risk problem.

So far, Databricks has stayed ahead by innovating faster than the cloud providers' internal teams, but it's a race that never ends.

MONEY TRAIL

Series A

2013 · Led by Andreessen Horowitz

$14M raised

Series B

2014 · Led by NEA

$33M raised

Series C

2016 · Led by NEA

$60M raised

Series D

2017 · Led by Andreessen Horowitz

$140M raised

Series E

2019 · Led by Andreessen Horowitz

$250M raised

$2.75B valuation

Series F

2019 · Led by Andreessen Horowitz

$400M raised

$6.2B valuation

Series G

2021 · Led by Franklin Templeton

$1.0B raised

$28.0B valuation

Series H

2021 · Led by Morgan Stanley

$1.6B raised

$38.0B valuation

Series I

2023 · Led by T. Rowe Price

$500M raised

$43.0B valuation

Series J

2024 · Led by Thrive Capital

$10.0B raised

$62.0B valuation
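
As a quick sanity check on the rounds listed above, the short Python snippet below sums the amounts raised (in millions of dollars, exactly as listed) and shows how much of the total came from Series J. The variable names are illustrative.

```python
# Amounts raised per round in $M, copied from the list above
rounds_musd = {
    "Series A": 14,
    "Series B": 33,
    "Series C": 60,
    "Series D": 140,
    "Series E": 250,
    "Series F": 400,
    "Series G": 1000,
    "Series H": 1600,
    "Series I": 500,
    "Series J": 10000,
}

total_musd = sum(rounds_musd.values())
series_j_share = rounds_musd["Series J"] / total_musd

print(f"Total raised: ${total_musd:,}M (about ${total_musd / 1000:.1f}B)")
print(f"Series J share of total: {series_j_share:.0%}")
```

The listed rounds add up to $13,997M, or roughly $14 billion, with Series J alone accounting for about 71% of it.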

WHO BACKED THEM

Andreessen Horowitz led multiple early rounds and has been the longest-standing institutional backer. Microsoft made a massive strategic investment alongside the Azure Databricks partnership.

T. Rowe Price, Tiger Global, and Franklin Templeton participated in later growth rounds.

NEA was an early investor. The $10 billion Series J in 2024 valued the company at $62 billion and was led by Thrive Capital with participation from Andreessen Horowitz, DST Global, GIC, Insight Partners, and WCM Investment Management.

Related Profiles

Companies

Anthropic

AI ecosystem: Databricks' Mosaic AI platform integrates with multiple LLM providers

OpenAI

AI ecosystem partner — enterprises use Databricks to prepare data for OpenAI models

Snowflake

Direct competitor in data analytics and cloud data warehousing
