Part of the Map of Streaming Data Systems series

Overview of Apache Druid

Apache Druid is an open-source, distributed, real-time analytics database. It is used to power real-time interactive dashboards, data visualizations, and data-driven analytics.

Data is ingested into Druid in real time and becomes available for querying and analysis almost immediately. Any data that is organized by timestamp is well suited to Druid.
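
To make timestamp-organized streaming ingestion a little more concrete, here is a minimal sketch that builds a Kafka supervisor spec as a Python dictionary. The datasource name `page_events`, the topic `page-events`, the column names, and the broker address are all hypothetical, and the spec only follows the general shape of Druid's Kafka ingestion spec, so check the documentation for your Druid version before relying on it.

```python
import json


def kafka_supervisor_spec(datasource, topic, timestamp_column, dimensions,
                          bootstrap_servers="localhost:9092"):
    """Build a minimal Kafka supervisor spec as a plain dict.

    All argument values are supplied by the caller; nothing here talks
    to a real Druid cluster.
    """
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Tells Druid which input column holds the event time.
                "timestampSpec": {"column": timestamp_column, "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Hypothetical datasource, topic, and columns, for illustration only.
spec = kafka_supervisor_spec(
    datasource="page_events",
    topic="page-events",
    timestamp_column="timestamp",
    dimensions=["user_id", "country", "page"],
)
spec_json = json.dumps(spec, indent=2)
```

A spec like this would typically be submitted to the Overlord's supervisor endpoint (`/druid/indexer/v1/supervisor`); the sketch stops at producing the JSON so it can be inspected first.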

Why Druid?

Druid is a good fit when you need high-concurrency reads and fast query responses, for example to back a dashboard.

A traditional architecture would usually look like this:

traditional data architecture

Such a system is hard to scale, hard to maintain, and suffers from high latency.

This is where Druid comes in. An architecture using Druid will look like this:

Druid architecture

In the Druid architecture, data streams into a distributed, real-time database and is available for querying and analysis as soon as it is ingested.

Of course, this comes at a memory and CPU cost. That is why you should use Druid only where you need high-concurrency reads and fast dashboard queries, and keep your data warehouse for everything else.

Druid features

As with any solution, there are many factors to consider when choosing a product. Here are the main features of Druid:

  • Very low-latency data ingestion: Data is ingested in real time and is available for querying and analysis immediately.
  • High-concurrency reads
  • Scalable: You can scale a cluster to thousands of nodes.
  • Low query latency
  • SQL API for querying and analyzing data
  • Out-of-the-box integration with Apache Kafka, Kinesis, S3, and more.
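
As an illustration of the SQL API mentioned in the list above, the following sketch builds the JSON body for Druid's SQL endpoint (`/druid/v2/sql`) and includes a small helper to send it. The datasource `page_events`, the `page` column, and the base URL `http://localhost:8888` are assumptions made for this example; only `__time`, Druid's primary timestamp column, comes from Druid itself.

```python
import json
import urllib.request


def build_sql_request(datasource, since_hours=1, limit=10):
    """Return the JSON payload for a 'top pages' Druid SQL query."""
    # Keep the datasource name to simple identifiers, since it is
    # interpolated into the SQL text.
    if not datasource.replace("_", "").isalnum():
        raise ValueError(f"unexpected datasource name: {datasource!r}")
    query = (
        f'SELECT page, COUNT(*) AS views '
        f'FROM "{datasource}" '
        f"WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '{int(since_hours)}' HOUR "
        f"GROUP BY page ORDER BY views DESC LIMIT {int(limit)}"
    )
    return {"query": query, "resultFormat": "object"}


def post_sql(payload, base_url="http://localhost:8888"):
    """POST the payload to the SQL endpoint (base_url is an assumption)."""
    request = urllib.request.Request(
        base_url + "/druid/v2/sql",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Building the payload needs no running cluster; post_sql() does.
payload = build_sql_request("page_events", since_hours=2, limit=5)
```

Filtering on `__time` matters here: because Druid partitions data by time, a query that restricts the time range can skip data outside that range.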

When deciding whether to use Druid, also make sure that your data has the following characteristics:

  • Timestamp dimension: The data should be organized by timestamp for the best performance.
  • Streaming data.
  • Denormalized data: Druid can do joins, but for best performance, you should have a denormalized data model.
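
One way to picture the denormalized, timestamped shape described above is to flatten each event before it is sent to Druid. The sketch below joins a hypothetical user lookup into each event up front, so no join is needed at query time. The field names (`ts_ms`, `user_id`, `country`, `plan`, `page`) are invented for illustration.

```python
from datetime import datetime, timezone


def denormalize(event, users):
    """Flatten an event and its user attributes into one timestamped row."""
    user = users.get(event["user_id"], {})
    # Convert epoch milliseconds to an ISO 8601 UTC timestamp string.
    ts = datetime.fromtimestamp(event["ts_ms"] / 1000, tz=timezone.utc)
    return {
        "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "user_id": event["user_id"],
        # Copy the lookup attributes onto the row instead of joining later.
        "country": user.get("country", "unknown"),
        "plan": user.get("plan", "unknown"),
        "page": event["page"],
    }


# Hypothetical lookup table and event.
users = {"u1": {"country": "DE", "plan": "pro"}}
event = {"ts_ms": 1700000000000, "user_id": "u1", "page": "/pricing"}
row = denormalize(event, users)
```

The trade-off is that user attributes are frozen at ingestion time: if a user later changes plan, rows already ingested keep the old value.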

Druid use cases

Druid is designed for performance and often answers queries much faster than Presto or Hive. Here are some of its main use cases:

  • Application performance monitoring
  • Network monitoring
  • Risk analysis
  • OLAP / BI
  • Digital Advertising, Analytics, and Marketing
  • User behavior analytics and user event tracking

If we had to place Druid in a data pipeline, it would be a good choice for powering your real-time dashboards:

Druid data pipeline

Quick start demo

If you want to give Druid a try, here is a quick hands-on demo that will give you a good idea of how to use it:

Druid Quickstart Demo