Introduction to Change Data Capture

A practical overview of CDC in databases and streaming data systems.

Change data capture (CDC) is a design pattern focused on tracking changes to data in a database.

The goal of CDC is to track changes on a source database in a way that lets a target service act on them. The target service might perform real-time database replication, translate a relational database into an event stream, or take any other action triggered by changes to the database. [1]

CDC is most commonly implemented as a timestamped event log of changes made to a database.
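As a rough illustration of what "a timestamped event log" means, here is a minimal in-memory sketch. The names `ChangeEvent`, `ChangeLog`, `append`, and `read_from` are hypothetical and chosen for this example. A real CDC system would persist the log and usually derive events from the database's own transaction log rather than from application calls.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class ChangeEvent:
    action: str              # "insert", "update", or "delete"
    table: str
    data: dict[str, Any]     # the affected row (or the changed columns)
    timestamp: int           # seconds since the Unix epoch


@dataclass
class ChangeLog:
    """Append-only, in-memory log of change events (illustrative only)."""
    events: list[ChangeEvent] = field(default_factory=list)

    def append(self, action: str, table: str, data: dict[str, Any],
               timestamp: Optional[int] = None) -> ChangeEvent:
        # Default to the current time when the caller gives no timestamp.
        ts = int(time.time()) if timestamp is None else timestamp
        event = ChangeEvent(action, table, dict(data), ts)
        self.events.append(event)
        return event

    def read_from(self, offset: int) -> list[ChangeEvent]:
        """Return every event at or after the given position in the log."""
        return self.events[offset:]


log = ChangeLog()
log.append("insert", "customers", {"id": 123, "name": "Clark Kent"}, timestamp=1613121879)
log.append("update", "customers", {"id": 123, "address": "180 Main Street"}, timestamp=1613121900)

# A consumer that has already processed the first event resumes at offset 1.
pending = log.read_from(1)
```

Because the log is append-only and ordered, each consumer only needs to remember its own offset to resume where it left off.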

Origins

Change data capture was created to solve a specific problem: getting data out of a transactional database and into systems better suited to other workloads.

Companies typically have a main transactional database optimized for row-based workloads, but they also want to do things with the data that transactional databases handle poorly, such as fast search, graph queries, and analytical queries.

CDC helped solve this problem by translating changes in a primary OLTP database, in real time, into a versatile streaming event format that many other services, such as search indices, graph databases, and analytics databases, can ingest. [2]
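To make the "one stream, many consumers" idea concrete, the sketch below feeds the same list of change events to two toy targets: a dictionary standing in for a search index and a counter standing in for an analytics store. It assumes events are plain dictionaries with `action`, `table`, `data`, and `timestamp` fields. The function names and the sample rows are made up for illustration.

```python
from collections import Counter


def apply_to_search_index(index: dict, event: dict) -> None:
    """Keep a toy 'search index' (id -> name) in sync with the source table."""
    row = event["data"]
    if event["action"] == "delete":
        index.pop(row["id"], None)
    else:  # insert or update
        index[row["id"]] = row.get("name", index.get(row["id"]))


def apply_to_analytics(counts: Counter, event: dict) -> None:
    """Count how many changes of each kind were seen per table."""
    counts[(event["table"], event["action"])] += 1


events = [
    {"action": "insert", "table": "customers",
     "data": {"id": 123, "name": "Clark Kent"}, "timestamp": 1613121879},
    {"action": "insert", "table": "customers",
     "data": {"id": 124, "name": "Lois Lane"}, "timestamp": 1613121880},
    {"action": "delete", "table": "customers",
     "data": {"id": 124}, "timestamp": 1613121881},
]

search_index: dict = {}
action_counts: Counter = Counter()

# Every consumer reads the same events and builds its own view of the data.
for event in events:
    apply_to_search_index(search_index, event)
    apply_to_analytics(action_counts, event)
```

Neither consumer queries the source database directly. Each one builds its view purely from the event stream.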

Complexities in CDC

Change data capture has been around since before 2000 [3] but, until recently, had not seen wide adoption beyond enterprises with complex data requirements.

One possible reason for the slow adoption is the set of complexities that real-world implementations of CDC cannot avoid, including the following:

Database schema changes

It’s easy to imagine how a CDC system might format a stream of changes to DATA, for example:

{
  "action": "insert",
  "table": "customers",
  "data": {
    "id": 123,
    "name": "Clark Kent",
    "address": "180 Main Street"
  },
  "timestamp": 1613121879
}

But what about changes to STRUCTURE (schema)?

What should CDC do when an administrator adds a column, or even worse, changes the datatype of a column in the upstream database?
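There is no single right answer, but a sketch helps show where the difficulty lies. The hypothetical `check_schema` function below compares each incoming row against the column types seen so far. It accepts new columns and reports type changes. This is one possible policy chosen for illustration, not the behavior of any particular CDC tool.

```python
def check_schema(known: dict[str, type], row: dict) -> list[str]:
    """Compare a row against the known column types and report any drift.

    Policy assumed for this sketch: new columns are accepted and remembered,
    type changes are only reported, and NULL (None) values are ignored.
    """
    problems = []
    for column, value in row.items():
        if value is None:
            continue
        if column not in known:
            known[column] = type(value)
            problems.append(f"new column: {column}")
        elif not isinstance(value, known[column]):
            problems.append(
                f"type change: {column} "
                f"{known[column].__name__} -> {type(value).__name__}"
            )
    return problems


known_columns = {"id": int, "name": str, "address": str}

unchanged = check_schema(
    known_columns, {"id": 123, "name": "Clark Kent", "address": "180 Main Street"})
added = check_schema(
    known_columns,
    {"id": 124, "name": "Lois Lane", "address": "1 Example Road",
     "email": "lois@example.com"})
retyped = check_schema(
    known_columns, {"id": "125", "name": "Jimmy Olsen", "address": "2 Example Road"})
```

Detecting the drift is the easy part. Deciding whether downstream consumers should stop, coerce the value, or migrate their own schema is what makes schema changes hard.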

Catching up CDC on first start

CDC is typically introduced long after the upstream database has started to store data. This means that when it first starts, the CDC system has to catch up to the current feed of changes.

It’s usually neither possible nor practical to catch up by replaying the entire history of changes to a database. Instead, CDC runs a batch task on first start to bulk load just the most recent state of each row.

For example, the catch-up process in Debezium works roughly as follows:

  1. Set a cursor on a very recent database transaction in the write-ahead log, and make sure the database doesn’t flush the log until the catch-up process completes.
  2. Take a snapshot of the database UP TO the cursor from step one.
  3. Translate each row from the snapshot into an INSERT event in the CDC system output stream.
  4. Begin streaming recent transactions starting with the cursor from step one, eventually catching up to real time.
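The four steps above can be simulated with a small in-memory stand-in for the source database. This is not Debezium code and does not use its API. `SourceDatabase`, `start_catch_up`, and `catch_up_events` are hypothetical names, and log retention (the "don't flush" part of step one) is not modeled.

```python
class SourceDatabase:
    """Toy database: a table of rows plus an ordered log of every write."""

    def __init__(self):
        self.rows = {}   # current state, keyed by row id
        self.log = []    # stand-in for the write-ahead log

    def write(self, action, row):
        self.log.append({"action": action, "data": row})
        if action == "delete":
            self.rows.pop(row["id"], None)
        else:  # insert or update (an update replaces the whole row here)
            self.rows[row["id"]] = row


def start_catch_up(db):
    # Step 1: remember the current position in the log.
    cursor = len(db.log)
    # Step 2: copy the table as it stands at that position.
    snapshot = [dict(row) for row in db.rows.values()]
    return cursor, snapshot


def catch_up_events(db, cursor, snapshot):
    # Step 3: emit every snapshot row as an INSERT event.
    for row in snapshot:
        yield {"action": "insert", "data": row}
    # Step 4: stream the transactions recorded since the cursor.
    yield from db.log[cursor:]


db = SourceDatabase()
db.write("insert", {"id": 1, "name": "Clark Kent"})
db.write("insert", {"id": 2, "name": "Lois Lane"})

cursor, snapshot = start_catch_up(db)

# This write arrives while the snapshot is still being processed.
db.write("update", {"id": 1, "name": "Kal-El"})

events = list(catch_up_events(db, cursor, snapshot))
actions = [event["action"] for event in events]
```

The update that arrived during the snapshot is not lost: it sits in the log after the cursor, so it is streamed right after the snapshot rows.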

The snapshot step can take hours or days on a large production database. Running it on the primary database could lock out other transactions and cause major issues for the end users and applications that depend on it.

For that reason, CDC systems in production typically point to a secondary read-only replica of the primary database.


  1. Wikipedia: Change Data Capture

  2. LinkedIn Engineering: Open Sourcing Databus

  3. Data Warehousing: Concepts, Technologies, Implementations, and Management

Change Data Capture in Depth

Go deeper into Change Data Capture via the following articles.

Types of Change Data Capture

An overview of popular change data capture software and an objective comparison of their features.

Example Use-Cases for Change Data Capture

Practical examples of scenarios and specific companies where CDC is used in production.

Change Data Capture in Microsoft SQL Server

A brief overview of how Change Data Capture works with Microsoft SQL Server databases.

Change Data Capture in MySQL

A brief overview of how Change Data Capture works with MySQL databases.

Change Data Capture in PostgreSQL

A brief overview of how Change Data Capture works with PostgreSQL databases.

Change Data Capture Software

An overview of popular change data capture software and an objective comparison of their features.

External Resources

Below is a curated list of high-quality external resources on the topic of Change Data Capture.