Unlocking Change Data Capture in ADF: New Feature Explained

Change Data Capture (CDC) in Azure Data Factory



Introduction

This blog post introduces an exciting new feature in Azure Data Factory (ADF): Change Data Capture (CDC), a long-awaited capability that significantly enhances real-time data integration scenarios.

While ADF has previously supported CDC through Data Flows, this newly introduced UI-based CDC feature brings a more streamlined and user-friendly experience. However, it's important to note that there are key differences between the two implementations:

  1. CDC in Data Flows follows a pattern of performing a full initial load, followed by incremental updates, allowing users to establish a historical baseline before applying changes.

  2. In contrast, the CDC feature in the ADF UI is designed specifically for incremental loads only, and it offers near real-time synchronization by pushing changes directly to the sink as they occur.

Additionally, with Data Flows, changes are applied based on checkpoint-based pipeline executions, whereas the new UI-based CDC enables faster and more immediate updates, making it ideal for real-time data movement scenarios.

Let's jump to the practical implementation of CDC:-

Prerequisites:-

  • An Azure SQL Database (for PoC purposes, General Purpose - Serverless: Gen5, 1 vCore is recommended).
  • CDC must be enabled at both the database level and the table level in the Azure SQL database.
  • A database named "testdb" has been created.
The picture below represents the configuration needed at the database level for the "testdb" database.


Fig:1 CDC Enabling at Database Level


SQL Code:-

use testdb;
exec sp_cdc_enable_db;
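Before moving on, it can be worth confirming that the database-level flag actually took effect. A quick sanity check using the standard `sys.databases` catalog view:

```sql
-- Check whether CDC is enabled for the current database (1 = enabled)
SELECT name, is_cdc_enabled
FROM sys.databases
WHERE name = 'testdb';
```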


The same needs to be done at the table level as well, so we create a table named "source" in testdb and enable CDC on it. The SQL code below shows how.

Fig:2 CDC Enabling at Table Level

SQL Code:-

create table source(id int primary key,
name varchar(255));

insert into source values (1,'Alex'),(2,'Kevin');

EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',         -- Replace with your schema
@source_name   = N'source',     -- Replace with your table name
@role_name     = NULL,           -- Or specify a DB role for access
@supports_net_changes = 1;       -- Set to 1 if you want net change support

select * from sys.tables
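To confirm that a capture instance was created for the table, you can query the table catalog and use the built-in help procedure. The sketch below assumes the default capture instance name (`dbo_source`) generated by the enable call above:

```sql
-- 1 = the table is tracked by CDC
SELECT name, is_tracked_by_cdc
FROM sys.tables
WHERE name = 'source';

-- Lists capture instances, captured columns, and index information
EXEC sys.sp_cdc_help_change_data_capture
    @source_schema = N'dbo',
    @source_name   = N'source';
```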

Similarly, we will create our destination table as well; please find the code below.

SQL Code:-

create table destination(id int primary key,
name varchar(255));


The following picture displays the overall data in the source and destination tables. For simplicity, we have named our source and destination tables "source" and "destination" respectively.

Fig:3 Table view of source & destination tables


Now we know that there are 2 records in the source and 0 records in the destination. One thing to note here is that both tables must have a primary key column, as CDC relies on it to track and apply changes.
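If a table was created without a primary key, one can be added afterwards. A hypothetical example (the constraint name `PK_destination` is just an illustrative choice):

```sql
-- Hypothetical: add a primary key to a table created without one
ALTER TABLE destination
ADD CONSTRAINT PK_destination PRIMARY KEY (id);
```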


Now let's begin implementing CDC by creating the CDC object in ADF:-

Step-1: 

Create a linked service to connect to the Azure SQL database.


Fig:4 Linked Service Configuration

Step-2:

After a successful test connection, go back to the Author menu and select the option to create a new CDC.


Fig:5 CDC Creation in ADF



Step-3

Upon clicking "New CDC," you will be prompted to enter the CDC name and database name. After entering these details, a list of tables with CDC enabled will be displayed. As noted in the code above, only the "source" table has CDC enabled in the Azure SQL database, so only the source table appears in the prompt below, which is sufficient for this walkthrough.

Fig:6 CDC Source Configuration




Step-4: 

The next step is to configure the destination, as shown in the image below.

Fig:7 CDC Target Configuration


Step-5:

Configure the key column by selecting the primary key. One thing to note: if you encounter an error such as
Failed to import schema for Source table 'dbo.source' with error:
{"_body":"{\n  \"error\": {\n    \"code\": \"Failed\",\n    \"message\": \"Spark job failed: {\\n  \\\"text/plain\\\": \\\"{\\\\\\\"runId\\\\\\\":\\\\\\\"35a6dd58-bf5a-40e8-b825-b728a3a2b754\\\\\\\",\\\\\\\"sessionId\\\\\\\":\\\\\\\"345b646c-42e1-47dc-8edf-d6231de75e17\\\\\\\",\\\\\\\"status\\\\\\\":\\\\\\\"Failed\\\\\\\",\\\\\\\"payload\\\\\\\":{\\\\\\\"statusCode\\\\\\\":400,\\\\\\\"shortMessage\\\\\\\":\\\\\\\"java.lang.Exception: DataFlowManagerClient.getPayload: 400\\\\\\\",\\\\\\\"detailedMessage\\\\\\\":\\\\\\\"Failure 2025-04-21 12:53:03.829 failed }

with error code 9070, simply retry a couple of times; it usually succeeds.

This bug may be due to the feature currently being in preview in ADF.


The image below shows the final configuration, where we choose the ID column as the key so that changes in the source are tracked and replicated to the sink.

Fig:8 CDC Key Column Setup Configuration



Step-6:

Set the latency to "Real time" and publish all. Once published, the start button will be enabled; click Start to enable CDC in ADF.


Fig:9 CDC Real-time Latency Configuration



Finally, CDC is now enabled in ADF, and we can test it.

Initially the data present in the source is as follows:-

Fig:10 Initial Data (Source)

And destination doesn't have any data.

Now, let's insert two rows into the source; the SQL code is shown below:-

insert into source values (3,'Gabriel'),(4,'Mike');
select * from source;
select * from destination;

After inserting two records into the source and checking the number of records in the destination, we make an interesting observation: the source now has 4 records (2 initial + 2 new), but the destination has only the 2 new records, with IDs 3 and 4.
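Independently of ADF, you can also inspect what the capture process recorded on the SQL side. The sketch below uses the standard CDC table-valued function for the default capture instance `dbo_source`; the `__$operation` column encodes the change type (1 = delete, 2 = insert, 3/4 = update before/after images):

```sql
-- Read all changes captured for dbo.source over the full available LSN range
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_source');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation, id, name
FROM cdc.fn_cdc_get_all_changes_dbo_source(@from_lsn, @to_lsn, N'all');
```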

Fig:11 The New Updated Status Of Records in (Source & Destination)

To summarize, the new CDC (preview) feature in Azure Data Factory is designed to process only the captured changes and does not include an initial full load. Therefore, to handle any backfill scenario, it is necessary to use a Copy Data activity to perform the initial load into the sink.

Once the historical data is in place, the CDC feature can then be enabled to continue capturing and processing incremental changes from that point onward.
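The Copy Data activity is the general approach, but when the source and destination happen to live in the same database, as in this PoC, the backfill step can be sketched in plain SQL. This is only an illustration, not a replacement for the Copy activity across separate stores:

```sql
-- One-time backfill: copy rows that are not yet in the destination
INSERT INTO destination (id, name)
SELECT s.id, s.name
FROM source AS s
WHERE NOT EXISTS (SELECT 1 FROM destination AS d WHERE d.id = s.id);
```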

Let's try different CRUD operations to test the flexibility of this CDC feature:-


We shall update the name of ID 1 from 'Alex' to 'Hogward' and delete the row with ID = 4. The SQL code is below:-

UPDATE source
SET name = 'Hogward'
WHERE id = 1;

delete from source where id=4;
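Because the table was enabled with `@supports_net_changes = 1`, you can also ask CDC for the net effect of these operations per row, which collapses multiple changes to the same key into one. Again this assumes the default capture instance name `dbo_source`:

```sql
-- Net changes per row over the full LSN range (requires @supports_net_changes = 1)
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_source');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation, id, name
FROM cdc.fn_cdc_get_net_changes_dbo_source(@from_lsn, @to_lsn, N'all');
```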


The picture below shows what actually happened: the update was not reflected in the destination, because ID = 1 does not exist in the destination table; since there was no matching row, the change was not upserted.

The delete operation, however, was reflected in the sink, as ID = 4 was present in the destination table as well.

Fig:12 Final CRUD Operation Results in (Source & Destination)


The changes can be monitored in Azure itself by navigating to the "Monitor" tab and choosing the CDC option; the bar chart represents the changes made and can be viewed in isolation.

Fig:13 Monitoring Section(Track Changes)



We can conclude this blog post by saying that the CDC feature, as a separate ADF object, can be highly beneficial if we understand its functionality, how it aligns with our business needs, and its current features and limitations.
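As a final housekeeping note for PoC environments, CDC can be disabled again once testing is done. The sketch below uses the standard system procedures and assumes the capture instance created earlier:

```sql
-- Disable CDC for the table ('all' removes every capture instance on it)
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'source',
    @capture_instance = N'all';

-- Disable CDC for the whole database
EXEC sys.sp_cdc_disable_db;
```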

Thank you for reading.

If you enjoyed this blog, feel free to share, subscribe, like, or leave a comment!

Stay tuned for more such feats;)

