I have a Debezium instance reading CDC events from a database for several different tables and publishing those messages to the same Kafka/Azure Event Hubs topic. A scaled-out service consumes these messages and logs debug statements with a timestamp and source data table.
Each pod only processes one message at a time – they are single-threaded. I want to know how long it takes a pod to process each type of message for performance evaluation reasons. That is – I want to query for the difference in time between each “Received” message, grouped by pod, in the consumer service’s AKS deployment. With help from this answer I figured it out for individual pods, but I would love to get a view across all the pods with a single query.
Here is an example Kusto query & results for the raw logs I am interested in:
ContainerLogV2
| where TimeGenerated > ago(2h)
| where PodName startswith "<common pod prefix>"
| where LogMessage contains "received:"
| project TimeGenerated, LogMessage, PodName
TimeGenerated | LogMessage | PodName |
---|---|---|
6/21/2024, 1:34:27 PM | Received: {“ID”: <integer>, “__dbz__physicalTableIdentifier”: “<database alias>.dbo.TABLE_ONE”} | <common pod prefix>-<unique pod suffix A> |
6/21/2024, 1:33:27 PM | Received: {“ID”: <integer>, “__dbz__physicalTableIdentifier”: “<database alias>.dbo.TABLE_TWO”} | <common pod prefix>-<unique pod suffix B> |
Based on the question linked above, I was able to get this data for a single pod at a time with the following query:
ContainerLogV2
| where TimeGenerated > ago(2h)
| where PodName startswith "<common pod prefix>"
| where PodName endswith "<unique pod suffix A>"
| where LogMessage contains "received:"
| order by TimeGenerated desc
| extend next_tg = next(TimeGenerated)
| parse LogMessage with * "dbo." TABLE_NAME '"}' *
| project TimeGenerated,
// LogMessage,
TABLE_NAME,
// PodName,
diff = case(true, abs(next_tg - TimeGenerated)/1s, double(null))
| where isnotnull(diff)
Which produces data like:
| TimeGenerated | TABLE_NAME | diff |
| ————- | ———- | ——- |
| 6/21/2024, 1:34:27 PM | TABLE_ONE | 60 |
| 6/21/2024, 1:33:27 PM | TABLE_TWO | 1 |
| 6/21/2024, 1:33:26 PM | TABLE_THREE | .2538925 |
From this I can gather messages from TABLE_ONE take a minute to process, so that would be a good target for performance investigation, for example.
Where I am stuck is I do not know how to calculate this difference correctly when the <unique-pod-suffix> filter is removed. I can no longer just calculate the time difference from the previous log record, it has to be the difference from the previous time record from the same pod. I don’t know how to do that grouping and then flatting back out into a single results table correctly.
Any assistance is appreciated, thanks!