I have a Debezium instance reading CDC events from a database for several different tables and publishing those messages to the same Kafka/Azure Event Hubs topic. A scaled-out service consumes these messages and logs debug statements with a timestamp and source data table.
Each pod only processes one message at a time – they are single-threaded. I want to know how long it takes a pod to process each type of message for performance evaluation reasons. That is – I want to query for the difference in time between each “Received” log record, grouped by pod, in the consumer service’s AKS deployment. With help from this answer I figured it out for individual pods, but I would love to get a view across all the pods with a single query.
Here is an example Kusto query & results for the raw logs I am interested in:
ContainerLogV2
| where TimeGenerated > ago(2h)
| where PodName startswith "<common pod prefix>"
| where LogMessage contains "received:"
| project TimeGenerated, LogMessage, PodName
TimeGenerated | LogMessage | PodName |
---|---|---|
6/21/2024, 1:34:27 PM | Received: {“ID”: <integer>, “__dbz__physicalTableIdentifier”: “<database alias>.dbo.TABLE_ONE”} | <common pod prefix>-<unique pod suffix A> |
6/21/2024, 1:33:27 PM | Received: {“ID”: <integer>, “__dbz__physicalTableIdentifier”: “<database alias>.dbo.TABLE_TWO”} | <common pod prefix>-<unique pod suffix B> |
Based on the question linked above, I was able to get this data for a single pod at a time with the following query:
ContainerLogV2
| where TimeGenerated > ago(2h)
| where PodName startswith "<common pod prefix>"
| where PodName endswith "<unique pod suffix A>"
| where LogMessage contains "received:"
| order by TimeGenerated desc
| extend prev_tg = prev(TimeGenerated)
| parse LogMessage with * "dbo." TABLE_NAME '"}' *
| project TimeGenerated,
// LogMessage,
TABLE_NAME,
// PodName,
diff = case(true, abs(prev_tg - TimeGenerated)/1s, double(null))
| where isnotnull(diff)
Which produces data like:
TimeGenerated | TABLE_NAME | diff |
---|---|---|
6/21/2024, 1:34:27 PM | TABLE_ONE | 60 |
6/21/2024, 1:33:27 PM | TABLE_TWO | 1 |
6/21/2024, 1:33:26 PM | TABLE_THREE | .2538925 |
From this I can gather messages from TABLE_ONE take a minute to process, so that would be a good target for performance investigation, for example.
Where I am stuck is I do not know how to calculate this difference correctly when the <unique-pod-suffix> filter is removed. I can no longer just calculate the time difference from the previous log record, it has to be the difference from the previous log record from the same pod, since all pods are logging into the same container/table “ContainerLogV2”. I don’t know how to do that grouping and then flatting back out into a single results table correctly.
Any assistance is appreciated, thanks!