I am trying to work out how to count the documents created by a Transform Job.
Background
Let’s say I have an index of alerts containing, among others, two fields: the user ID and the ID of the server the user operates on. One user can operate on multiple servers, e.g.:
{id: ..., user: "Mark", server: "hostname-1"}
{id: ..., user: "John", server: "hostname-2"}
{id: ..., user: "Mark", server: "hostname-2"}
{id: ..., user: "Mark", server: "hostname-2"}
(Note the last two entries: there can be multiple entries for the same user and server.)
I need to calculate:
- how many distinct servers there are;
- how many distinct users operated on a specific server.
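To make the expected numbers concrete, here is a small Python sketch that computes both values directly from the four sample alerts above. It is plain in-memory Python rather than an OpenSearch query, and the variable names are only illustrative.

```python
from collections import defaultdict

# The four sample alerts from above (ids omitted).
alerts = [
    {"user": "Mark", "server": "hostname-1"},
    {"user": "John", "server": "hostname-2"},
    {"user": "Mark", "server": "hostname-2"},
    {"user": "Mark", "server": "hostname-2"},
]

# Collect the set of distinct users seen on each server.
users_by_server = defaultdict(set)
for alert in alerts:
    users_by_server[alert["server"]].add(alert["user"])

distinct_servers = len(users_by_server)
distinct_users_per_server = {
    server: len(users) for server, users in users_by_server.items()
}

print(distinct_servers)           # 2
print(distinct_users_per_server)  # {'hostname-1': 1, 'hostname-2': 2}
```

The duplicate Mark/hostname-2 alert does not change either result, which is the behaviour wanted from the index as well.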
What I did
So far I have created a Transform Job that runs continuously. It removes duplicate entries by grouping on the user and server fields:
PUT _plugins/_transform/user_server
{
  "transform": {
    "enabled": true,
    "continuous": true,
    "schedule": {
      "interval": {
        "period": 1,
        "unit": "Minutes",
        "start_time": 1718277152
      }
    },
    "description": "Getting unique user-server combinations",
    "source_index": "alerts",
    "target_index": "alerts-groups",
    "page_size": 300,
    "groups": [
      {
        "terms": {
          "source_field": "user",
          "target_field": "user"
        }
      },
      {
        "terms": {
          "source_field": "server",
          "target_field": "server"
        }
      }
    ]
  }
}
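As a rough illustration of what this grouping does, the following Python sketch collapses the sample alerts into one record per unique (user, server) pair. It is a simplified in-memory model, not the plugin's implementation; the real job also writes fields such as transform._id, and the count of collapsed alerts ends up in _doc_count, as the search response further down shows.

```python
from collections import Counter

# The four sample alerts from the Background section (ids omitted).
alerts = [
    {"user": "Mark", "server": "hostname-1"},
    {"user": "John", "server": "hostname-2"},
    {"user": "Mark", "server": "hostname-2"},
    {"user": "Mark", "server": "hostname-2"},
]

# Group by (user, server), mirroring the two "terms" groups of the job.
pair_counts = Counter((a["user"], a["server"]) for a in alerts)

# One output document per unique pair; _doc_count records how many
# source alerts were collapsed into it.
grouped_docs = [
    {"user": user, "server": server, "_doc_count": count}
    for (user, server), count in pair_counts.items()
]

for doc in grouped_docs:
    print(doc)
```

Four alerts become three grouped documents, with the duplicated Mark/hostname-2 pair carrying a count of 2.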
Counting the distinct users per server should then be as easy as running a terms aggregation on server and reading each bucket's doc_count value (the number of buckets gives the number of distinct servers):
GET alerts-groups/_search
{
  "size": 0,
  "aggs": {
    "servers": {
      "terms": {
        "field": "server"
      }
    }
  }
}
The Problem
It turns out that the doc_count of each bucket does not hold the number of alerts-groups documents, but the number of alerts documents matching each bucket. That’s because the Transform Job adds a _doc_count field to each generated document, which the aggregation then uses to calculate the final doc_count.
{
  ...
  "hits": {
    ...
    "hits": [
      {
        ...
        "_index": "alerts-groups",
        "_source": {
          "transform._id": "user_server",
          "_doc_count": 2,
          "transform._doc_count": 2,
          "server": "hostname-2",
          "user": "Mark"
        }
      },
      {
        ...
        "_index": "alerts-groups",
        "_source": {
          "transform._id": "user_server",
          "_doc_count": 1,
          "transform._doc_count": 1,
          "server": "hostname-2",
          "user": "John"
        }
      },
      ... // Similar entry for "Mark" and "hostname-1"
    ]
  },
  "aggregations": {
    "servers": {
      "buckets": [
        ...
        {
          "key": "hostname-2",
          "doc_count": 3 // expected 2 since there's one entry for Mark and one for John
        }
      ]
    }
  }
}
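The difference between the two counts can be reproduced with a short Python sketch. It models the behaviour described above (each bucket's doc_count is the sum of the _doc_count values of its documents) and compares it with a plain count of documents. The _doc_count of 1 for the Mark/hostname-1 entry is an assumption based on the sample data, since that entry is elided in the response.

```python
from collections import defaultdict

# Transformed documents as stored in alerts-groups (fields trimmed).
groups = [
    {"user": "Mark", "server": "hostname-1", "_doc_count": 1},  # assumed value
    {"user": "John", "server": "hostname-2", "_doc_count": 1},
    {"user": "Mark", "server": "hostname-2", "_doc_count": 2},
]

reported = defaultdict(int)  # what the terms aggregation returns
expected = defaultdict(int)  # number of alerts-groups documents per server

for doc in groups:
    reported[doc["server"]] += doc["_doc_count"]  # sums the stored counts
    expected[doc["server"]] += 1                  # counts each document once

print(dict(reported))  # {'hostname-1': 1, 'hostname-2': 3}
print(dict(expected))  # {'hostname-1': 1, 'hostname-2': 2}
```

For hostname-2 the summed value is 3 while the document count is 2, matching the mismatch seen in the aggregation response.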
The Question
Is there a way to calculate the actual number of documents inserted by the Transform Job?
References:
- Terms – OpenSearch Documentation
- [BUG] Add _doc_count in root of transform documents · Issue #556 · opensearch-project/index-management · GitHub