We have thousands of IoT devices sending measurements throughout the day. These are stored in a MongoDB instance as a simple time series, each with a value, a timestamp and an identifier.
Example data could look like this:
{
  "observed_at": {
    "$date": "2024-04-03T16:52:56.217Z"
  },
  "meta": {
    "snid": "xyz",
    "type": "total_charge_power"
  },
  "_id": {
    "$oid": "....."
  },
  "value": 100
}
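For context, a collection holding documents like this could be declared as a MongoDB time series collection along the lines of the sketch below (the collection name "measurements" and the granularity are only placeholders):

// Hypothetical setup; timeField/metaField match the document shape above
db.createCollection("measurements", {
  timeseries: {
    timeField: "observed_at",
    metaField: "meta",
    granularity: "seconds"
  }
})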
We need to be able to produce a report for a given day at one-second resolution, showing the summed consumption across all devices for each second.
Some examples of the data:
DeviceA:
16:52:56 : 100watt
17:01:15 : 0watt
DeviceB:
16:56:30 : 50watt
16:58:38 : 100watt
17:00:04 : 0watt
This is interpreted as “DeviceA used 100watt each second between 16:52:56 and 17:01:15, and DeviceB used 50watt each second from 16:56:30 until 16:58:38, where it ramps up to 100watt each second until 17:00:04”.
The calculation for these two devices would be:
16:52:56 : 100watt
16:52:57 : 100watt
...
16:56:30 : 150watt
16:56:31 : 150watt
...
16:58:38 : 200watt
16:58:39 : 200watt
...
17:00:04 : 100watt
17:00:05 : 100watt
...
17:01:14 : 100watt
17:01:15 : 0watt
Our initial thought was to use $densify and $fill, but you quickly realize that the number of documents gets too big for this to be performant. Let's say we have 5000 devices; when running $densify, this would give us 5000 * 60 * 60 * 24 = 432,000,000 documents in the calculation each day, and that only grows as more devices are added. $densify bails out at 500,000 documents with default settings. That limit could maybe be raised, or the day could be processed in smaller chunks, but with 5000 devices a chunk can only cover roughly 100 seconds (5000 * 100 = 500,000) before hitting the limit, so it's not really viable as a scalable solution.
The other solution we're thinking about would be to use $shift, turning each document into a span from one timestamp to the next, with the value that applies between those two timestamps. Given the above example, we would end up with something like this:
// timestamps slimmed down to only hour, minute and seconds
{
  from: null,
  until: "16:56:30",
  value: 0
},
{
  from: "16:56:30",
  until: "16:58:38",
  value: 50
},
{
  from: "16:58:38",
  until: "17:00:04",
  value: 100
},
{
  from: "17:00:04",
  until: null,
  value: 0
}
From this, we then need to calculate each second in a loop (preferably in MongoDB), picking out the measurements with some sort of “between timestampA and timestampB” lookup.
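To illustrate what that lookup means for a single second t, a minimal sketch could look like this (assuming the spans are stored with full ISODate values in a collection called "spans", both of which are placeholders):

// Hypothetical: sum the power of every span that covers the second t.
const t = ISODate("2024-04-03T16:57:00Z");

db.spans.aggregate([
  // keep the spans whose interval contains t (open-ended spans use null)
  { $match: {
      $and: [
        { $or: [{ from: null }, { from: { $lte: t } }] },
        { $or: [{ until: null }, { until: { $gt: t } }] }
      ]
  } },
  // add up the matching devices' values
  { $group: { _id: t, sum_power_w: { $sum: "$value" } } }
])

Running something like this 86,400 times per day is obviously not attractive, which is why we're hoping the whole thing can be expressed in a single pipeline.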
The first solution would be something like this:
[
  {
    $densify: {
      /*
       * gap-fill one document per second within the period
       */
      field: "from",
      partitionByFields: ["snid"],
      range: {
        step: 1,
        unit: "second",
        bounds: [
          ISODate("2024-05-23T01:00:00Z"),
          ISODate("2024-05-23T01:01:00Z") // Notice 1 minute gap only
        ]
      }
    }
  },
  {
    $fill: {
      /*
       * fill the missing values using locf
       */
      partitionByFields: ["snid"],
      sortBy: {
        from: 1
      },
      output: {
        value: {
          method: "locf"
        }
      }
    }
  },
  {
    $group: {
      /*
       * sum the per-device values for each second
       */
      _id: "$from",
      sum_power_w: {
        $sum: "$value"
      }
    }
  }
]
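To make the chunking workaround mentioned above concrete, this is roughly what we imagine: run the pipeline once per small time window and collect the results afterwards (a rough sketch; "measurements" is a placeholder name and the field names assume the same pre-processing as the pipeline above):

// With ~5000 devices, roughly 100 seconds per window is already the
// ceiling before $densify generates 500,000 documents.
const dayStart = ISODate("2024-05-23T00:00:00Z");
const windowSeconds = 60;

const perSecondTotals = [];
for (let offset = 0; offset < 24 * 60 * 60; offset += windowSeconds) {
  const from = new Date(dayStart.getTime() + offset * 1000);
  const until = new Date(from.getTime() + windowSeconds * 1000);

  // same $densify / $fill / $group pipeline as above,
  // just with this window's bounds swapped in
  perSecondTotals.push(...db.measurements.aggregate([
    { $densify: {
        field: "from",
        partitionByFields: ["snid"],
        range: { step: 1, unit: "second", bounds: [from, until] }
    } },
    { $fill: {
        partitionByFields: ["snid"],
        sortBy: { from: 1 },
        output: { value: { method: "locf" } }
    } },
    { $group: { _id: "$from", sum_power_w: { $sum: "$value" } } }
  ]).toArray());
}

Even then, each window would also need the last known reading per device from before the window for locf to have anything to carry forward, so the moving parts add up quickly.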
The second solution would use something like this, and the result would then either be handled in the application layer, or in MongoDB, if someone knows how:
[
  ...,
  {
    $setWindowFields: {
      /*
       * calculate the end timestamp of each reading
       */
      partitionBy: "$_id.snid",
      sortBy: {
        "_id.from": 1
      },
      output: {
        until: {
          $shift: {
            output: "$_id.from",
            by: 1
          }
        }
      }
    }
  }
]
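If it cannot be done in MongoDB alone, the application-layer part we have in mind would be something like this rough sketch in plain JavaScript (assuming "spans" is the pipeline output as an array of { from, until, value } with Date values, and that a null until is clamped to the end of the reporting day):

// Hypothetical post-processing: expand each span into its seconds and
// accumulate the per-second totals across all devices.
function sumPerSecond(spans, dayEnd) {
  const totals = new Map(); // epoch second -> summed watt

  for (const { from, until, value } of spans) {
    if (from === null) continue; // span before the device's first reading
    const start = Math.floor(from.getTime() / 1000);
    const end = Math.floor((until ?? dayEnd).getTime() / 1000);

    for (let sec = start; sec < end; sec++) {
      totals.set(sec, (totals.get(sec) ?? 0) + value);
    }
  }
  return totals;
}

This works, but it effectively moves the 432,000,000 iterations into the application instead of the database, which is why we would prefer to keep it in MongoDB.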
So the question is:
How do you handle interpolation at this scale in MongoDB?
And… Is it possible to calculate solution 2 in MongoDB alone (without the application layer)?
Hope this makes sense – if not, please let me know so I can provide the needed information.