I am connecting to MongoDB Atlas using mongodb-spark-connector v10.3
I am using a SamplePartitioner with a pipeline that matches on a certain date range and projects only the _id field:
[{"$match": {"dateReceived": {"$gte": "2024-07-12T05:00:00.000000Z", "$lte": "2024-07-12T09:15:00.000000Z"}}}, {"$project": {"_id": 1}}]
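For reference, this is roughly how I configure the read (a minimal PySpark sketch; the URI, database, and collection names below are placeholders, not my real values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mongo-read").getOrCreate()

# The same pipeline as above, passed to the connector as an extended-JSON string.
pipeline = """[
  {"$match": {"dateReceived": {"$gte": "2024-07-12T05:00:00.000000Z",
                               "$lte": "2024-07-12T09:15:00.000000Z"}}},
  {"$project": {"_id": 1}}
]"""

df = (spark.read.format("mongodb")
      # placeholder Atlas URI / database / collection
      .option("connection.uri", "mongodb+srv://<user>:<pass>@<cluster>.mongodb.net")
      .option("database", "<mydb>")
      .option("collection", "<mycoll>")
      # SamplePartitioner (the default in v10), specified explicitly
      .option("partitioner", "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner")
      .option("aggregation.pipeline", pipeline)
      .load())
```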
Since I am using the SamplePartitioner, the mongodb-spark-connector automatically appends a per-partition _id boundary filter to the $match stage, and the resulting pipeline becomes:
[
  {
    "$match": {
      "$and": [
        {
          "_id": {
            "$gte": "ca8866b865271263919bb74cb355cdf5955de54d80123368c9409ad12d622bcd",
            "$lt": "d039be1aaf69134b5b24009ca41d5763090bb8b77bd923b5ae3b38c6624d28c9"
          }
        },
        {
          "dateReceived": {
            "$gte": "2024-07-12T05:00:00.000000Z",
            "$lte": "2024-07-12T09:15:00.000000Z"
          }
        }
      ]
    }
  },
  {
    "$project": {
      "_id": true
    }
  }
]
The issue, as I see it, is that the _id boundaries are derived from the documents the partitioner samples, rather than from the true minimum and maximum _id values, so the generated partition ranges may not cover every record that matches the date filter.
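This is roughly how I check whether rows are being dropped (a sketch, assuming pymongo is available; the URI, database, and collection names are placeholders, and `df` is the DataFrame loaded via the connector as shown above):

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<pass>@<cluster>.mongodb.net")
coll = client["<mydb>"]["<mycoll>"]

# Same date filter as the $match stage in the pipeline.
date_filter = {"dateReceived": {"$gte": "2024-07-12T05:00:00.000000Z",
                                "$lte": "2024-07-12T09:15:00.000000Z"}}

mongo_count = coll.count_documents(date_filter)  # count straight from MongoDB
spark_count = df.count()                         # rows returned through the connector

print(mongo_count, spark_count)  # if these differ, some documents were not read
```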
Is my understanding correct?
Has anyone else encountered this issue where the data returned from MongoDB is incomplete?