I have two MongoDB documents:
Customer Document:
@Document(collection = "customers")
@Setter
@Getter
public class Customer {

    @Id
    private String id;

    @Field("purchaseIds")
    @Indexed
    private List<String> purchaseIds = new ArrayList<>();

    // other fields
}
Purchase Document:
@Document(collection = "purchases")
@Setter
@Getter
public class Purchase {

    @Id
    private String id;

    // other fields
}
Each customer can have many purchases. In the database, there are more than a hundred million records for both customers and their purchases. I need to delete a set of purchases based on certain criteria, and this process can be very performance-intensive.
I am using a batch approach to delete purchases. Here is my current implementation:
private static Criteria getCustomersPurchaseDeleteCriteria() {
    return Criteria.where("isProcessed").is(true);
}

private void removePurchasesForMatchingCustomers(Criteria customerCriteria) {
    int batchSize = chunkSize; // page size for customer pagination
    ObjectId lastId = null;
    boolean hasMoreResults = true;
    while (hasMoreResults) {
        // Find the next page of matching customers whose purchases are about to be deleted
        Query query = new Query(customerCriteria)
                .limit(batchSize)
                .with(Sort.by(Sort.Direction.ASC, "_id"));
        if (Objects.nonNull(lastId)) {
            query.addCriteria(Criteria.where("_id").gt(lastId));
        }
        // Apply projection to only fetch the purchaseIds field
        query.fields().include("purchaseIds");
        long start = System.currentTimeMillis();
        List<String> deletingPurchaseIds = new ArrayList<>();
        List<Customer> matchingCustomers = mongoTemplate.find(query, Customer.class);
        for (Customer customer : matchingCustomers) {
            deletingPurchaseIds.addAll(customer.getPurchaseIds());
            // Keep track of the last processed customer id for keyset pagination
            lastId = new ObjectId(customer.getId());
        }
        log.info("time for finding and extracting ids from matching customers: {} sec",
                (System.currentTimeMillis() - start) / 1000);
        // Terminate on the customer page size, not the purchase-id count:
        // each customer can contribute many purchase ids, so the two can differ
        hasMoreResults = matchingCustomers.size() == batchSize;
        if (deletingPurchaseIds.isEmpty()) {
            continue;
        }
        // Delete all purchases belonging to this page of customers
        Query purchaseDeleteQuery = new Query(Criteria.where("_id").in(deletingPurchaseIds));
        start = System.currentTimeMillis();
        // Create a fresh BulkOperations per batch; an instance should not be reused after execute()
        BulkOperations bulkOperations = mongoTemplate.bulkOps(BulkOperations.BulkMode.UNORDERED, Purchase.class);
        bulkOperations.remove(purchaseDeleteQuery);
        BulkWriteResult result = bulkOperations.execute();
        log.info("time for deleting batch of purchases: {} sec and deleted size: {}",
                (System.currentTimeMillis() - start) / 1000, result.getDeletedCount());
    }
    log.info("finished purchases removal for matched customers.");
}
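Since each customer can hold many purchase ids, the `$in` list built for a single delete can grow far beyond the page size. A variant I considered is to cap the size of each delete by splitting the collected ids into fixed-size chunks before issuing the removes. The chunking itself has no MongoDB dependency, so here is a minimal standalone sketch (the `partition` helper is illustrative, not part of the code above):

```java
import java.util.ArrayList;
import java.util.List;

public class IdPartitioner {

    // Split a list of ids into consecutive chunks of at most chunkSize elements,
    // so each chunk can back one bounded $in delete query.
    public static <T> List<List<T>> partition(List<T> ids, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += chunkSize) {
            chunks.add(new ArrayList<>(ids.subList(i, Math.min(i + chunkSize, ids.size()))));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            ids.add("id-" + i);
        }
        List<List<String>> chunks = partition(ids, 3);
        // 10 ids with chunkSize 3 -> chunks of sizes 3, 3, 3, 1
        System.out.println(chunks.size());        // 4
        System.out.println(chunks.get(3).size()); // 1
    }
}
```

Each chunk would then feed one `new Query(Criteria.where("_id").in(chunk))` remove, keeping every individual delete command a predictable size.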
Problem: Even with a batch size of 1000, it takes about 3 hours to process a test data set of 5 million customers and 5 million purchases. I need to optimize this process.
Questions:

- Is there a more efficient way to perform bulk deletions in MongoDB?
- Are there any improvements I can make to my current approach to speed up the deletion process?
Any help or guidance on this would be greatly appreciated. Thanks in advance!
What I Tried: I implemented a batch deletion approach where I retrieve matching customers in chunks, extract their purchaseIds, and then perform bulk deletions on the purchases. Specifically, I use a batch size of 1000 and track the last processed ID to paginate through the results.
What I Expected: I expected this method to efficiently delete millions of records within a reasonable time frame, significantly improving performance compared to a non-batch approach.
What Actually Happened: Despite using batching and bulk operations, the deletion process takes approximately 3 hours for a test case involving 5 million customers and 5 million purchases. The performance is not meeting the application’s requirements, and the process is too slow for production use.