We have a Spring scheduler in place which runs at 4 a.m. every day. This scheduler failed to run on 2 consecutive days, and when I restarted the application on the 3rd day it started running as usual.
We checked the system logs for the days when the scheduler didn't run, and found no logs at all from the thread which executes the scheduler.
The scheduler had been running as expected for the last month; it suddenly stopped running for 2 days and then resumed when the application was restarted.
I have checked the hypotheses below, but none of them gives me a concrete explanation for why the scheduler stopped.
- Multiple schedulers clashing with each other: we have 2 schedulers in the same application, one executing every minute and the other at 4:00 a.m. This shouldn't be the reason, as we have as many ThreadPoolTaskScheduler threads as schedulers, which is 2.
- The last run of the scheduler took an almost infinite time to complete, so the next run never got executed. This can't be the reason either, as we have logs showing that the last run completed successfully.
- The scheduler failed with a RuntimeException both times. This is possible, as we have a few network calls inside this scheduler, but we don't have any error logs, which should have appeared for an unhandled exception caught by TaskUtils, as given below:
  ERROR org.springframework.scheduling.support.TaskUtils$LoggingErrorHandler: Unexpected error occurred in scheduled task
- The scheduler threads were somehow busy doing something else, or blocked/stuck, so our scheduler was unable to get hold of one and never executed. Highly unlikely: first, I don't have any log or data to back up this statement. Second, it's really hard to believe that it failed to run 2 times in a row, or stopped forever and was waiting for a restart. What else would the thread be doing, and what other process would it be stuck in?
- CPU or memory reached its capacity, so the application was not able to schedule the thread to execute the scheduler at 4 a.m. I've checked the CPU; it was at around 2%, so this can't be the cause.
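To make the blocked-thread hypothesis concrete, here is a minimal sketch using the plain JDK `ScheduledExecutorService` (which, as far as I understand, is what `ThreadPoolTaskScheduler` wraps). It is not our actual Spring configuration: the class name, method name, and timings are made up for illustration. It occupies every thread of the pool with a task that never returns and then checks whether a newly due task gets to run.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of our application.
public class SchedulerStarvationSketch {

    /**
     * Blocks all threads of a scheduled pool, then schedules a probe task.
     * Returns true if the probe did not run while the pool was blocked
     * but did run once the blocked tasks were released.
     */
    public static boolean probeIsStarved(int poolSize) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(poolSize);
        CountDownLatch allBlocked = new CountDownLatch(poolSize);
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch probeRan = new CountDownLatch(1);
        try {
            for (int i = 0; i < poolSize; i++) {
                pool.execute(() -> {
                    allBlocked.countDown();
                    try {
                        // Stands in for a call that hangs (e.g. a network call without a timeout).
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            allBlocked.await(); // every pool thread is now stuck

            // The probe is due after 50 ms, but no thread is free to pick it up.
            pool.schedule(probeRan::countDown, 50, TimeUnit.MILLISECONDS);
            boolean ranWhileBlocked = probeRan.await(500, TimeUnit.MILLISECONDS);

            release.countDown(); // unblock the pool; the overdue probe can run now
            boolean ranAfterRelease = probeRan.await(5, TimeUnit.SECONDS);
            return !ranWhileBlocked && ranAfterRelease;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while probing the pool", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("probe starved: " + probeIsStarved(2));
    }
}
```

In this sketch the starved task produces no log line and no exception, which would match what we observed, but it only happens when all pool threads are stuck at the same time.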
Our scheduler:
    public void run() {
        Lock lock = redisLockRegistry.obtain(LOCK_KEY);
        boolean lockAcquired = lock.tryLock();
        log.error("Acquired lock for scheduler: {}", lockAcquired);
        if (!lockAcquired) {
            log.error("Unable to acquire lock for scheduler");
            return;
        }
        try {
            // Few DDL MySql operations
        } catch (Exception e) {
            log.error("Error occurred while running job.", e);
        } finally {
            log.info("job completed");
            lock.unlock();
        }
    }
I should have taken a thread dump of the application to get complete visibility into what was executing at that moment, but unfortunately I missed doing so. Since the application has been restarted, that state is gone and I can no longer do the analysis at thread level.
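For reference, this is roughly what I could have captured before restarting. It is a minimal sketch using only the JDK's `java.lang.management` API; the class and method names are hypothetical, and running `jstack` against the process would give similar information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; not part of our application.
public class ThreadDumpSketch {

    /** Returns the name, state, and full stack trace of every live thread as text. */
    public static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // Only request lock details when the JVM supports collecting them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            out.append('"').append(info.getThreadName()).append('"')
               .append(" state=").append(info.getThreadState()).append('\n');
            // ThreadInfo.toString() truncates the stack, so print the frames explicitly.
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

A dump like this would show whether the scheduler threads were idle and waiting for work, or stuck inside some call.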
Is there anywhere else we should look to debug what exactly happened? Or can someone share the valid reasons which could lead to this?