Say I have an application, and it writes (among other things) errors to a log file. It’s the first place the user would be asked to go and look if there was a problem. Let’s also assume that this is a critical application that cannot be allowed to just crash.
For the sake of argument we’ll also say that all exceptions are being logged as well. I’m in the boat that says this is a bad idea, but we’ll run with it for the sake of this fictional problem.
How should the application ideally handle the situation where it is completely unable to write to its log? Say due to a lack of disk space, exhausted file handle count, or a permissions/access issue.
Depending on the exact reason, of course, it may be able to partially resolve the problem itself. If there was no free disk space, for example, it might be reasonable for it to trash the current log in order to write out a line saying it’s out of disk space, at the expense of losing the log contents. This is very unhelpful, though, if something external is eating the space: the logs would simply keep getting emptied.
It may be prudent for it to log to an internal buffer, so that once the situation is resolved no data is lost – but left too long this’d simply munch on memory.
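A minimal sketch of such a buffer, assuming a hard cap on the number of held lines so it cannot munch on memory indefinitely. `BufferedLog`, its `write` callable and `max_entries` are hypothetical names for illustration, not part of any particular logging library.

```python
from collections import deque


class BufferedLog:
    """Hold log lines in memory while the real sink is failing."""

    def __init__(self, write, max_entries=1000):
        self._write = write                        # callable that may raise OSError
        self._pending = deque(maxlen=max_entries)  # oldest lines are evicted first
        self.dropped = 0                           # how many lines were lost to the cap

    def log(self, line):
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1                      # the append below evicts the oldest line
        self._pending.append(line)
        self.flush()

    def flush(self):
        # Write oldest-first; stop at the first failure and keep the rest.
        while self._pending:
            try:
                self._write(self._pending[0])
            except OSError:
                return False
            self._pending.popleft()
        return True
```

Once the sink works again, the next `log` or `flush` call drains whatever survived, and `dropped` tells you how much was lost in the meantime.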
So what do you think: how should a fault in notifying the user of an error be handled? Especially with an eye to headless/unattended applications that’d normally just be monitored by looking at their logs.
One way is to maintain a ‘system’ log file that usually is kept empty, but is pre-allocated. Then, for truly serious errors, you write the error to this location that is guaranteed to be available (well…) and has enough space for messages to be written to it. This should use a different, and very simple, error-logging system to avoid errors in the logging subsystem.
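A rough sketch of the pre-allocated file idea, assuming a fixed-size reserve that is overwritten in place. The function names and `RESERVED_BYTES` are made up for illustration. Note the "well…" applies here too: on copy-on-write or compressing filesystems an in-place overwrite can still need fresh blocks, and opening the file needs a free file handle, so a real version might keep the handle open from startup.

```python
import os

RESERVED_BYTES = 64 * 1024  # assumed size of the reserve


def preallocate(path, size=RESERVED_BYTES):
    """Create the reserve at startup, filled with newlines so blocks are really allocated."""
    with open(path, "wb") as f:
        f.write(b"\n" * size)
        f.flush()
        os.fsync(f.fileno())


def write_emergency(path, offset, message):
    """Overwrite part of the reserve in place.

    Returns the offset for the next message, or the unchanged offset
    if the message does not fit in what is left of the reserve.
    """
    data = message.encode("utf-8", "replace") + b"\n"
    if offset + len(data) > os.path.getsize(path):
        return offset
    with open(path, "r+b") as f:   # "r+b" never truncates or grows the file here
        f.seek(offset)
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return offset + len(data)
```

The file never grows, so a full disk on its own should not stop these writes; what to do when the reserve itself fills up (wrap around, or stop) is left open.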
Replace the last log entry with a new log entry that says the log is full, and silently discard any subsequent log entries. This preserves the rest of the log, and provides an indicator of what happened to the lost log entries.
If Dan’s comment had been an answer then I would have voted for it, but since it isn’t…
There are two kinds of errors where recovery is pure luck, if you can even get far enough to attempt recovery: “Out of Memory” and “Out of Disk Space”. Most applications therefore don’t even try to recover if either of these happens. They simply crash.
As Dan’s comment said, the only way for applications to safely and reliably handle these types of errors is to ensure that they never occur. That requires monitoring disk space and memory, and when either drops below a certain threshold the application can try to free some up. If that doesn’t work (because the application isn’t responsible for the loss of memory), the application needs to perform an orderly exit before memory is exhausted.
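For the disk side of that, a threshold check might look something like the sketch below. It uses `shutil.disk_usage` from the standard library; the 100 MB default and the `free_up` / `shut_down` callbacks are assumptions for illustration. Memory monitoring is left out because the standard library has no portable call for it.

```python
import shutil


def disk_space_low(path, min_free_bytes=100 * 1024 * 1024):
    """True when the filesystem holding `path` has less free space than the threshold."""
    return shutil.disk_usage(path).free < min_free_bytes


def check_space(path, min_free_bytes, free_up, shut_down):
    """Run periodically: try to reclaim space first, exit in an orderly way if that fails."""
    if not disk_space_low(path, min_free_bytes):
        return "ok"
    free_up()                          # e.g. delete old rotated logs (hypothetical callback)
    if not disk_space_low(path, min_free_bytes):
        return "recovered"
    shut_down()                        # orderly exit while there is still room to do so
    return "shutdown"
```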
As for what to do when writing a log entry fails: my logging always runs in a background thread and makes use of message queues. If writing the log entry fails, the LogThread pushes the LogEntry back onto the front of the message queue and then attempts to open a new log file for writing. If a new file is opened, it carries on processing log entries from the queue as normal. If not, it waits a while and tries again. There is not much more you can do other than keep trying. You also want to limit the number of entries you allow in your message queue, or you will eventually hit the “Out of Memory” error because all of your log entries are stored in memory.
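The approach described here could be sketched roughly as follows. This is not the poster's actual code: the class layout, `open_sink`, `max_entries` and `retry_delay` are all assumptions, and a `deque` guarded by a `Condition` is used so a failed entry can go back to the front of the queue.

```python
import threading
import time
from collections import deque


class LogThread(threading.Thread):
    def __init__(self, open_sink, max_entries=10000, retry_delay=1.0):
        super().__init__(daemon=True)
        self._open_sink = open_sink       # callable returning a writable file-like object
        self._sink = None
        self._entries = deque()
        self._max = max_entries
        self._retry_delay = retry_delay
        self._cond = threading.Condition()
        self._stopping = False

    def log(self, entry):
        with self._cond:
            if len(self._entries) >= self._max:
                return False              # queue full: drop rather than exhaust memory
            self._entries.append(entry)
            self._cond.notify()
            return True

    def stop(self):
        with self._cond:
            self._stopping = True
            self._cond.notify()

    def run(self):
        while True:
            with self._cond:
                while not self._entries and not self._stopping:
                    self._cond.wait()
                if not self._entries:
                    return                # asked to stop and nothing left to write
                entry = self._entries.popleft()
            try:
                if self._sink is None:
                    self._sink = self._open_sink()   # try to open a new log file
                self._sink.write(entry + "\n")
                self._sink.flush()
            except OSError:
                self._sink = None         # force a fresh open on the next attempt
                with self._cond:
                    self._entries.appendleft(entry)  # back onto the front of the queue
                    if self._stopping:
                        return            # give up on shutdown; remaining entries are lost
                time.sleep(self._retry_delay)        # wait a while, then try again
```

Callers just use `log(...)`, which never blocks on disk I/O and returns `False` when the cap is hit, so the application can at least count what it dropped.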
I suppose other possible options include:
- Write to some other type of log (e.g. Windows Application Log).
- Send data out a serial port.
- Connect to an external error server and send messages to it.
- Ensure your system has backup storage (or check if any additional storage is attached) and write to it.
- Check if a printer is attached and write to it.
- Some boards have NVRAM, write to that.
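Whichever of these options are available, they can be chained so the application tries each in turn. A minimal sketch, assuming every sink is just a callable that raises on failure; the sink names and `log_with_fallback` are illustrative only.

```python
import sys


def log_with_fallback(message, sinks):
    """Try each (name, write) pair in order; return the name of the first that succeeds."""
    for name, write in sinks:
        try:
            write(message)
            return name
        except Exception:
            continue                   # this sink is unavailable, try the next one
    return None                        # every sink failed


def write_stderr(message):
    """Last-resort sink: standard error, which needs no disk space."""
    print(message, file=sys.stderr)
```

A caller might pass something like `[("file", write_file), ("remote", send_to_server), ("stderr", write_stderr)]`, where `write_file` and `send_to_server` are whatever the application already has.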
How should the application ideally handle the situation where it is completely unable to write to its log? Say due to a lack of disk space, exhausted file handle count, or a permissions/access issue.
If you cannot write to the logs, you cannot write to the logs. It’s pretty much out of your control until the cause is identified and it’s resolved. In the meantime, for the sake of notification and falling back to at least some sort of logging, write what you’d typically log out to somewhere else.
As a temporary measure, maybe write to a remote store? You rely on a connection to this remote location, which could present another set of problems, but at least you (hopefully) escape a lot of the issues that might be in the way of you logging locally (disk space, permissions etc.). Alternatively (last resort), notify the relevant people of exceptions via e-mail. I would do this anyway the instant there’s an issue logging exceptions.