What are some patterns and anti-patterns of application logging? [closed]

I recently had to investigate a field issue for our large enterprise application. I was horrified by the logs that I had to comb through in an attempt to find the problem and at the end of the day the logs did not help at all identifying/isolating the bug.

Note: I understand not all bugs are discoverable through logs. This does not change the fact that the logs are horrible.

There are some obvious problems with our logging that we can already attempt to fix. I do not want to list those here and I cannot simply show you our log files so you can give advice on what to do.

Instead, in order to assess how bad we are doing on the logging front, I would like to know:

What are some guidelines, if any, when it comes to logging for an application, especially large application.
Are there any patterns we should follow or anti-patterns we should be aware of?
Is this an important thing to fix or can it even be fixed or all log files are simply huge and you need supplemental scripts to analyze them?

Side note: we use log4j.

A few points that my practice proved useful:

Keep all logging code in your production code. Have an ability to enable more/less detailed logging in production, preferably per subsystem and without restarting your program.
Make logs easy to parse by grep and by eye. Stick to several common fields at the beginning of each line. Identify time, severity, and subsystem in every line. Clearly formulate the message. Make every log message easy to map to its source code line.
If an error happens, try to collect and log as much information as possible. It may take long but it’s OK because normal processing has failed anyway. Not having to wait when the same condition happens in production with a debugger attached is priceless.

Logs are mostly needed for monitoring and troubleshooting. Put yourself in a troubleshooter’s shoes and think what kind of logs you’d like to have when something wrong is happening or has happened in the dead of night.

I work with safety critical real-time systems and logging is often the only way to catch rare bugs that turn up once a blue moon on every 53rd tuesday when it’s a full moon, if you catch my drift. This kind of makes you obsessive about the subject, so I’ll apologise now if I start to froth at the mouth. The following was written for native code debug logs, but most of it is applicable to the managed world too…

Use text log files. Seems obvious, but some people do try to generate binary log files: that’s just dumb because I don’t need to be looking for a reader tool when I’m out in the field. Plus if it’s text and the debug is verbose there’s a good chance the field engineer can read the file and diagnose the problem without ever coming back to me. Everybody wins.

I design systems which are capable of logging pretty much everything, but I don’t turn everything on by default. The debug information is sent to a hidden debug dialog which timestamps it and outputs it to a listbox (limited to around 500 lines before deletion), and the dialog allows me to stop it, save it to a log file automatically, or divert it to an attached debugger. That diversion allows me to see the debug output from multiple applications all neatly serialized, which can be a life saver sometimes. I used to use numeric logging levels (the higher you set the level, the more you capture):

off
errors only
basic
detailed
everything

but this is too inflexible – as you work your way towards a bug it’s much more efficient to be able to focus logging in on exactly what you need without having to wade through tons of detritus, and it may be one particular kind of transaction or operation that causes the error. If that requires you to turn everything on, you’re just making your own job harder. You need something finer-grained.

So now I’m in the process of switching to logging based on a flag system. Everything that gets logged has a flag detailing what kind of operation it is, and there’s a set of checkboxes allowing me to define what gets logged. Typically that list looks like this:

#define DEBUG_ERROR          1
#define DEBUG_BASIC          2
#define DEBUG_DETAIL         4
#define DEBUG_MSG_BASIC      8
#define DEBUG_MSG_POLL       16
#define DEBUG_MSG_STATUS     32
#define DEBUG_METRICS        64
#define DEBUG_EXCEPTION      128
#define DEBUG_STATE_CHANGE   256
#define DEBUG_DB_READ        512
#define DEBUG_DB_WRITE       1024
#define DEBUG_SQL_TEXT       2048
#define DEBUG_MSG_CONTENTS   4096

This logging system ships with the release build, turned on and saving to file by default. It’s too late to find out you should have been logging AFTER the bug has occurred, if that bug only occurs once every six months on average and you have no way of reproducing it. Logging that only works with debug builds is just. plain. dumb.

The software typically ships with ERROR, BASIC, STATE_CHANGE and EXCEPTION turned on, but this can be changed in the field via the debug dialog (or a registry/ini/cfg setting, where these things get saved).

Oh and one thing – my debug system generates one file per day. Your requirements may be different. But make sure your debug code starts every file with the date, version of the code you’re running, and if possible some marker for the customer ID, location of the system or whatever. You can get a mish-mash of log files coming in from the field, and you need some record of what came from where and what version of the system they were running that’s actually in the data itself, and you can’t trust the customer/field engineer to tell you what version they’ve got – they may just tell you what version they THINK they’ve got. Worse, they may report the exe version that’s on the disk, but the old version is still running because they forgot to reboot after replacing. Have your code tell you itself.

Lastly, you don’t want your code to generate its own problems, so put in a timer function to purge the log files after so many days or weeks (just check the difference between time now and time of file creation). This is OK for a server app that runs all the time, on a client side app you can get by with purging any old data when you start up. We typically purge after 30 days or so, on a system without frequent engineer visits you might want to leave it longer. Obviously this is dependent upon the size of your log files as well.

My favorite public resource for logging guidelines are Apache JCL Best Practices.

Best practices for JCL are presented in two categories: General and Enterprise. The general principles are fairly clear. Enterprise practices are a bit more involved and it is not always as clear as to why they are important.

Enterprise best-practice principles apply to middleware components and tooling that is expected to execute in an “Enterprise” level environment. These issues relate to Logging as Internationalization, and fault detection. Enterprise requires more effort and planning, but are strongly encouraged (if not required) in production level systems. Different corporate enterprises/environments have different requirements, so being flexible always helps…

Despite targeting JCL, these seem to be generic enough to be adopted for logging in general.

^{My personal “guidelines” for logging is that at debug level, I try to make my logs read like a story – with understandable logic and sufficient (but not overloaded) details.}

Most famous anti-pattern is probably “swallowing exceptions” – just search the web for it.

As for huge logging files, in my practice this was mostly the normal case. And yes, supplemental scripts as you call them and/or tools like Chainsaw also look normal to me.

^{Above doesn’t mean though that you must always blindly put all the logs in one huge file. Sometimes it could be useful to write / copy some of the logs to separate files. Eg in my recent project QA guys asked for dedicated files for metrics and timing data and for brief reports on system operations. They said they will benefit from that and dev did it (benefit from brief reports file turned out indeed significant).}

PS. Regarding anti-patterns, others that come to mind are “flooding” and senseless messages.

I call it flooding when I see multiple similar messages coming from a loop with many iterations. To me, flooding is annoying enough to try to get rid of it when I detect it in source code. Usually improving it requires some art – because, well, things that happen within the loop may be interesting. When I don’t have time to improve it deeper, I try to at least change logging level of such messages to lowest one to make it easier to filter out.
Senseless messages seem to be pretty popular garbage. These look harmless when read in source code – I guess one has to go through the pain of analyzing debug output looking like…
```
step #1
step #2
step #3
```
…to deeply appreciate their inherent ugliness. My favorite heuristics to detect this kind of problems at source code level (proposed by colleague in one of my past projects) is to calculate number of space symbol occurrences in string literals used in logging. In my experience, zero spaces basically guarantees the logging statement is senseless, one space is also a good indicator of the potential issue.

Log the exception only once!

One of the common pain points that I have noticed is logging and rethrowing an exception.
As a result, the log files contains the same exceptions several times on several stack levels.

Here’s an anti-pattern: Making two dozen “genericvariable” fields in a database table to track anything conceivable and then having 88 (and counting) different enum values for different types of logs.

My experience with logs is the bigger the better, but be consistent enough to make it filterable by machine, and be able to configure a severity level for every component of your application individually.

Also, it’s very difficult to predict what logging you will need to find a future bug. Most of the obvious places to log for bugs are fixed before the product goes out the door. It’s not uncommon for the result of a bug report to be that you just added logging to help diagnose it if it happens again.

Couple of notes from the operations side of the house here:

1) Make sure the logs are locally configurable, preferably with a tool no heavier than a text editor. Most of the time we don’t want to get TRACE level logging, but we love to be able to turn it on.

2) If at all possible, make sure the logs can be read with a tool no heavier than a text editor. Nothing is worse than having to go on a tool hunt at an odd hour when the production system is failing.

From my own experience working with web-applications:

(& considering storage is very cheap now-days)

Log as much available (at that very moment) info as you can.
I always include DateTime.Now in my log-strings.
I always (if it’s possible) log time-duration of some specific “action”.
Be consistent with your log-strings. Since always I use this kind of pattern:
- “[Info X] [Info Y] [Info Z] [etc.]”

Apart from the stacktrace, log the current application state and the input.

Software is deterministic, these two are usually the only thing you need to reproduce the bug. Storing the full state might in some cases be troublesome so ways to reproduce the current state, by for example previous inputs, are also good.

Of course more data is always better but at minimum these two are a good start for the easiest crashes.

Trang chủ Giới thiệu Sinh nhật bé trai Sinh nhật bé gái Tổ chức sự kiện Biểu diễn giải trí Dịch vụ khác Trang trí tiệc cưới Tổ chức khai trương Tư vấn dịch vụ Thư viện ảnh Tin tức - sự kiện Liên hệ Chú hề sinh nhật Trang trí YEAR END PARTY công ty Trang trí tất niên cuối năm Trang trí tất niên xu hướng mới nhất Trang trí sinh nhật bé trai Hải Đăng Trang trí sinh nhật bé Khánh Vân Trang trí sinh nhật Bích Ngân Trang trí sinh nhật bé Thanh Trang Thuê ông già Noel phát quà Biểu diễn xiếc khỉ Xiếc quay đĩa Dịch vụ tổ chức sự kiện 5 sao Thông tin về chúng tôi Dịch vụ sinh nhật bé trai Dịch vụ sinh nhật bé gái Sự kiện trọn gói Các tiết mục giải trí Dịch vụ bổ trợ Tiệc cưới sang trọng Dịch vụ khai trương Tư vấn tổ chức sự kiện Hình ảnh sự kiện Cập nhật tin tức Liên hệ ngay Thuê chú hề chuyên nghiệp Tiệc tất niên cho công ty Trang trí tiệc cuối năm Tiệc tất niên độc đáo Sinh nhật bé Hải Đăng Sinh nhật đáng yêu bé Khánh Vân Sinh nhật sang trọng Bích Ngân Tiệc sinh nhật bé Thanh Trang Dịch vụ ông già Noel Xiếc thú vui nhộn Biểu diễn xiếc quay đĩa Dịch vụ tổ chức tiệc uy tín Khám phá dịch vụ của chúng tôi Tiệc sinh nhật cho bé trai Trang trí tiệc cho bé gái Gói sự kiện chuyên nghiệp Chương trình giải trí hấp dẫn Dịch vụ hỗ trợ sự kiện Trang trí tiệc cưới đẹp Khởi đầu thành công với khai trương Chuyên gia tư vấn sự kiện Xem ảnh các sự kiện đẹp Tin mới về sự kiện Kết nối với đội ngũ chuyên gia Chú hề vui nhộn cho tiệc sinh nhật Ý tưởng tiệc cuối năm Tất niên độc đáo Trang trí tiệc hiện đại Tổ chức sinh nhật cho Hải Đăng Sinh nhật độc quyền Khánh Vân Phong cách tiệc Bích Ngân Trang trí tiệc bé Thanh Trang Thuê dịch vụ ông già Noel chuyên nghiệp Xem xiếc khỉ đặc sắc Xiếc quay đĩa thú vị

Filed under: softwareengineering - @ 22:00

Thẻ: logging

Thiết kế website giá rẻ

Danh mục

What are some patterns and anti-patterns of application logging? [closed]