Much of performance improvement, security, and troubleshooting work is reactive. While there’s no avoiding this fact, implementing a Daily Log Review policy puts IT staff in a more proactive position, using each day to build a set of rules that alerts them to potential problems before those problems become serious.
How often does your team find, while troubleshooting an issue, that there were small warning signs leading up to the incident but no one had noticed them, or there had been no context to make it clear that they pointed to a real problem? Hopefully you don’t encounter such situations very often, but it happens. A daily log review policy can help catch some of these issues earlier, before they bog down or, worse, break your servers.
One advantage of building time for daily log review into your staff’s routines is that they become more familiar with the day-to-day workings of your systems. That familiarity in turn makes unusual occurrences stand out more than they would have before.
In its ideal form, daily log review has the IT staff look through your infrastructure’s collective logs every day, building rules to tag events as either harmless or warranting investigation. After looking into suspect events, those system administrators (or whoever is responsible for review) double-check that none of the rules are too broad and generating false matches. Rules that don’t quite hit the mark are adjusted, and the next day the cycle starts again.
Given how large an enterprise infrastructure gets, the idea of trying to review everything in a month, let alone a day, might give you hives. It’s up to you to determine how exhaustive or relaxed your approach to this process should be.
Getting Ready
To get the most out of daily log review, it’s best to collect all of your infrastructure’s network, application, and server logs in a single place, or to index them into a tool where they can be processed simultaneously. Pulling everything together allows you to build one set of rules that apply to similar logs across every machine at once and ultimately reduces your workload. Doing so also allows you to troubleshoot issues faster because you can look at all of those logs simultaneously.
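As a rough illustration of what pulling logs together buys you, the sketch below merges two per-host logs into a single time-ordered view. The host names, the abbreviated messages, and the merged_timeline helper are all hypothetical, and a real deployment would rely on a log collection tool rather than in-memory lists.

```python
import heapq
from datetime import datetime

# Hypothetical per-host logs, each already sorted by time.
# Each entry is (timestamp, host, message); the messages are abbreviated.
web_log = [
    (datetime(2010, 7, 24, 6, 2, 3), "web01", "python_init: Python version mismatch"),
    (datetime(2010, 7, 24, 6, 8, 55), "web01", "PHP Warning: POST Content-Length exceeds the limit"),
]
auth_log = [
    (datetime(2010, 7, 24, 6, 5, 0), "db01", "Failed password for root"),
]


def merged_timeline(*logs):
    """Merge already-sorted logs into one list ordered by timestamp."""
    return list(heapq.merge(*logs, key=lambda entry: entry[0]))


timeline = merged_timeline(web_log, auth_log)
for timestamp, host, message in timeline:
    print(timestamp.isoformat(), host, message)
```

With every event on one timeline, a single rule can be evaluated against all hosts, and an event on one machine can be read in the context of what the others were doing at the same moment.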
Daily Log Review is tool-agnostic, but given the volume of data, software can make the process much more manageable. Depending on your preferences and your staff’s skills, your people might be happiest with shell or Perl scripts, though many won’t be. Solutions such as ArcSight, LogLogic, Splunk, and others allow you to collect or index logs centrally or through a distributed setup. Look for tools that search your logs in real time for events that match user-created rules and fire off alerts when specific conditions are met. The more flexibility you have to add tags or fields, the better.
Before anyone on your staff gets down to the business of building rules, be sure to set policy on how harmless and problem events will be marked. You might have everyone tag them as ok or not_ok, or harmless and suspicious. Sometimes it’s hard to decide what is really harmless, as some events are perfectly harmless unless they’re in combination with others. For example, one failed SSH login isn’t a big deal, but a large number of them in a short span of time is cause for worry. Consider building a wiki that you and the IT staff can use together to make policy notes as you all grow used to doing review.
Depending on the tool you choose, you can probably build rules to watch for these combined event cases. Even better, if you’re required to do review due to regulations, some tools also offer auditing features so you can prove that your staff is following the necessary protocols.
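To make the combined-event idea concrete, here is a minimal sketch of a sliding-window check of the kind that could flag a burst of failed SSH logins. The burst_detected function, the threshold of five, and the one-minute window are assumptions chosen for illustration, not settings from any particular product.

```python
from collections import deque
from datetime import datetime, timedelta


def burst_detected(timestamps, threshold=5, window=timedelta(minutes=1)):
    """Return True if at least `threshold` events fall within one `window`."""
    recent = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events too old to share a window with the newest one.
        while ts - recent[0] > window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False


# Hypothetical failed-login times, ten seconds apart.
start = datetime(2010, 7, 24, 6, 0, 0)
failed_logins = [start + timedelta(seconds=10 * i) for i in range(6)]

print(burst_detected(failed_logins))      # six failures inside one minute
print(burst_detected(failed_logins[:1]))  # a single failure
```

A single failure never reaches the threshold, so it stays harmless, while the same event repeated quickly crosses it and would earn a not_ok tag or an alert.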
The Process
Daily Log Review can sound more complicated than it is when discussed generically, so let’s focus on an example. Say that you expect each administrator to spend an hour a day on this process, focusing on the systems for which they’re responsible. The order of tasks should roughly follow this model, adjusted for how much time you want to spend and how much detail you want to go into.
- Start by searching for the day’s not_ok or suspicious events that are already flagged.
Investigate the suspicious events or call them to the attention of the right person. For example, a rule might look for a collection of terms in a single event, such as error PHP POST Content-Length exceeds limit. A message such as this one from your web server may then warrant investigation into whether the size limit for uploads through the HTTP POST method needs to be increased:
[Sat Jul 24 06:08:55 2010] [error] [client ::1] PHP Warning: POST Content-Length of 20556051 bytes exceeds the limit of 8388608 bytes in Unknown on line 0, referer: http://localhost/wordpress/wp-admin/theme-install.php?tab=upload
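A rule that looks for a collection of terms in a single event can be sketched as a simple all-terms check. The matches_all_terms function and the RULE_TERMS list are illustrative assumptions; each log tool has its own search syntax for expressing the same idea.

```python
def matches_all_terms(event, terms):
    """True when every term appears somewhere in the event, ignoring case."""
    text = event.lower()
    return all(term.lower() in text for term in terms)


# Terms taken from the rule described above.
RULE_TERMS = ["error", "PHP", "POST", "Content-Length", "exceeds", "limit"]

event = (
    "[Sat Jul 24 06:08:55 2010] [error] [client ::1] PHP Warning: "
    "POST Content-Length of 20556051 bytes exceeds the limit of 8388608 bytes "
    "in Unknown on line 0, referer: "
    "http://localhost/wordpress/wp-admin/theme-install.php?tab=upload"
)

tag = "not_ok" if matches_all_terms(event, RULE_TERMS) else None
print(tag)
```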
If the administrator decides that false matches are pulling up harmless events, she might adjust the rules so they no longer match those events, or follow whatever protocol you have chosen for suggesting a rule change. For example, someone might have a rule looking for Python errors, but the administrator may feel that the following event really doesn’t merit investigation, as it’s fairly minor:
[Sat Jul 24 06:02:03 2010] [error] python_init: Python version mismatch, expected '2.6', found '2.6.4'.
The administrator might suggest adding Python version mismatch as an exclusion to the current rule, and then creating a second rule assigning error python_init version mismatch an ok tag.
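The exclusion-plus-second-rule adjustment might look like the following sketch. The Rule class, its include and exclude fields, and the tag_event helper are hypothetical names used only to show the shape of the change, and the Traceback event is an invented example of a Python error that should still be flagged.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    tag: str
    include: list                                 # every term must appear
    exclude: list = field(default_factory=list)   # any phrase here vetoes a match

    def matches(self, event):
        text = event.lower()
        if any(phrase.lower() in text for phrase in self.exclude):
            return False
        return all(term.lower() in text for term in self.include)


RULES = [
    # Existing Python error rule, now with the suggested exclusion.
    Rule(tag="not_ok", include=["error", "python"],
         exclude=["Python version mismatch"]),
    # Second rule that marks the minor mismatch as harmless.
    Rule(tag="ok", include=["error", "python_init", "version mismatch"]),
]


def tag_event(event, rules=RULES):
    """Return the tag of the first matching rule, or None if untagged."""
    for rule in rules:
        if rule.matches(event):
            return rule.tag
    return None


mismatch = ("[Sat Jul 24 06:02:03 2010] [error] python_init: "
            "Python version mismatch, expected '2.6', found '2.6.4'.")
traceback_event = "[error] python: Traceback (most recent call last)"  # invented

print(tag_event(mismatch))         # ok
print(tag_event(traceback_event))  # not_ok
```

Keeping the exclusion on the broad rule and the ok tag on a narrow one means the mismatch still shows up in the ok search, where it can be rechecked later.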
- Search for the day’s untagged events (neither ok nor not_ok).
Create rules and tag the untagged events. Here’s where a lot of the heavy lifting occurs in Daily Log Review. This is the spot where you might be most likely to suggest a time limit – or your staff may not get anything else done. Fortunately, since you’re working in a centralized tool across all of your logs at once, each rule created applies across your whole infrastructure.
- Search for the day’s ok events that are already flagged.
Look again for false matches. You don’t want problems slipping by because a rule was too broad and is tagging unforeseen problem events as ok. Either adjust the rules or follow protocol for rule change discussion.
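The three daily searches above can be sketched as one pass that sorts tagged events into review queues. The split_for_review function and the sample events are hypothetical; in practice each queue would be a saved search in your log tool.

```python
def split_for_review(tagged_events):
    """Group (event, tag) pairs into the three daily searches, in review order."""
    queues = {"not_ok": [], "untagged": [], "ok": []}
    for event, tag in tagged_events:
        key = tag if tag in ("ok", "not_ok") else "untagged"
        queues[key].append(event)
    return queues


# Hypothetical day of events with the tags the rules assigned (None = no rule matched).
todays_events = [
    ("PHP Warning: POST Content-Length exceeds the limit", "not_ok"),
    ("python_init: Python version mismatch", "ok"),
    ("disk quota warning on /var", None),
]

queues = split_for_review(todays_events)
for name, events in queues.items():
    print(name, len(events))
```

Reviewing the queues in this order puts known problems first, spends the bulk of the time on untagged events, and finishes with a sanity check of what the rules have declared harmless.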
Over time, you and your staff will find a rhythm that works for your organization. Everyone will become more familiar with the daily workings of your infrastructure. People will think to create rules on their own when they see problems, so that those problems trigger an alert or at least show up in the not_ok searches if they recur. There will still be times when you’re trapped in reactive mode, but they’ll be fewer. And your staff will have these handy new tools at their disposal when emergencies do occur.