[nylug-workshop] Log analyzer (Re: Regular meetings of the Python workshop)
Peter C. Norton
spacey-nylug-workshop at lenin.net
Sun Feb 25 04:43:38 EST 2007
On Wed, Feb 14, 2007 at 01:48:54PM -0500, Yusuke Shinyama wrote:
> Okay, I just put this up. See
> http://www.unixuser.org/~euske/python/logweeder/
>
> I know the documentation is still crappy.
> Please feel free to point me out any weird English errors! ;)
I had a fun time looking at this. Most of the techniques in analyzing
the differences is new to me, but I thnk I've got a feel for how it
works.
Attached I've got a patch that changes a bit of how this works:
bugs:
1) I think the third element of each output patterrn line can run into
a situation where the string to be eval()'d by match.py can get
through with unescaped single quotes, or at least this happened to me
while I was fooling around with it, and I don't see the original code
protecting from that. Added a call to repr() to fix that.
changes for consideration:
1) Added a configuration file. this is mostly for fun, but I think
it's a useful way to experiment
2) Created a default (configurable) pattern file that makes patterns
persistant and automatically handled within the program. The names of
the patterns get maintained, and new can be added to old.
3) When learning a new logfile, existing patterns are used to filter
out unwatned elements. I think that this creates a substantial
improvement in the speed (the following is for an about 1000 line
syslog messages file - very repetative):
spacey at brick:~/src/logweeder-0.1$ time ./learn.py /var/log/messages > patterns
[...]
real 2m47.487s
user 2m30.941s
sys 0m1.400s
spacey at brick:~/src/logweeder-0.1$ grep 'type-' patterns | wc -l
160
spacey at brick:~/src/logweeder-0.1-pn$ rm patterns
spacey at brick:~/src/logweeder-0.1-pn$ time for i in $(seq 1 10) ; do \
./learn.py -s 100 /var/log/messages; done
real 0m12.247s
user 0m10.597s
sys 0m0.264s
spacey at brick:~/src/logweeder-0.1-pn$ wc -l patterns
35 patterns
The way that the patterns are created, I thought that the new version
sould be exposed to as man if not more strings from which to create
patterns due to the filtering step, but I may be wrong:
spacey at brick:~/src/logweeder-0.1$ ./match.py patterns
/var/log/messages | egrep -v 'type-'
unknown: Feb 21 23:44:34 brick gconfd (root-4802): GConf server is not
in use, shutting down.
unknown: Feb 22 07:51:00 brick gconfd (spacey-4525): SIGHUP received,
reloading all databases
spacey at brick:~/src/logweeder-0.1-pn$ ./match.py patterns
/var/log/messages | egrep -v 'type-'
unknown: Feb 21 23:44:34 brick gconfd (root-4802): GConf server is not
in use, shutting down.
unknown: Feb 22 07:51:00 brick gconfd (spacey-4525): SIGHUP received,
reloading all databases
unknown: Feb 24 13:43:22 brick kernel: [ 1189.708190] ohci_hcd
0000:00:13.1: wakeup
unknown: Feb 24 13:43:22 brick kernel: [ 1189.878617] usb 2-3: new low
speed USB device using ohci_hcd and address 2
unknown: Feb 24 13:43:22 brick kernel: [ 1189.971468] usb 2-3:
configuration #1 chosen from 1 choice
unknown: Feb 24 13:43:23 brick kernel: [ 1190.083431] input: Logitech
USB Trackball as /class/input/input3
unknown: Feb 24 13:43:23 brick kernel: [ 1190.083976] input: USB HID
v1.10 Mouse [Logitech USB Trackball] on usb-0000:00:13.1-3
unknown: Feb 24 13:43:23 brick kernel: [ 1190.084351]
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
I think this may be a bug that I created, though. It doesn't seem to
want to create patterns for things it's seeing only once, which is
still desired. I'm going to work on that later, when I've slept.
In exchange for the improved time to learn patterns, it's likely that
I'm breaking something about the behavior of the cluster class
somewhat and possibly allowing for patterns to be created for related
entires that aren't going to be localized in the same pattern, but I
suspect that in the case of log files it's unlikely for this to
happen.
but something is weird here. If I look at the first few patterns in
the original, get things like this that I just don't think that I should:
('type-0', 212, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-1', 205, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-2', 198, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-3', 187, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-4', 187, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-5', 187, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-6', 151, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-7', 128, '^brick\\ kernel\\:\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-8', 107, '^brick\\ kernel\\:\\ \\[.*\\ .*\\ .*\\ .*\\ ')
I think these patterns should be impossible since 0-6 appear to be
identical, etc. However, thatt's how it is right now.
things I'm not sure will fly:
1) The way I handle the pattern file, the sample data gets thrown
out. this isn't good, but I'm too tired now.
2) There's some debuging gunk lying around. comment out the
sys.sterr.writes you don't like.
-Peter
(see attached for patches)
--
The 5 year plan:
In five years we'll make up another plan.
Or just re-use this one.
-------------- next part --------------
--- logweeder-0.1/learn.py 2007-02-14 13:30:48.000000000 -0500
+++ logweeder-0.1-pn/learn.py 2007-02-25 03:37:22.000000000 -0500
@@ -27,10 +27,12 @@
#
import sys, re
+import os
+import base64
+
stdout = sys.stdout
stderr = sys.stderr
-
# find_lcs
def find_lcs(s1, s2):
# length table: every element is set to zero.
@@ -69,6 +71,9 @@
# find_lcs_len:
# faster version of find_lcs, only for calculating similarities.
def find_lcs_len(s1, s2):
+ '''
+Compares s1 and s2, and returns the number of common elements in the two arrays.
+ '''
m = [ [ 0 for x in s2 ] for y in s1 ]
for p1 in range(len(s1)):
for p2 in range(len(s2)):
@@ -104,11 +109,42 @@
# Read the log entries.
-def read_log(files, charskip=16, nsamples=100):
+# XXX Why is nsamples defaulted to 100 here and 1000 in main()?
+def read_log(files, charskip=16, nsamples=0, patterns = None):
+ '''
+Reads in all logfiles listed in the first arg, <files>. It strips out
+<charskip> number of characters at the beginning of each line (default
+16, removing the syslog timestamp).
+
+Due to the O(n^2) number of comparisons, the number of lines input for
+the sample is reduced. A randomized subset <nsamples> of the lines
+are returned to be processed. If nsamples <= 0, all entries are read.
+
+If there is already a list of patterns, then they are excldued.
+ '''
import fileinput
stderr.write('Reading...'); stderr.flush()
+ stderr.write("%d %s\n" % (charskip, files)); stderr.flush()
entries = list( line[charskip:].strip() for line in fileinput.input(files) )
- if nsamples:
+ if patterns != None:
+ todel = []
+ for count in range(len(entries)):
+ e = entries[count]
+ for exp in patterns:
+ if exp.search(e):
+ todel.append(count)
+ continue
+ if len(todel) > 0:
+ u_list = [] # Uniq'd list
+ [u_list.append(num) for num in todel if num not in u_list]
+ u_list.reverse()
+ stderr.write("Removing %d entries" % len(u_list)); stderr.flush()
+ for num in u_list:
+ del(entries[num])
+
+
+ # 0 or less means everything.
+ if nsamples > 0:
import random
random.shuffle(entries)
entries = entries[:nsamples]
@@ -118,8 +154,10 @@
## Cluster of log entries
##
+##
+## XXX: supporting a stable notion of a cluster would help
+## eg. type-1 would survive multiple generates of input.
class Cluster:
-
def __init__(self):
self.strs = {}
return
@@ -155,6 +193,7 @@
unmatched.append(''.join(s[i0+1:i1]))
if unmatched:
r.append(choose_wildcard(unmatched))
+
r.append(re.escape(s1[i]))
idx0 = idx1
return r
@@ -194,8 +233,31 @@
return set(clusters.itervalues())
+def get_exclusions(filter_file):
+ """
+Filters out the regular expressions in the configured pattern file
+ """
+ try:
+ f = file(filter_file)
+ except IOError:
+ return []
+
+ patterns = []
+ for line in f:
+ line = line.strip()
+ if not line or line.startswith('#'):
+ pass
+ elif line.startswith('('):
+ (name, freq, pat) = eval(line)
+ patterns.append(re.compile(pat))
+ f.close()
+
+ stderr.write('Got %d patterns to pre-exclude ' % len(patterns) )
+ return patterns
+
# main
def main(argv):
+ import weedconf
import getopt
def usage():
print 'usage: %s [-c charskip] [-s num_samples] [-t similarity_threshold] [file ...]' % argv[0]
@@ -204,25 +266,56 @@
(opts, args) = getopt.getopt(argv[1:], 'dc:s:t:')
except getopt.GetoptError:
return usage()
- debug = 0
- nsamples = 1000
- charskip = 16
- threshold = 0.7
+
+ cf = weedconf.cf()
+
+ debug = int(cf.get("logweeder", "debug"))
+ nsamples = int(cf.get("logweeder", "nsamples"))
+ charskip = int(cf.get("logweeder", "charskip"))
+ threshold = float(cf.get("logweeder", "threshold"))
+ p_file = cf.get("logweeder", "pattern_file")
for (k, v) in opts:
if k == '-d': debug += 1
elif k == '-c': charskip = int(v)
elif k == '-s': nsamples = int(v)
elif k == '-t': threshold = float(v)
- entries = read_log(args, charskip, nsamples)
+ patterns = get_exclusions(p_file)
+ entries = read_log(args, charskip, nsamples, patterns)
clusters = cluster_log(entries, threshold)
stderr.write('Generating regexp patterns.'); stderr.flush()
+
+ def make_pat_str(count, length, re_text):
+ # Without repr, creates strings that can break parsing by
+ # eval()
+ return "('type-%d',%d,%s)\n" % (count, length, repr("".join(re_text)))
+
+ # Re-generate the pattern file.
+ # I've got a feeling I've just fragmented the cluster notion, which is sub-optimal
+ # I'll figure that out next. -PN
+ tmpname = "".join([p_file, ".new"])
+ p = file(tmpname, "w")
+
+ # prime the pattern file with the existing patterns. I'm not sure that the
+ # frequency/match threshold is recorded, or how it's supposed to be used in match,
+ # so for now I'm fudging it to 1.
+ print patterns
+ for count in range(len(patterns)):
+ p_str = make_pat_str(count, 1, patterns[count].pattern)
+ print "Writing: %s" % p_str
+ p.write(p_str)
+
for (i,c) in enumerate(sorted(clusters, key=len, reverse=True)):
- print ('type-%d' % i, len(c), ''.join(c.fuse()))
+ p_str = make_pat_str(i + len(patterns), len(c), "".join(c.fuse()))
+ p.write(p_str)
+
for (s,score) in c.strs.iteritems():
print '#\t(%.2f) %s' % (score, ''.join(s))
- print
stderr.write('.'); stderr.flush()
stderr.write('\n'); stderr.flush()
+ p.flush()
+ p.close()
+ os.rename(tmpname, p_file)
+
return 0
if __name__ == '__main__': sys.exit(main(sys.argv))
--- logweeder-0.1/match.py 2007-02-14 13:31:08.000000000 -0500
+++ logweeder-0.1-pn/match.py 2007-02-25 03:43:37.000000000 -0500
@@ -45,6 +45,7 @@
def main(argv):
+ import weedconf
import getopt
def usage():
print 'usage: %s [-c charskip] [-t freq_threshold] patfile [file ...]' % argv[0]
@@ -53,11 +54,17 @@
(opts, args) = getopt.getopt(argv[1:], 'c:t:')
except getopt.GetoptError:
return usage()
- charskip = 16
- threshold = 0
+
+ cf = weedconf.cf()
+
+ debug = int(cf.get("logweeder", "debug"))
+ nsamples = int(cf.get("logweeder", "nsamples"))
+ charskip = int(cf.get("logweeder", "charskip"))
+ match_threshold = float(cf.get("logweeder", "match_threshold"))
+
for (k, v) in opts:
if k == '-c': charskip = int(v)
- elif k == '-t': threshold = int(v)
+ elif k == '-t': match_threshold = int(v)
if not args: return usage()
patfile = args.pop(0)
@@ -70,7 +77,7 @@
pass
elif line.startswith('('):
(name, freq, pat) = eval(line)
- if threshold <= freq:
+ if match_threshold <= freq:
pats.append((re.compile(pat), name))
else:
print >>stderr, 'Invalid line: %r' % line
--- /dev/null 2007-02-07 02:09:32.000000000 -0500
+++ logweeder-0.1-pn/weed.conf 2007-02-25 00:47:44.000000000 -0500
@@ -0,0 +1,9 @@
+[logweeder]
+charskip: 16
+conffile: weed.conf
+debug: 0
+nsamples: 1000
+threshold: 0.7
+match_threshold: 0
+pattern_file: patterns
+
--- /dev/null 2007-02-07 02:09:32.000000000 -0500
+++ logweeder-0.1-pn/weedconf.py 2007-02-25 00:01:00.000000000 -0500
@@ -0,0 +1,17 @@
+def cf(cfile="weed.conf"):
+
+ import ConfigParser
+ import os
+
+ defaults = {
+ "charskip":16,
+ "conf_file":"weed.conf",
+ "debug":0,
+ }
+ cp = ConfigParser.ConfigParser(defaults)
+ cp.read(os.path.expanduser(cfile))
+ return cp
+
+if __name__ == '__main__':
+ conf = weedconf()
+ print conf
More information about the nylug-workshop
mailing list