[nylug-workshop] Log analyzer (Re: Regular meetings of the Python workshop)

Peter C. Norton spacey-nylug-workshop at lenin.net
Sun Feb 25 04:43:38 EST 2007


On Wed, Feb 14, 2007 at 01:48:54PM -0500, Yusuke Shinyama wrote:
> Okay, I just put this up.  See
> http://www.unixuser.org/~euske/python/logweeder/
> 
> I know the documentation is still crappy. 
> Please feel free to point me out any weird English errors! ;)

I had a fun time looking at this. Most of the techniques in analyzing
the differences is new to me, but I thnk I've got a feel for how it
works.

Attached I've got a patch that changes a bit of how this works:

bugs:
1) I think the third element of each output patterrn line can run into
a situation where the string to be eval()'d by match.py can get
through with unescaped single quotes, or at least this happened to me
while I was fooling around with it, and I don't see the original code
protecting from that. Added a call to repr() to fix that.


changes for consideration:
1) Added a configuration file. this is mostly for fun, but I think
it's a useful way to experiment
2) Created a default (configurable) pattern file that makes patterns
persistant and automatically handled within the program.  The names of
the patterns get maintained, and new can be added to old.
3) When learning a new logfile, existing patterns are used to filter
out unwatned elements. I think that this creates a substantial
improvement in the speed (the following is for an about 1000 line
syslog messages file - very repetative):

spacey at brick:~/src/logweeder-0.1$ time ./learn.py /var/log/messages > patterns
[...]
real    2m47.487s
user    2m30.941s
sys     0m1.400s
spacey at brick:~/src/logweeder-0.1$ grep 'type-' patterns | wc -l
160

spacey at brick:~/src/logweeder-0.1-pn$ rm patterns 
spacey at brick:~/src/logweeder-0.1-pn$ time for i in $(seq 1 10) ; do \
  ./learn.py -s 100 /var/log/messages; done
real    0m12.247s
user    0m10.597s
sys     0m0.264s
spacey at brick:~/src/logweeder-0.1-pn$ wc -l patterns 
35 patterns

The way that the patterns are created, I thought that the new version
sould be exposed to as man if not more strings from which to create
patterns due to the filtering step, but I may be wrong:

spacey at brick:~/src/logweeder-0.1$ ./match.py patterns
/var/log/messages | egrep -v 'type-'
unknown: Feb 21 23:44:34 brick gconfd (root-4802): GConf server is not
in use, shutting down.
unknown: Feb 22 07:51:00 brick gconfd (spacey-4525): SIGHUP received,
reloading all databases


spacey at brick:~/src/logweeder-0.1-pn$ ./match.py patterns
/var/log/messages | egrep -v 'type-'
unknown: Feb 21 23:44:34 brick gconfd (root-4802): GConf server is not
in use, shutting down.
unknown: Feb 22 07:51:00 brick gconfd (spacey-4525): SIGHUP received,
reloading all databases
unknown: Feb 24 13:43:22 brick kernel: [ 1189.708190] ohci_hcd
0000:00:13.1: wakeup
unknown: Feb 24 13:43:22 brick kernel: [ 1189.878617] usb 2-3: new low
speed USB device using ohci_hcd and address 2
unknown: Feb 24 13:43:22 brick kernel: [ 1189.971468] usb 2-3:
configuration #1 chosen from 1 choice
unknown: Feb 24 13:43:23 brick kernel: [ 1190.083431] input: Logitech
USB Trackball as /class/input/input3
unknown: Feb 24 13:43:23 brick kernel: [ 1190.083976] input: USB HID
v1.10 Mouse [Logitech USB Trackball] on usb-0000:00:13.1-3
unknown: Feb 24 13:43:23 brick kernel: [ 1190.084351]
drivers/usb/input/hid-core.c: v2.6:USB HID core driver

I think this may be a bug that I created, though. It doesn't seem to
want to create patterns for things it's seeing only once, which is
still desired. I'm going to work on that later, when I've slept.

In exchange for the improved time to learn patterns, it's likely that
I'm breaking something about the behavior of the cluster class
somewhat and possibly allowing for patterns to be created for related
entires that aren't going to be localized in the same pattern, but I
suspect that in the case of log files it's unlikely for this to
happen.

but something is weird here. If I look at the first few patterns in
the original,  get things like this that I just don't think that I should:

('type-0', 212, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-1', 205, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-2', 198, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-3', 187, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-4', 187, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-5', 187, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-6', 151, '^brick\\ kernel.*\\:.*\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-7', 128, '^brick\\ kernel\\:\\ \\[.*\\ .*\\ .*\\ .*\\ ')
('type-8', 107, '^brick\\ kernel\\:\\ \\[.*\\ .*\\ .*\\ .*\\ ')

I think these patterns should be impossible since 0-6 appear to be
identical, etc. However, thatt's how it is right now.

things I'm not sure will fly:

1) The way I handle the pattern file, the sample data gets thrown
out. this isn't good, but I'm too tired now.
2) There's some debuging gunk lying around. comment out the
sys.sterr.writes you don't like.

-Peter
(see attached for patches)


-- 
The 5 year plan:
In five years we'll make up another plan.
Or just re-use this one.

-------------- next part --------------
--- logweeder-0.1/learn.py	2007-02-14 13:30:48.000000000 -0500
+++ logweeder-0.1-pn/learn.py	2007-02-25 03:37:22.000000000 -0500
@@ -27,10 +27,12 @@
 #
 
 import sys, re
+import os
+import base64
+
 stdout = sys.stdout
 stderr = sys.stderr
 
-
 # find_lcs
 def find_lcs(s1, s2):
   # length table: every element is set to zero.
@@ -69,6 +71,9 @@
 # find_lcs_len:
 # faster version of find_lcs, only for calculating similarities.
 def find_lcs_len(s1, s2):
+  '''
+Compares s1 and s2, and returns the number of common elements in the two arrays.
+  '''
   m = [ [ 0 for x in s2 ] for y in s1 ]
   for p1 in range(len(s1)):
     for p2 in range(len(s2)):
@@ -104,11 +109,42 @@
 
 
 # Read the log entries.
-def read_log(files, charskip=16, nsamples=100):
+# XXX Why is nsamples defaulted to 100 here and 1000 in main()?
+def read_log(files, charskip=16, nsamples=0, patterns = None):
+  '''
+Reads in all logfiles listed in the first arg, <files>. It strips out
+<charskip> number of characters at the beginning of each line (default
+16, removing the syslog timestamp).
+
+Due to the O(n^2) number of comparisons, the number of lines input for
+the sample is reduced.  A randomized subset <nsamples> of the lines
+are returned to be processed. If nsamples <= 0, all entries are read.
+
+If there is already a list of patterns, then they are excldued.
+  '''
   import fileinput
   stderr.write('Reading...'); stderr.flush()
+  stderr.write("%d %s\n" % (charskip, files)); stderr.flush()
   entries = list( line[charskip:].strip() for line in fileinput.input(files) )
-  if nsamples:
+  if patterns != None:
+    todel = []
+    for count in range(len(entries)):
+      e = entries[count]
+      for exp in patterns:
+        if exp.search(e):
+          todel.append(count)
+          continue
+    if len(todel) > 0:
+      u_list = [] # Uniq'd list
+      [u_list.append(num) for num in todel if num not in u_list]
+      u_list.reverse()
+      stderr.write("Removing %d entries" % len(u_list)); stderr.flush()
+      for num in u_list:
+        del(entries[num])
+          
+
+  # 0 or less means everything.
+  if nsamples > 0:
     import random
     random.shuffle(entries)
     entries = entries[:nsamples]
@@ -118,8 +154,10 @@
 
 ##  Cluster of log entries
 ##
+##
+## XXX: supporting a stable notion of a cluster would help
+## eg. type-1 would survive multiple generates of input.
 class Cluster:
-  
   def __init__(self):
     self.strs = {}
     return
@@ -155,6 +193,7 @@
           unmatched.append(''.join(s[i0+1:i1]))
       if unmatched:
         r.append(choose_wildcard(unmatched))
+
       r.append(re.escape(s1[i]))
       idx0 = idx1
     return r
@@ -194,8 +233,31 @@
   return set(clusters.itervalues())
 
 
+def get_exclusions(filter_file):
+  """
+Filters out the regular expressions in the configured pattern file
+  """
+  try:
+    f           = file(filter_file)
+  except IOError:
+    return []
+
+  patterns      = []
+  for line in f:
+    line = line.strip()
+    if not line or line.startswith('#'):
+      pass
+    elif line.startswith('('):
+      (name, freq, pat) = eval(line)
+      patterns.append(re.compile(pat))
+  f.close()
+
+  stderr.write('Got %d patterns to pre-exclude ' % len(patterns) )
+  return patterns
+
 # main
 def main(argv):
+  import weedconf
   import getopt
   def usage():
     print 'usage: %s [-c charskip] [-s num_samples] [-t similarity_threshold] [file ...]' % argv[0]
@@ -204,25 +266,56 @@
     (opts, args) = getopt.getopt(argv[1:], 'dc:s:t:')
   except getopt.GetoptError:
     return usage()
-  debug = 0
-  nsamples = 1000
-  charskip = 16
-  threshold = 0.7
+
+  cf = weedconf.cf()
+  
+  debug = int(cf.get("logweeder", "debug"))
+  nsamples = int(cf.get("logweeder", "nsamples"))
+  charskip = int(cf.get("logweeder", "charskip"))
+  threshold = float(cf.get("logweeder", "threshold"))
+  p_file = cf.get("logweeder", "pattern_file")
   for (k, v) in opts:
     if k == '-d': debug += 1
     elif k == '-c': charskip = int(v)
     elif k == '-s': nsamples = int(v)
     elif k == '-t': threshold = float(v)
-  entries = read_log(args, charskip, nsamples)
+  patterns = get_exclusions(p_file)
+  entries = read_log(args, charskip, nsamples, patterns)
   clusters = cluster_log(entries, threshold)
   stderr.write('Generating regexp patterns.'); stderr.flush()
+
+  def make_pat_str(count, length, re_text):
+    # Without repr, creates strings that can break parsing by
+    # eval()
+    return "('type-%d',%d,%s)\n" % (count, length, repr("".join(re_text)))
+
+  # Re-generate the pattern file.
+  # I've got a feeling I've just fragmented the cluster notion, which is sub-optimal
+  # I'll figure that out next. -PN
+  tmpname = "".join([p_file, ".new"])
+  p = file(tmpname, "w")
+
+  # prime the pattern file with the existing patterns. I'm not sure that the
+  # frequency/match threshold is recorded, or how it's supposed to be used in match,
+  # so for now I'm fudging it to 1.
+  print patterns
+  for count in range(len(patterns)):
+    p_str = make_pat_str(count, 1, patterns[count].pattern)
+    print "Writing: %s" % p_str
+    p.write(p_str)
+
   for (i,c) in enumerate(sorted(clusters, key=len, reverse=True)):
-    print ('type-%d' % i, len(c), ''.join(c.fuse()))
+    p_str = make_pat_str(i + len(patterns), len(c), "".join(c.fuse()))
+    p.write(p_str)
+
     for (s,score) in c.strs.iteritems():
       print '#\t(%.2f) %s' % (score, ''.join(s))
-    print
     stderr.write('.'); stderr.flush()
   stderr.write('\n'); stderr.flush()
+  p.flush()
+  p.close()
+  os.rename(tmpname, p_file)
+
   return 0
 
 if __name__ == '__main__': sys.exit(main(sys.argv))
--- logweeder-0.1/match.py	2007-02-14 13:31:08.000000000 -0500
+++ logweeder-0.1-pn/match.py	2007-02-25 03:43:37.000000000 -0500
@@ -45,6 +45,7 @@
 
 
 def main(argv):
+  import weedconf
   import getopt
   def usage():
     print 'usage: %s [-c charskip] [-t freq_threshold] patfile [file ...]' % argv[0]
@@ -53,11 +54,17 @@
     (opts, args) = getopt.getopt(argv[1:], 'c:t:')
   except getopt.GetoptError:
     return usage()
-  charskip = 16
-  threshold = 0
+
+  cf = weedconf.cf()
+
+  debug = int(cf.get("logweeder", "debug"))
+  nsamples = int(cf.get("logweeder", "nsamples"))
+  charskip = int(cf.get("logweeder", "charskip"))
+  match_threshold = float(cf.get("logweeder", "match_threshold"))
+
   for (k, v) in opts:
     if k == '-c': charskip = int(v)
-    elif k == '-t': threshold = int(v)
+    elif k == '-t': match_threshold = int(v)
   if not args: return usage()
   
   patfile = args.pop(0)
@@ -70,7 +77,7 @@
       pass
     elif line.startswith('('):
       (name, freq, pat) = eval(line)
-      if threshold <= freq:
+      if match_threshold <= freq:
         pats.append((re.compile(pat), name))
     else:
       print >>stderr, 'Invalid line: %r' % line
--- /dev/null	2007-02-07 02:09:32.000000000 -0500
+++ logweeder-0.1-pn/weed.conf	2007-02-25 00:47:44.000000000 -0500
@@ -0,0 +1,9 @@
+[logweeder]
+charskip:		16
+conffile:		weed.conf
+debug:			0
+nsamples:		1000
+threshold:		0.7
+match_threshold:	0
+pattern_file:		patterns
+
--- /dev/null	2007-02-07 02:09:32.000000000 -0500
+++ logweeder-0.1-pn/weedconf.py	2007-02-25 00:01:00.000000000 -0500
@@ -0,0 +1,17 @@
+def cf(cfile="weed.conf"):
+    
+    import ConfigParser
+    import os
+
+    defaults = {
+        "charskip":16,
+        "conf_file":"weed.conf",
+        "debug":0,
+        }
+    cp = ConfigParser.ConfigParser(defaults)
+    cp.read(os.path.expanduser(cfile))
+    return cp
+
+if __name__ == '__main__':
+    conf = weedconf()
+    print conf


More information about the nylug-workshop mailing list