[nylug-workshop] [Reminder] Regular meetings of the Python workshop @ Tue Feb 13 18:00 - 20:00 (2 hrs)

Yusuke Shinyama yusuke at cs.nyu.edu
Tue Feb 13 14:57:28 EST 2007


On Tue, 13 Feb 2007 07:43:03 -0800, "Peter C. Norton" <spacey-nylug-workshop at lenin.net> wrote:
> On Mon, Feb 12, 2007 at 09:14:14PM -0500, Yusuke Shinyama wrote:
> >   '^mango rpc\.mountd:\ authenticated [a-zA-Z_]* request from [a-zA-Z_]*\.cs\.nyu\.edu:[0-9]* for .*/[a-zA-Z_]* \(.*/[a-zA-Z_]*\)'
> >   '^mango kernel: Packet log: input REJECT eth1 PROTO=[0-9]* [0-9]*\.[0-9]*\.[0-9]*\.[0-9]*:[0-9]* 128\.122\.140\.61:[0-9]* L=[0-9]* S=0x00 I=[0-9]* F=0x4000 T=.* \(#[0-9]*\)'
> >   '^mango rpc\.mountd: export request from 128\.122\.140\.70'
> 
> That's awesome! I'd love to know how you categorize similarities!

The idea is similar to diff. First calculate the edit distance
between every pair of strings, then group together the pairs
whose similarity exceeds a certain threshold. This is a
common technique called clustering. I was using it for analyzing
HTML pages for my research, but thought this could also be used
for automatically classifying logs.
(fyi: http://www.unixuser.org/~euske/python/webstemmer/ )
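
A minimal sketch of the idea in Python, not the actual code that
produced the patterns above. It assumes plain Levenshtein distance,
a similarity normalized by the longer string's length, and a 0.7
threshold; the function names and sample log lines are made up for
illustration. Pairs above the threshold are linked, and each
connected group becomes one cluster.

```python
from itertools import combinations


def edit_distance(s1, s2):
    """Levenshtein distance by dynamic programming: len(s1)*len(s2) steps."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                # delete c1
                           cur[j - 1] + 1,             # insert c2
                           prev[j - 1] + (c1 != c2)))  # substitute
        prev = cur
    return prev[-1]


def similarity(s1, s2):
    """Scale the distance to [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s1, s2) / longest


def cluster(lines, threshold=0.7):
    """Link every pair at or above the threshold; return connected groups."""
    parent = list(range(len(lines)))

    def find(i):
        # Follow parent links to the group's representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(lines)), 2):
        if similarity(lines[i], lines[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, line in enumerate(lines):
        groups.setdefault(find(i), []).append(line)
    return list(groups.values())


# Hypothetical log lines, for illustration only.
logs = [
    "mango rpc.mountd: export request from 128.122.140.70",
    "mango rpc.mountd: export request from 128.122.140.71",
    "mango kernel: Packet log: input REJECT eth1 PROTO=6",
]
clusters = cluster(logs)
```

The two mountd lines differ by one character and end up in the same
group, while the kernel line stays on its own.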

The biggest problem for now is speed. Because we compare every
pair of strings, it takes O(n^2) comparisons, where n is the number
of log entries to examine. Also, each comparison of two strings takes
on the order of len(s1)*len(s2) steps, so it can be quite slow. Of
course, once the regexp patterns are produced, matching logs against
them is pretty fast. But I'm not sure yet how useful this is in
practice. Anyway, we will see.
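
To make the cost difference concrete, here is a small sketch,
assuming the patterns have already been generated. The pattern is
one of those quoted earlier in this thread; the label
"mountd-export", the helper names, and the test lines are
hypothetical.

```python
import re

# Pattern taken from the thread; the label is made up for illustration.
PATTERNS = {
    "mountd-export": re.compile(
        r"^mango rpc\.mountd: export request from 128\.122\.140\.70"),
}


def classify(line):
    """Return the label of the first matching pattern, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.match(line):
            return name
    return None


def pair_count(n):
    """Number of string pairs the clustering step has to compare."""
    return n * (n - 1) // 2
```

Classifying a new line costs one regexp match per pattern, whereas
clustering 10,000 entries needs pair_count(10000) = 49,995,000
edit-distance computations.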

Yusuke

