[nylug-workshop] Log analyzer (Re: Regular meetings of the Python workshop)
Yusuke Shinyama
yusuke at cs.nyu.edu
Wed Feb 14 13:48:54 EST 2007
Okay, I just put this up. See
http://www.unixuser.org/~euske/python/logweeder/
I know the documentation is still crappy.
Please feel free to point me out any weird English errors! ;)
On Tue, 13 Feb 2007 15:41:50 -0800, "Peter C. Norton" <spacey-nylug-workshop at lenin.net> wrote:
> On Tue, Feb 13, 2007 at 02:57:28PM -0500, Yusuke Shinyama wrote:
> > The biggest problem for now is speed. Because we compare arbitrary
> > pairs of strings, it could take O(n^2) times where n is the number of
> > log entries to examine. Also, each comparison of two strings takes
> > len(s1)*len(s2) computation, so it can be quite slow. Of course
> > once regexp patterns are produced, comparing logs with regexps is
> > pretty fast. But I'm not sure yet how this is prectically useful.
> > Anyway, we will see.
>
> It seems like if a regular expression takes O(n) time to match, you
> should compare every string to existing re's first. That could save
> you time if you don't do that already and in logs and in web pages it
> should make a dent into the O(n^2) behavior by substantially reducing
> the length of n. I haven't looked at your code yet, so maybe you're
> talking about O(n^2) after this kind of trivial optimization, in which
> case... hopefully it's just O(n^2) time and not space, too.
The thing is that you want to get the most generalized regexp
patterns among the group, so you cannot craft any pattern until
figuring out all the similar strings, otherwise you will have to
compare multiple regexp patterns to discard a less general one,
which seems more difficult than comparing strings. Actually the
algorithm is not really O(n^2), because similarity computation and
clustering happen simultaneously and it stops immediately after
finding the "close enough" group rather than the "closest"
group. But still there're other O(n^2) parts in the code, so
improvement in speed is still appreciated... I feel this is more
like C-oriented task rather than Python.
Yusuke
More information about the nylug-workshop
mailing list