[nylug-talk] Send a set really fast
Yusuke Shinyama
yusuke at cs.nyu.edu
Fri May 4 18:51:22 EDT 2007
On Fri, 4 May 2007 10:53:19 -0500, Michael Bacarella <mbac at netgraft.com> wrote:
>
> I'm having a performance problem pulling aset and
> bset into the Python process on z. Bandwidth
> is not the bottleneck (if it were it would be an easy solution).
> It simply takes a significant amount of CPU time to take the
> string '1,2,3,...N' recv()'d from the socket and call
> split(',') on it. On our hosts it takes whole seconds with
> N > 1,000,000
>
> My inclination is to look for the Python equivalent of
> what I would do in C, write() and read() an array
> over the socket since all hosts run the same architecture.
>
> I thought (c)pickle/unpickle might be the equivalent but
> they aren't, they're approximately as slow as split.
>
> Is anything equivalent?
You may want to use a module called "array." It is an array for
only predefined types (int, long, double, etc.) and can be used
just as a normal array. The nice thing is that it takes much less space
and can be converted to a byte string very quickly.
>>> import array
>>> a = array.array('I', range(1000000))
>>> bytes = a.tostring()
>>> sock.send(bytes)
If you care the bandwith, you can even compress it:
>>> import zlib
>>> sock.send(zlib.compress(bytes))
Here is a decoder:
>>> import array
>>> bytes = sock.recv(n)
>>> a = array.array('I')
>>> a.fromstring(bytes)
Note that the returned byte string is in architecture-dependent order,
though this isn't a problem in your case.
Also, if these numbers are already sorted within array, taking its
intersection can be done more quickly than using set.intersect(),
by using binary searching. Here is the code I use:
def intersect(refs0):
refs = sorted(refs0, key=len)
ref0 = refs.pop(0)
poss = [ 0 for r in refs ]
def seek(k):
j = 0
for p0 in poss:
r = refs[j]
p1 = len(r)-1
# start=r[p0], end=r[p1]
while 1:
if p1 < p0: break
p = (p0+p1)/2
# p0 <= p <= p1
m = r[p] # m:median
if k == m:
# found
poss[j] = p+1
break
if k < m:
# k < m : [ m, ..., k ] : go left
p1 = p-1
else:
# m < k : [ k, ..., m ] : go right
poss[j] = p+1
p0 = p+1
if p1 < p0: return False
j += 1
return True
return [ k for k in ref0 if seek(k) ]
A=[0,1,2,3,4,5]
B=[1,2,3,5]
C=[1,3,4]
print intersect([A,B,C])
This is effective especially if there're multiple sets to
intersect and their number of elements is rather different to each
other. (e.g. set A, B and C have 1,000,000 elements, 10,000
elements, 100 elements respectively.)
Yusuke
More information about the nylug-talk
mailing list