algorithm - Reservoir Sampling on large Streams -
i trying implement reservoir sampling algorithm using java. have n streams of data ( readings sensors arriving @ sink node ) of unknown size. sake of simplicity lets assume have 1 stream of unknown size.
so 1 of reservoir sampling algorithms suggests create reservoir of size reservoirsize. lets 5. first 5 readings get, store them in reservoir. ok. more , more readings, each reading generate random number 0 reading number , if random number smaller reservoirsize store reading in reservoir[randomnumber].
so lets have reservoirsize = 5 , got 10th reading. generate random number 0 10 , if number smaller 5 wll store reading there random number points. lets random number 3 store reading number 10 in reservoir[3].
public void sample (vector pool, double measurement, int streamindex) { if (streamindex < reservoirsize){ pool.addelement(double.tostring(measurement)); } else if ((randomindex=(int)rannum.nextint((streamindex+1)))<reservoirsize) { pool.setelementat(double.tostring(measurement), randomindex); } }
the problem code once streamindex gets big enough ( above 4.000 example ) sample readings. , make sense because propability of generating random number 0 4000 smaller 5 significanly smaller propability generate random number 0 lets 100, smaller 5.
i implemented algorthmr vitters paper , way described here:
gregable reservoirsampling
but implementations have same problem. larger stream gets smaller sampling frequency becomes. sampling rate of 0.5s, 1 hour after start sampling (which means 7000 readings have been forwarded sink node ), change in measured quantity not detected half hour i.e reading indicating change discarded reservoir.
algorthmr implemantation
public rsalgorithmr() { this.currentpool = null; this.randomstoreatindex = 0; this.randomindex = 0; this.rannum = new random(); } public void sample (llnode cnode, double measurement) { int streamindex = cnode.getstreamindex(); int storeatindex =cnode.getstoreatindex(); if (streamindex < reservoirsize) { cnode.data.addelement(double.tostring(measurement)); if (streamindex == ( reservoirsize - 1) ) { randomstoreatindex = (int)rannum.nextint(reservoirsize); cnode.setstoreatindex((int)randomstoreatindex); } } else { if (storeatindex == streamindex) { randomindex=(int)rannum.nextint(reservoirsize); cnode.data.setelementat(double.tostring(measurement), randomindex); randomstoreatindex = (int)rannum.nextint(streamindex - reservoirsize) + reservoirsize; cnode.setstoreatindex(randomstoreatindex); system.out.println("index:: "+streamindex); system.out.println("randomindex:: " + randomindex); } } cnode.setstreamindex(); };
gregable implementation
public reservoirsampler() { this.currentpool = null; this.randomindex = 0; this.ranprop = new random(); this.ranind = new random(); } public void sample (llnode currentspot, double humidityread, double temperatureread, int streamindex) { double acceptancepropability = (double)reservoirsize/streamindex; if (streamindex < reservoirsize){ currentspot.humiditydata.addelement(double.tostring(humidityread)); currentspot.temperaturedata.addelement(double.tostring(temperatureread)); } else { ranprop.setseed(system.currenttimemillis()); randompropability=(double)ranprop.nextdouble(); if ( randompropability < acceptancepropability){ ranind.setseed(system.currenttimemillis()); randomindex=(int)ranind.nextint((reservoirsize)); currentspot.humiditydata.setelementat(double.tostring(humidityread),randomindex); currentspot.temperaturedata.setelementat(double.tostring(temperatureread),randomindex); } } }
is normal behaviour of algorthm or missing here? , if normal behaviour there way make work more "accuratelly"?
this normal behavior of algorithm r (see knuth's "the art of computer programming" 3.4.2)
however, better algorithms available:
- algorithms x,y,z: see "random sampling reservoir" [jeferey scott vitter, 1985]
- algorithms k,l,m: see "reservoir-sampling algorithms of time complexity o(n(1+log(n)-log(n)))" [kim-hung li ,1994]
in contrast algorithm r, these algorithms draw number of stream elements skip @ each stage, less random numbers generated, long streams.
re "accuracy": in algorithms (r,x,y,z,k,l,m) each element in input stream equally in sample. can proven mathematically , demonstrated empirically running same algorithm on same input stream large number of times , measuring frequency each element sampled (you'll have use prng, e.g. mersenne twister). major difference between algorithms amount of random numbers generated.
all algorithms relatively simple implement , test. algorithm l, though not efficient one, compact , straightforward implement, , still more efficient algorithm r.
Comments
Post a Comment