Hadoop - how to order and limit after group by in Pig Latin without crashing the job
Many times I am interested in taking the top or bottom of a set (after an ORDER BY) that has been grouped on keys before ordering.
a = foreach data generate x, y, z;
b = distinct a;
c = group b by (x, y) parallel 11;
d = foreach c {
    ordered = order b by z desc;
    first_rec = limit ordered 1;
    generate flatten(first_rec) as (x, y, z);
};
store d into 'xyz' using PigStorage();
The FOREACH ... GENERATE above takes "forever" to finish and gets killed after 12 hours or so. The MapReduce job it spawns has 3 maps and 4 reducers; one reducer keeps processing for an entire day and then dies with error 6017, a file error.
Is there a way to solve this, or a better way of doing what I want?
What volume of data is involved? Are you sure the datanode(s) are big enough to handle that amount of data?
If so, then instead of ORDER, go with MAX. That way, only one tuple has to be kept in memory per group, and that is sufficient because the group key already contains the other needed information:
d = foreach c generate group, MAX(b.z);
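If the downstream consumer expects flat (x, y, z) records like the original script produced, the group key can be flattened back out alongside the maximum. A minimal sketch, assuming the aliases and the output path 'xyz' from the question (the `as` names are illustrative):

c = group b by (x, y) parallel 11;
d = foreach c generate flatten(group) as (x, y), MAX(b.z) as z;
store d into 'xyz' using PigStorage();

Since (x, y) is the grouping key, flattening `group` together with the per-group maximum of z reproduces exactly the tuple the nested ORDER/LIMIT was selecting, without sorting each bag.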