Hadoop - how to order and limit after group by in Pig Latin without crashing the job
Many times I am interested in taking the top or bottom of a set (after an ORDER BY) that has been grouped on keys before ordering.
a = foreach data generate x, y, z;
b = distinct a;
c = group b by (x, y) parallel 11;
d = foreach c {
    ordered = order b by z desc;
    first_rec = limit ordered 1;
    generate flatten(first_rec) as (x, y, z);
};
store d into 'xyz' using PigStorage();
The FOREACH ... GENERATE above takes "forever" to finish and gets killed after 12 hours or so. The MapReduce job it spawns has 3 maps and 4 reducers; one reducer keeps processing for an entire day and then dies with error 6017, a file error.
Is there a way to solve this, or a better way of doing what I want?
What volume of data is involved? Are you sure the datanode(s) are big enough to handle that amount of data?
If so, then instead of ORDER, go with MAX. That way, only one tuple has to be kept in memory per group, and that is sufficient because the group key already contains the other needed information:
d = foreach c generate group, MAX(b.z);
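If the downstream consumer expects flat (x, y, z) records like the original script produced, the group key can be flattened back out alongside the maximum. A minimal sketch, assuming the aliases and the output path 'xyz' from the question (the `as` names are illustrative):

c = group b by (x, y) parallel 11;
d = foreach c generate flatten(group) as (x, y), MAX(b.z) as z;
store d into 'xyz' using PigStorage();

Since (x, y) is the grouping key, flattening `group` together with the per-group maximum of z reproduces exactly the tuple the nested ORDER/LIMIT was selecting, without sorting each bag.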