apache spark - PySpark vs native Python speed
I'm running Spark on an Ubuntu VM (4 GB RAM, 2 cores) and doing a simple word count test, comparing it against a plain Python dict() counter. I find PySpark about 5x slower (it takes considerably more time).
Is this because of initialisation overhead, or do I need to tune some parameter?
import sys, os
sys.path.append('/home/dirk/spark-1.4.1-bin-hadoop2.6/python')
os.environ['SPARK_HOME'] = '/home/dirk/spark-1.4.1-bin-hadoop2.6/'
os.environ['SPARK_WORKER_CORES'] = '2'
os.environ['SPARK_WORKER_MEMORY'] = '2g'
import time
import py4j
from pyspark import SparkContext, SparkConf
from operator import add

conf = SparkConf().setMaster('local').setAppName('app')
sc = SparkContext(conf=conf)

f = 'big3.txt'

# PySpark word count
s = time.time()
lines = sc.textFile(f)
counts = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)
output = counts.collect()
print len(output)
for (word, count) in output:
    if count > 100:
        print word, count
sc.stop()
print 'elapsed', time.time() - s

# plain Python dict() counter
s = time.time()
f1 = open(f, 'r')
freq = {}
for line in f1:
    words = line.split(' ')
    for w in words:
        if w not in freq:
            freq[w] = 0
        freq[w] += 1
f1.close()
print len(freq)
for (w, c) in freq.iteritems():
    if c > 100:
        print w, c
print 'elapsed', time.time() - s
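To see how much of the gap is just startup cost, a rough sketch like the one below (assuming the same big3.txt and Spark layout as above) times SparkContext creation separately from the word count itself. Note that local[2] uses both cores, whereas plain local as in the code above runs a single worker thread.

import time
from operator import add
from pyspark import SparkContext, SparkConf

t0 = time.time()
conf = SparkConf().setMaster('local[2]').setAppName('wordcount-timing')
sc = SparkContext(conf=conf)
print 'context startup', time.time() - t0   # JVM + py4j startup, paid once per application

t1 = time.time()
counts = (sc.textFile('big3.txt')
            .flatMap(lambda line: line.split(' '))
            .map(lambda w: (w, 1))
            .reduceByKey(add)
            .collect())
print 'job only', time.time() - t1          # the actual word count, without startup overhead
sc.stop()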
As mentioned in the comments, Spark isn't ideal for such short-lived jobs. You would be better off running it over hundreds of files, each at least a few megabytes in size. Even then you might saturate the HDD reading speed before the CPU gets to 100% utilisation; ideally you'd have at least a few computers with SSDs to see Spark's real advantage :)
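For example, sc.textFile accepts directories, wildcards and comma-separated paths, so a single job can read many files at once. A minimal sketch (the /data/texts/ directory is just a placeholder):

from operator import add
from pyspark import SparkContext, SparkConf

sc = SparkContext(conf=SparkConf().setMaster('local[2]').setAppName('many-files'))

# Each input file contributes at least one partition, so with hundreds of
# files the scheduler has enough parallel work to amortise the JVM startup
# and task-scheduling overhead.
lines = sc.textFile('/data/texts/*.txt')
counts = (lines.flatMap(lambda line: line.split(' '))
               .map(lambda w: (w, 1))
               .reduceByKey(add))
print counts.count()
sc.stop()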