apache spark - PySpark vs native Python speed


I am running Spark on an Ubuntu VM (4 GB RAM, 2 cores) and doing a simple word-count test, comparing it with a simple Python dict() counter. I find that PySpark is about 5x slower (it takes more time).

Is this because of initialisation, or do I need to tune a parameter?

    # Python 2 syntax (print statements, dict.iteritems), as in the original post.
    import sys, os

    sys.path.append('/home/dirk/spark-1.4.1-bin-hadoop2.6/python')
    os.environ['SPARK_HOME']='/home/dirk/spark-1.4.1-bin-hadoop2.6/'
    os.environ['SPARK_WORKER_CORES']='2'
    os.environ['SPARK_WORKER_MEMORY']='2g'

    import time
    import py4j
    from pyspark import SparkContext, SparkConf
    from operator import add

    conf=(SparkConf().setMaster('local').setAppName('app'))
    sc=SparkContext(conf=conf)

    f='big3.txt'

    # PySpark word count
    s=time.time()
    lines=sc.textFile(f)
    counts=lines.flatMap(lambda x:x.split(' ')).map(lambda x: (x,1)).reduceByKey(add)
    output=counts.collect()
    print len(output)
    for (word, count) in output:
        if count>100: print word, count
    sc.stop()
    print 'elapsed',time.time()-s

    # Plain Python dict() word count
    s=time.time()
    f1=open(f,'r')
    freq={}
    for line in f1:
        words=line.split(' ')
        for w in words:
            if w not in freq:
                freq[w]=0
            freq[w]+=1
    f1.close()
    print len(freq)
    for (w,c) in freq.iteritems():
        if c>100: print w,c
    print 'elapsed',time.time()-s

As mentioned in the comments, Spark isn't ideal for such short-lived jobs. You'd do better to run it on hundreds of files, each at least a few megabytes in size. Even then, it might saturate the HDD read speed before the CPU reaches 100% utilization; ideally you'd have at least a few computers with SSDs to see the real advantage of Spark :)
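To check how much of the gap is fixed overhead rather than the counting itself, one option is to time each phase separately. The following is only a sketch under assumptions, not a definitive benchmark: the helper names (timed, count_words_local, count_words_spark) and the app name are made up for illustration, pyspark is assumed to be importable, and the 'local[2]' master is an assumption that asks for two local threads (the plain 'local' in the question runs with one). It uses only calls that already appear in the question, plus collections.Counter for the baseline.

```python
import time
from collections import Counter
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Store the wall-clock seconds spent inside the with-block under `label`."""
    start = time.time()
    try:
        yield
    finally:
        timings[label] = time.time() - start


def count_words_local(path):
    """Plain-Python baseline using the same split(' ') as the question.

    Iterating over a file keeps the trailing newline, so the last word of
    each line carries '\n', exactly as in the question's dict() loop.
    """
    freq = Counter()
    with open(path, 'r') as handle:
        for line in handle:
            freq.update(line.split(' '))
    return freq


def count_words_spark(path, master='local[2]'):
    """Time Spark startup, the job, and shutdown as separate phases.

    Hypothetical helper: assumes pyspark is importable; `master` and the
    app name are illustrative choices, not values from the question.
    """
    from operator import add
    from pyspark import SparkConf, SparkContext  # imported lazily on purpose

    timings = {}
    with timed('startup', timings):
        conf = SparkConf().setMaster(master).setAppName('wordcount-timing')
        sc = SparkContext(conf=conf)
    try:
        with timed('job', timings):
            counts = (sc.textFile(path)
                        .flatMap(lambda line: line.split(' '))
                        .map(lambda word: (word, 1))
                        .reduceByKey(add)
                        .collect())
    finally:
        with timed('shutdown', timings):
            sc.stop()
    return dict(counts), timings
```

Calling count_words_spark('big3.txt') returns the counts together with a dict holding the 'startup', 'job' and 'shutdown' durations, which can be compared with a timed run of count_words_local('big3.txt'). Note that the question's timer starts after the SparkContext already exists, so the 5x it reports covers the job and sc.stop() but not context creation.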

