scala - Spark - LinearRegressionWithSGD on Coursera Machine Learning by Stanford University samples

Software version: Apache Spark v1.3

Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I work on ex1data1.txt and ex1data2.txt from the practical part ex1 of the Coursera course. I've already made such a translation to Julia (it went smoothly), and I've been struggling with Spark... without success.

Problem: The performance of my implementation on Spark is poor, and I cannot get it to work correctly. That's why, for ex1data1.txt, I added a polynomial feature, and I also tried two ways of handling theta0: using setIntercept(true), and using a non-normalized column of 1.0 values (in that case I set the intercept to false). I receive only silly results. So I've decided to start working with ex1data2.txt. Below you can find the code and the expected result. Of course, the Spark result is wrong.

Did you have a similar experience? I would be grateful for your help.

The Scala code for ex1data2.txt:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.optimization.SquaredL2Updater
    import org.apache.spark.mllib.regression.{LinearRegressionModel, LinearRegressionWithSGD, LabeledPoint}
    import org.apache.spark.{SparkContext, SparkConf}

    object mlibonex1data2 extends App {
      val conf = new SparkConf()
      conf.set("spark.app.name", "coursera ex1data2.txt test")

      val sc = new SparkContext(conf)
      val input = sc.textFile("hdfs:///ex1data2.txt")

      // each line is "size,bedrooms,price"
      val traindata = input.map { line =>
        val parts = line.split(',')
        val y = parts(2).toDouble
        val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
        println(s"x = $features y = $y")
        LabeledPoint(y, features)
      }.cache()

      // building model
      val numiterations = 1500
      val alpha = 0.01

      // scale features
      val scaler = new StandardScaler(withMean = true, withStd = true)
        .fit(traindata.map(x => x.features))
      val scaledtraindata = traindata.map { td =>
        val normfeatures = scaler.transform(td.features)
        println(s"normalized features = $normfeatures")
        LabeledPoint(td.label, normfeatures)
      }.cache()

      val tsize = scaledtraindata.count()
      println(s"training set size $tsize")

      val alg = new LinearRegressionWithSGD().setIntercept(true)
      alg.optimizer
        .setNumIterations(numiterations)
        .setStepSize(alpha)
        .setUpdater(new SquaredL2Updater)
        .setRegParam(0.0) // regularization - off

      val model = alg.run(scaledtraindata)
      // interpolates `model` only, then appends the literal ".weights";
      // ${model.weights} would print just the weights
      println(s"theta $model.weights")

      val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
      println(s"estimate price of 1650 sq-ft, 3 br house = $total1 dollars") // it should give ~ $289314.620338

      // evaluate model on training examples and compute training error
      val valuesandpreds = scaledtraindata.map { point =>
        val prediction = model.predict(point.features)
        (point.label, prediction)
      }
      val mse = valuesandpreds.map { case (v, p) => math.pow(v - p, 2) }.mean() / 2
      println("training mean squared error = " + mse)

      // save and load model
      val trysaveandload = util.Try(model.save(sc, "mymodelpath"))
        .flatMap { _ => util.Try(LinearRegressionModel.load(sc, "mymodelpath")) }
        .getOrElse(-1)

      println(s"trysaveandload result $trysaveandload")
    }

The stdout result is:

    training set size 47
    theta (weights=[52090.291641275864,19342.034885388926], intercept=181295.93717032953).weights
    estimate price of 1650 sq-ft, 3 br house = 153983.5541846754 dollars
    training mean squared error = 1.5876093757127676E10
    trysaveandload result -1
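To put the error figure on a familiar scale: the code divides the mean squared error by 2, which matches the cost function J(θ) of the Coursera exercise, so the printed value is J rather than a plain MSE. A rough conversion to an RMSE, assuming the printed value is taken at face value and using the exercise's usual notation (m training examples, hypothesis h_θ, neither of which is defined in the code itself), would be:

```latex
J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
\quad\Longrightarrow\quad
\mathrm{RMSE} = \sqrt{2\,J(\theta)}
\approx \sqrt{2 \times 1.5876 \times 10^{10}}
\approx 1.78 \times 10^{5}
```

With m = 47 this corresponds to a typical training error of roughly 178,000 dollars per house.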

Well, after some digging I believe there is nothing wrong here. First I saved the content of valuesandpreds to a text file:

    valuesandpreds.map {
      case (x, y) => s"$x,$y"
    }.repartition(1).saveAsTextFile("results.txt")

The rest of the code is written in R.

First, let's create a model using the closed-form solution:

    # load data
    df <- read.csv('results.txt/ex1data2.txt', header=FALSE)
    # scale features
    df[, 1:2] <- apply(df[, 1:2], 2, scale)
    # build linear model
    model <- lm(V3 ~ ., df)

For reference:

    > summary(model)

    Call:
    lm(formula = V3 ~ ., data = df)

    Residuals:
        Min      1Q  Median      3Q     Max
    -130582  -43636  -10829   43698  198147

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   340413       9637  35.323  < 2e-16 ***
    V1            110631      11758   9.409 4.22e-12 ***
    V2             -6650      11758  -0.566    0.575
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 66070 on 44 degrees of freedom
    Multiple R-squared:  0.7329,    Adjusted R-squared:  0.7208
    F-statistic: 60.38 on 2 and 44 DF,  p-value: 2.428e-13

Now the prediction:

    closedformprediction <- predict(model, df)
    closedformrmse <- sqrt(mean((closedformprediction - df$V3)**2))
    plot(
        closedformprediction, df$V3,
        ylab="actual", xlab="predicted",
        main=paste("closed form, rmse: ", round(closedformrmse, 3)))

[Figure: actual vs. predicted values, closed-form solution]
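For readers who want the closed-form fit without R, the same least-squares problem can be sketched in plain Scala through the normal equations (XᵀX)θ = Xᵀy. This is only an illustration under my own assumptions: `NormalEquationSketch`, `solve` and `fit` are hypothetical names, the data in the usage note is made up, and as far as I know R's lm relies on a QR decomposition rather than forming XᵀX explicitly.

```scala
object NormalEquationSketch {
  // Solve A * x = b by Gaussian elimination with partial pivoting.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    // Work on an augmented copy [A | b] so the inputs stay untouched.
    val aug = Array.tabulate(n)(i => a(i).clone() :+ b(i))
    for (col <- 0 until n) {
      // Move the row with the largest absolute entry in this column to the pivot position.
      val pivot = (col until n).maxBy(r => math.abs(aug(r)(col)))
      val tmp = aug(col); aug(col) = aug(pivot); aug(pivot) = tmp
      for (row <- col + 1 until n) {
        val factor = aug(row)(col) / aug(col)(col)
        for (k <- col to n) aug(row)(k) -= factor * aug(col)(k)
      }
    }
    // Back substitution on the upper-triangular system.
    val x = new Array[Double](n)
    for (row <- n - 1 to 0 by -1) {
      val known = (row + 1 until n).map(k => aug(row)(k) * x(k)).sum
      x(row) = (aug(row)(n) - known) / aug(row)(row)
    }
    x
  }

  // Least squares with an intercept: prepend a column of ones, then solve
  // (X^T X) theta = X^T y. Returns Array(intercept, coefficient1, ...).
  def fit(features: Array[Array[Double]], y: Array[Double]): Array[Double] = {
    val x = features.map(row => 1.0 +: row)
    val d = x.head.length
    val xtx = Array.tabulate(d, d)((i, j) => x.map(r => r(i) * r(j)).sum)
    val xty = Array.tabulate(d)(i => x.zip(y).map { case (r, yi) => r(i) * yi }.sum)
    solve(xtx, xty)
  }
}
```

For example, `NormalEquationSketch.fit(Array(Array(1.0, 2.0), Array(2.0, 1.0), Array(3.0, 5.0), Array(4.0, 3.0)), Array(1.0, 6.0, -4.0, 4.0))` recovers an intercept of 5 and coefficients 2 and -3 up to rounding, because that made-up data follows y = 5 + 2a - 3b exactly.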

Now we can compare the above with the SGD results:

    sgd <- read.csv('results.txt/part-00000', header=FALSE)
    # square the residuals before averaging
    sgdrmse <- sqrt(mean((sgd$V2 - sgd$V1)**2))

    plot(
        sgd$V2, sgd$V1, ylab="actual",
        xlab="predicted", main=paste("sgd, rmse: ", round(sgdrmse, 3)))

[Figure: actual vs. predicted values, SGD]
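Since both plots report an RMSE, here is a minimal Scala sketch of that metric, so it could also be computed on the Spark side from collected (label, prediction) pairs. `RmseSketch` and `rmse` are hypothetical names and not part of MLlib.

```scala
object RmseSketch {
  // Root mean squared error: square each residual, average, then take the root.
  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.nonEmpty && actual.length == predicted.length,
      "inputs must be non-empty and of equal length")
    val squared = actual.zip(predicted).map { case (a, p) => math.pow(a - p, 2) }
    math.sqrt(squared.sum / squared.length)
  }
}
```

Note that the residuals are squared before they are averaged; averaging first would let positive and negative errors cancel out and understate the error.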

Finally, let's compare both:

    plot(
        sgd$V2, closedformprediction,
        xlab="sgd", ylab="closed form", main="sgd vs closed form")

[Figure: SGD predictions vs. closed-form predictions]

So the result is not perfect, but nothing seems to be off here.
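One possible explanation for the remaining gap is the step-size schedule. As far as I understand MLlib's updaters in this version, the step used at iteration t is stepSize / sqrt(t), whereas the Octave exercise keeps alpha constant, so 1500 iterations at 0.01 cover much less ground. The sketch below is plain Scala, not Spark code: `StepSizeSketch`, `fit` and `stepAt` are hypothetical names, the three-point data set is made up, and the decay rule is my assumption about the library that should be checked against its source.

```scala
object StepSizeSketch {
  // Batch gradient descent for y ≈ theta0 + theta1 * x under the cost
  // J = 1/(2m) * sum((theta0 + theta1 * x_i - y_i)^2).
  // stepAt(t) returns the step size used at iteration t (1-based).
  def fit(xs: Array[Double], ys: Array[Double], iterations: Int,
          stepAt: Int => Double): (Double, Double) = {
    val m = xs.length.toDouble
    var theta0 = 0.0
    var theta1 = 0.0
    for (t <- 1 to iterations) {
      val errors = xs.zip(ys).map { case (x, y) => theta0 + theta1 * x - y }
      val grad0 = errors.sum / m
      val grad1 = errors.zip(xs).map { case (e, x) => e * x }.sum / m
      val step = stepAt(t)
      // Simultaneous update: both gradients were computed from the old thetas.
      theta0 -= step * grad0
      theta1 -= step * grad1
    }
    (theta0, theta1)
  }

  def main(args: Array[String]): Unit = {
    // Made-up, already standardized feature; true model is y = 3 + 2x.
    val xs = Array(-1.0, 0.0, 1.0)
    val ys = xs.map(x => 3.0 + 2.0 * x)
    val alpha = 0.01
    val constant = fit(xs, ys, 1500, _ => alpha)
    // Assumed MLlib-style schedule: alpha / sqrt(t).
    val decaying = fit(xs, ys, 1500, t => alpha / math.sqrt(t))
    println(s"constant step: $constant")
    println(s"decaying step: $decaying")
  }
}
```

On this toy data the constant-step run ends very close to the true intercept of 3, while the decaying-step run is still at only a bit more than half of it. The SGD intercept reported in the question (about 181296) is likewise roughly 53% of the closed-form intercept (340413), which is at least consistent with this explanation. Increasing the step size or the number of iterations would be the obvious thing to try.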


