Scala / Spark: LinearRegressionWithSGD on Coursera Machine Learning (Stanford University) samples
Software version: Apache Spark v1.3
Context: I've been trying to "translate" Octave/MATLAB code to Scala on Apache Spark. More precisely, I'm working on ex1data1.txt and ex1data2.txt from the practical part of Coursera's ex1. I've already made such a translation to Julia (it went smoothly), but I've been struggling with Spark... without success.
Problem: the performance of my implementation on Spark is poor; I cannot make it work correctly. For ex1data1.txt I added a polynomial feature and tried both approaches: fitting theta0 via setIntercept(true), and adding a non-normalized column of 1.0 values (in that case setting the intercept to false). Either way I receive silly results. So I've decided to start working with ex1data2.txt instead. Below you can find the code and the expected result. Of course the Spark result is wrong.
Did anyone have a similar experience? I would be grateful for any help.
The Scala code for ex1data2.txt:
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.SquaredL2Updater
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.{SparkConf, SparkContext}

object MLlibOnEx1Data2 extends App {

  val conf = new SparkConf()
  conf.set("spark.app.name", "Coursera ex1data2.txt test")

  val sc = new SparkContext(conf)
  val input = sc.textFile("hdfs:///ex1data2.txt")

  val trainData = input.map { line =>
    val parts = line.split(',')
    val y = parts(2).toDouble
    val features = Vectors.dense(parts(0).toDouble, parts(1).toDouble)
    println(s"x = $features y = $y")
    LabeledPoint(y, features)
  }.cache()

  // Building the model
  val numIterations = 1500
  val alpha = 0.01

  // Scale the features
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(trainData.map(x => x.features))

  val scaledTrainData = trainData.map { td =>
    val normFeatures = scaler.transform(td.features)
    println(s"normalized features = $normFeatures")
    LabeledPoint(td.label, normFeatures)
  }.cache()

  val tSize = scaledTrainData.count()
  println(s"Training set size $tSize")

  val alg = new LinearRegressionWithSGD().setIntercept(true)
  alg.optimizer
    .setNumIterations(numIterations)
    .setStepSize(alpha)
    .setUpdater(new SquaredL2Updater)
    .setRegParam(0.0) // regularization - off

  val model = alg.run(scaledTrainData)

  println(s"Theta: weights = ${model.weights}, intercept = ${model.intercept}")

  val total1 = model.predict(scaler.transform(Vectors.dense(1650, 3)))
  println(s"Estimated price of a 1650 sq-ft, 3 br house = $total1 dollars")
  // it should give ~ $289314.620338

  // Evaluate the model on training examples and compute the training error
  val valuesAndPreds = scaledTrainData.map { point =>
    val prediction = model.predict(point.features)
    (point.label, prediction)
  }

  val mse = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean() / 2
  println("Training mean squared error = " + mse)

  // Save and load the model
  val trySaveAndLoad = util.Try(model.save(sc, "myModelPath"))
    .flatMap(_ => util.Try(LinearRegressionModel.load(sc, "myModelPath")))
    .getOrElse(-1)

  println(s"trySaveAndLoad result is $trySaveAndLoad")
}
The stdout result is:
Training set size 47
Theta: weights = [52090.291641275864,19342.034885388926], intercept = 181295.93717032953
Estimated price of a 1650 sq-ft, 3 br house = 153983.5541846754 dollars
Training mean squared error = 1.5876093757127676E10
trySaveAndLoad result is -1
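One thing worth checking here: the Octave exercise uses plain batch gradient descent with a fixed learning rate, while MLlib's SGD in this version decays the step size over iterations (roughly stepSize / sqrt(iteration), as far as I recall), so the same alpha and iteration count do not behave the same way. Below is a minimal, Spark-free sketch of the batch version on a tiny synthetic data set (the numbers are made up, not ex1data2.txt; with already-standardized x the loop converges to the generating coefficients theta0 = 3, theta1 = 2):

```scala
// Hedged sketch: plain batch gradient descent for y = theta0 + theta1 * x,
// as in the Octave exercise. The data below is hypothetical.
def batchGD(xs: Array[Double], ys: Array[Double],
            alpha: Double, iters: Int): (Double, Double) = {
  var t0 = 0.0 // intercept
  var t1 = 0.0 // slope
  val m = xs.length
  for (_ <- 1 to iters) {
    // prediction error for every example
    val errs = xs.zip(ys).map { case (x, y) => (t0 + t1 * x) - y }
    // simultaneous update of both parameters
    val g0 = errs.sum / m
    val g1 = errs.zip(xs).map { case (e, x) => e * x }.sum / m
    t0 -= alpha * g0
    t1 -= alpha * g1
  }
  (t0, t1)
}

val xs = Array(-1.5, -0.5, 0.0, 0.5, 1.5) // already centered/scaled
val ys = xs.map(x => 3.0 + 2.0 * x)       // true theta0 = 3, theta1 = 2
val (theta0, theta1) = batchGD(xs, ys, alpha = 0.01, iters = 5000)
println(f"theta0 = $theta0%.4f, theta1 = $theta1%.4f")
// prints theta0 = 3.0000, theta1 = 2.0000
```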
Well, after some digging I believe there is nothing wrong here. First I saved the content of valuesAndPreds to a text file:
valuesAndPreds.map { case (x, y) => s"$x,$y" }
  .repartition(1)
  .saveAsTextFile("results.txt")
The rest of the code is written in R.
First, let's create a model using the closed-form solution:
# Load data
df <- read.csv('results.txt/ex1data2.txt', header = FALSE)

# Scale features
df[, 1:2] <- apply(df[, 1:2], 2, scale)

# Build linear model
model <- lm(V3 ~ ., df)
For reference:

> summary(model)

Call:
lm(formula = V3 ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max
-130582  -43636  -10829   43698  198147

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   340413       9637  35.323  < 2e-16 ***
V1            110631      11758   9.409 4.22e-12 ***
V2             -6650      11758  -0.566    0.575
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 66070 on 44 degrees of freedom
Multiple R-squared:  0.7329, Adjusted R-squared:  0.7208
F-statistic: 60.38 on 2 and 44 DF,  p-value: 2.428e-13
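What lm() computes here is the ordinary least-squares fit in closed form. For a single feature this reduces to slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x); a minimal Scala sketch of that reduction, on tiny hypothetical numbers (not the housing data):

```scala
// Hedged illustration of the one-feature closed-form (normal equation)
// least-squares fit; the sample values are made up so that y = 1 + 2x.
def closedForm(xs: Array[Double], ys: Array[Double]): (Double, Double) = {
  val mx = xs.sum / xs.length
  val my = ys.sum / ys.length
  // slope = sample covariance over sample variance
  val slope = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum /
              xs.map(x => (x - mx) * (x - mx)).sum
  (my - slope * mx, slope) // (intercept, slope)
}

val xs = Array(1.0, 2.0, 3.0, 4.0)
val ys = Array(3.0, 5.0, 7.0, 9.0)
val (b, w) = closedForm(xs, ys)
println(s"intercept = $b, slope = $w")
// prints intercept = 1.0, slope = 2.0
```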
Now the prediction:
closedFormPrediction <- predict(model, df)
closedFormRMSE <- sqrt(mean((closedFormPrediction - df$V3)**2))
plot(
  closedFormPrediction, df$V3,
  ylab = "Actual", xlab = "Predicted",
  main = paste("Closed form, RMSE: ", round(closedFormRMSE, 3)))
Now we can compare the above to the SGD results:

sgd <- read.csv('results.txt/part-00000', header = FALSE)
sgdRMSE <- sqrt(mean((sgd$V2 - sgd$V1)**2))
plot(
  sgd$V2, sgd$V1,
  ylab = "Actual", xlab = "Predicted",
  main = paste("SGD, RMSE: ", round(sgdRMSE, 3)))
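The RMSE both R snippets report is simply the square root of the mean squared difference between predictions and actuals. For clarity, the same metric as a small Scala function (the sample numbers are made up):

```scala
// Root mean squared error between predictions and actual values.
def rmse(pred: Seq[Double], actual: Seq[Double]): Double = {
  require(pred.length == actual.length, "length mismatch")
  math.sqrt(pred.zip(actual).map { case (p, a) => (p - a) * (p - a) }.sum / pred.length)
}

// One prediction is off by 2, the others are exact: sqrt(4 / 3)
val r = rmse(Seq(1.0, 2.0, 3.0), Seq(1.0, 2.0, 5.0))
println(r)
```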
Finally, let's compare both:
plot(
  sgd$V2, closedFormPrediction,
  xlab = "SGD", ylab = "Closed form",
  main = "SGD vs closed form")
So, the result is not perfect, but nothing seems completely off here.