Advanced Analytics from Spark #3 Collaborative Filtering #4

I rushed through getting collaborative filtering with ALS to run, so let's now take a closer look at the Scala code behind it.
ーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーーー
The hardest part is everything from the AUC onward, so to begin with let's review ROC curves and the AUC at the site linked below.
http://www.randpy.tokyo/entry/roc_auc

With that in mind, let's check the areaUnderCurve() function.

areaUnderCurve(
     |       positiveData: RDD[Rating],
     |       bAllItemIDs: Broadcast[Array[Int]],
     |       predictFunction: (RDD[(Int,Int)] => RDD[Rating]))

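So this AUC function takes three arguments: positiveData of type RDD[Rating], bAllItemIDs of type Broadcast[Array[Int]], and, as the third argument, the function predictFunction().
predictFunction() takes an RDD[(Int, Int)] as its argument and returns an RDD[Rating].
When the AUC is computed, the function is used as follows: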

scala> val auc = areaUnderCurve(cvData, bAllItemIDs, model.predict)
auc: Double = 0.9635602721348083
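
Passing model.predict as the third argument works because Scala converts a method into a function value when a function type is expected. Here is a minimal sketch of that mechanism, using plain Seq instead of RDD so that it runs without Spark. Rating is a local stand-in for Spark's class, and ToyModel and scoreAll are hypothetical names that appear neither in the book nor in Spark.

```scala
// Local stand-in for Spark's Rating(user, product, rating); not the real class.
final case class Rating(user: Int, product: Int, rating: Double)

// Hypothetical model whose predict method has the shape Seq[(Int, Int)] => Seq[Rating].
class ToyModel(scores: Map[(Int, Int), Double]) {
  def predict(userProducts: Seq[(Int, Int)]): Seq[Rating] =
    userProducts.map { case (user, product) =>
      // Unknown (user, product) pairs get a score of 0.0 in this toy model.
      Rating(user, product, scores.getOrElse((user, product), 0.0))
    }
}

// Receives the prediction function as a parameter, in the same way that
// areaUnderCurve receives predictFunction.
def scoreAll(
    pairs: Seq[(Int, Int)],
    predictFunction: Seq[(Int, Int)] => Seq[Rating]): Seq[Double] =
  predictFunction(pairs).map(_.rating)

val model = new ToyModel(Map((1, 10) -> 0.9, (1, 20) -> 0.2))

// model.predict is a method; it is turned into a function value here.
val predicted = scoreAll(Seq((1, 10), (1, 20), (1, 30)), model.predict)
println(predicted)
```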

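The return value, the AUC, is a Double floating-point number; inside areaUnderCurve() it is produced by positivePredictions.join(negativePredictions).values.map {……}.mean().
cvData is the cross-validation (CV) set, in the form of the artists that are good (positive) for each user. The third argument is the predict() method of MatrixFactorizationModel. predict() turns (user, artist) pairs into predictions of type Rating, each holding a user, an artist, and a recommendation value.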
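
To get a feel for what a per-user map followed by mean() computes, here is a minimal plain-Scala sketch of a mean per-user AUC. It assumes the AUC for one user is the fraction of (positive, negative) score pairs in which the positive item gets the higher score. userAuc, perUser and the scores are made up for illustration; this is not the body of areaUnderCurve().

```scala
// Hypothetical helper: AUC for one user, computed as the fraction of
// (positive, negative) pairs in which the positive item outranks the negative one.
def userAuc(positiveScores: Seq[Double], negativeScores: Seq[Double]): Double = {
  require(positiveScores.nonEmpty && negativeScores.nonEmpty, "need both kinds of scores")
  var correct = 0L
  var total = 0L
  for (p <- positiveScores; n <- negativeScores) {
    if (p > n) correct += 1
    total += 1
  }
  correct.toDouble / total
}

// Made-up predicted scores per user: (scores of positives, scores of negatives).
val perUser: Map[Int, (Seq[Double], Seq[Double])] = Map(
  1 -> (Seq(0.9, 0.8), Seq(0.1, 0.85)),
  2 -> (Seq(0.7), Seq(0.2, 0.3))
)

// Average the per-user AUCs, which is the role .mean() plays on the RDD.
val meanAuc = perUser.values.map { case (pos, neg) => userAuc(pos, neg) }.sum / perUser.size
println(meanAuc) // 0.875: user 1 gets 3 of 4 pairs right, user 2 gets 2 of 2
```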
Incidentally, here is a peek at cvData:

scala> cvData.first
res36: org.apache.spark.mllib.recommendation.Rating = Rating(1000002,1000028,17.0)

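cvData is split off from allData, which is obtained by passing rawUserArtistData (the user / artist / play-count data) and bArtistAlias (the broadcast version of ArtistAlias, the map from artist IDs to their correct IDs) to the buildRatings() function, which converts them into Rating objects.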

scala> val allData = buildRatings(rawUserArtistData, bArtistAlias)

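To build and evaluate the ALS model, the input data has to be split into a training set and a CV set used for model evaluation.
Here, the code below uses 90% of the data for training and 10% for cross-validation.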

scala> val Array(trainData, cvData) = allData.randomSplit(Array(0.9, 0.1))
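
Note that a weighted random split of this kind does not produce exactly 90% and 10%: the weights act as probabilities, so the sizes come out only approximately in that ratio. Here is a minimal plain-Scala sketch of the idea, assuming each element is assigned independently. splitByWeight is a hypothetical helper, not Spark's implementation of randomSplit.

```scala
import scala.util.Random

// Hypothetical helper (not Spark's code): each element independently goes to
// the first group with probability trainWeight, otherwise to the second group.
def splitByWeight[T](data: Seq[T], trainWeight: Double, seed: Long): (Seq[T], Seq[T]) = {
  val rng = new Random(seed)
  // Draw one random flag per element, then separate on that flag.
  val tagged = data.map(x => (x, rng.nextDouble() < trainWeight))
  val (train, cv) = tagged.partition(_._2)
  (train.map(_._1), cv.map(_._1))
}

val allData = (1 to 1000).toVector
val (trainData, cvData) = splitByWeight(allData, trainWeight = 0.9, seed = 42L)

// The two sizes always add up to 1000, but are only roughly 900 and 100.
println(s"train=${trainData.size}, cv=${cvData.size}")
```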

————————————————————–
I think the rest comes down to understanding ALS in the recommendation package, so let's consult the Spark API involved in the code below.

scala> val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0)
model: org.apache.spark.mllib.recommendation.MatrixFactorizationModel = org.apache.spark.mllib.recommendation.MatrixFactorizationModel@25b61308

Spark 1.4.0 API:
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/mllib/recommendation/package-summary.html
The Class Summary of Package org.apache.spark.mllib.recommendation on that page reads:

Package org.apache.spark.mllib.recommendation
Class Summary
Class                     Description
ALS
MatrixFactorizationModel  Model representing the result of matrix factorization.
Rating                    A more compact class to represent a rating than Tuple3[Int, Int, Double].
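
As a reminder of what "the result of matrix factorization" means in practice: the model holds one feature vector per user and one per product, and a predicted score is their dot product. The sketch below shows this in plain Scala. The rank-3 vectors and the name predictScore are made up for illustration; a real model learns its vectors during training.

```scala
// Hypothetical rank-3 feature vectors, keyed by user ID and product ID.
val userFeatures: Map[Int, Array[Double]] = Map(
  1 -> Array(1.0, 0.5, 0.0)
)
val productFeatures: Map[Int, Array[Double]] = Map(
  10 -> Array(2.0, 0.0, 1.0),
  20 -> Array(0.0, 1.0, 1.0)
)

// Predicted score = dot product of the user's vector and the product's vector.
def predictScore(user: Int, product: Int): Double =
  userFeatures(user)
    .zip(productFeatures(product))
    .map { case (u, p) => u * p }
    .sum

println(predictScore(1, 10)) // 1.0*2.0 + 0.5*0.0 + 0.0*1.0 = 2.0
println(predictScore(1, 20)) // 1.0*0.0 + 0.5*1.0 + 0.0*1.0 = 0.5
```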

The trainImplicit() function in val model = ALS.trainImplicit(trainData, 10, 5, 0.01, 1.0) is a static method of the ALS class and returns an instance of MatrixFactorizationModel.

static MatrixFactorizationModel trainImplicit (RDD<Rating> ratings, int rank, int iterations, double lambda, double alpha)
Train a matrix factorization model given an RDD of 'implicit preferences' given by users to some products, in the form of (userID, productID, preference) pairs.
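
Reading the call against this signature, the positional values 10, 5, 0.01 and 1.0 are rank, iterations, lambda and alpha. The sketch below illustrates one common reading of alpha: the implicit-feedback formulation of Hu, Koren and Volinsky, in which a play count becomes a binary preference plus a confidence of 1 + alpha * count. ImplicitSignal and toImplicitSignal are made-up names, and this is an assumption about what the parameter means, not Spark's actual code.

```scala
// Hypothetical container for the two quantities derived from an implicit rating.
final case class ImplicitSignal(preference: Double, confidence: Double)

// Assumed formulation: preference is 1 if there was any play at all, and
// confidence grows linearly with the play count, scaled by alpha.
def toImplicitSignal(playCount: Double, alpha: Double): ImplicitSignal =
  ImplicitSignal(
    preference = if (playCount > 0) 1.0 else 0.0,
    confidence = 1.0 + alpha * playCount)

// Play count 17.0, as in Rating(1000002,1000028,17.0), with alpha = 1.0.
val signal = toImplicitSignal(playCount = 17.0, alpha = 1.0)
println(signal) // ImplicitSignal(1.0,18.0)
```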

In the following statement, the recommendProducts method is called on model, an instance of MatrixFactorizationModel, and returns five recommended artists as Rating objects.

scala> val someRecommendations =
     |   someUsers.map(userID => model.recommendProducts(userID, 5))

public class MatrixFactorizationModel

recommendProducts
public Rating[] recommendProducts(int user,
                         int num)
Recommends products to a user.
Parameters:
user - the user to recommend products to
num - how many products to return. The number returned may be less than this.
Returns:
Rating objects, each of which contains the given user ID, a product ID, and a "score" in the rating field. Each represents one recommended product, and they are sorted by score, decreasing. The first returned is the one predicted to be most strongly recommended to the user. The score is an opaque value that indicates how strongly recommended the product is.
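
The two points in this description that are easy to overlook, descending order by score and possibly fewer than num results, can be mimicked in a few lines of plain Scala. Rating is a local stand-in for Spark's class, and recommendTopN and the scores are hypothetical; this is a sketch of the documented behavior, not the library's implementation.

```scala
// Local stand-in for Spark's Rating(user, product, rating); not the real class.
final case class Rating(user: Int, product: Int, rating: Double)

// Hypothetical helper: sort the candidate products by score, highest first,
// and keep at most num of them.
def recommendTopN(user: Int, scores: Map[Int, Double], num: Int): Array[Rating] =
  scores.toArray
    .sortBy { case (_, score) => -score }
    .take(num)
    .map { case (product, score) => Rating(user, product, score) }

// Made-up scores for three candidate products of user 1.
val candidateScores = Map(10 -> 0.3, 20 -> 0.9, 30 -> 0.6)

val top2 = recommendTopN(1, candidateScores, 2)
val top5 = recommendTopN(1, candidateScores, 5)

println(top2.map(_.product).mkString(", ")) // 20, 30
println(top5.length)                        // 3, fewer than the 5 requested
```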

Science To Medicine

Just My Daily Study Note by ts.anesth.kpum
