Spark in Practice #7: PySpark MLlib, Part 4

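Continuing from the previous post, we use 23,534 anesthesia records to predict, from preoperative data, whether any of Effortil (etilefrine), ephedrine, or Neosynesin (phenylephrine) was used. This time we use RDDs and MLlib.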
ーーーーーーーーーーーーーーーーーーーーーーーー
MLlib package of PySpark
Load and transform the data

Column summary (name: mean, standard deviation):
pressor: 0.54 0.50
dept: 13.81 6.57
operoom: 6.96 3.92
register: 1.34 0.82
anesthesia: 2.97 0.78
ASA: 2.07 1.53
age: 48.53 27.04
sex: 0.50 0.50
height: 149.28 35.97
weight: 51.32 28.35
age_cat: 5.37 2.67
time_cat: 10.38 2.93
ane_start: 10.46 2.93
ope_time: 199.68 146.16
ope_portion: 23.26 13.05
position: 4.87 0.67
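
Each line above reads as a column name followed by its mean and standard deviation. As a minimal sketch of how such a summary can be produced, the plain-Python snippet below uses only the five sample rows printed later in this post, not the full 23,534 records, so its numbers will differ from those above. On an RDD, MLlib's `Statistics.colStats` would normally supply the mean and variance. The `rows` list and the `summarize` helper are illustrative names, not the author's code, and the use of the sample standard deviation is an assumption.

```python
import statistics

# Three columns from the five sample rows printed in this post (not the full data set).
rows = [
    {"age": 2, "height": 87.5, "weight": 12.8},
    {"age": 80, "height": 157.0, "weight": 57.2},
    {"age": 64, "height": 157.5, "weight": 52.2},
    {"age": 71, "height": 149.0, "weight": 48.0},
    {"age": 38, "height": 167.9, "weight": 62.8},
]


def summarize(rows):
    """Return {column: (mean, sample standard deviation)} (hypothetical helper)."""
    summary = {}
    for col in rows[0]:
        values = [row[col] for row in rows]
        summary[col] = (statistics.mean(values), statistics.stdev(values))
    return summary


# Print in the same "name: mean std" layout as the listing above.
for col, (mean, std) in summarize(rows).items():
    print("{0}: {1:.2f} {2:.2f}".format(col, mean, std))
```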

Strongly correlated column pairs (each pair is listed in both directions):
pressor-to-age: 0.53
pressor-to-age_cat: 0.52
register-to-ASA: 0.60
ASA-to-register: 0.60
age-to-pressor: 0.53
age-to-height: 0.52
age-to-age_cat: 0.96
height-to-age: 0.52
height-to-weight: 0.84
weight-to-height: 0.84
age_cat-to-pressor: 0.52
age_cat-to-age: 0.96
time_cat-to-ane_start: 0.99
ane_start-to-time_cat: 0.99
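
The list above reports pairs of columns whose correlation is high; every value shown is at least 0.52. The sketch below shows one way to compute a Pearson correlation and filter pairs against a cut-off in plain Python. The 0.5 threshold is an assumption inferred from the listed values, and the data are again only the five sample rows printed in this post, so the coefficients will not match the ones above.

```python
import math
from itertools import combinations

# Columns taken from the five sample rows printed in this post.
columns = {
    "age": [2, 80, 64, 71, 38],
    "height": [87.5, 157.0, 157.5, 149.0, 167.9],
    "weight": [12.8, 57.2, 52.2, 48.0, 62.8],
}

THRESHOLD = 0.5  # assumed cut-off; the source does not state it


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=THRESHOLD):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


for a, b, r in correlated_pairs(columns):
    print("{0}-to-{1}: {2:.2f}".format(a, b, r))
```

Because the coefficient is symmetric, this sketch prints each pair once, whereas the listing above shows both directions.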

[Row(pressor=0, dept=23, operoom=10, register=4, anesthesia=0, ASA=0, age=2, sex=0, height=87.5, time_cat=0, ope_time=90, ope_portion=0, position=0),
Row(pressor=0, dept=24, operoom=13, register=1, anesthesia=0, ASA=0, age=80, sex=0, height=157.0, time_cat=11, ope_time=180, ope_portion=0, position=0),
Row(pressor=0, dept=18, operoom=1, register=1, anesthesia=0, ASA=0, age=64, sex=0, height=157.5, time_cat=12, ope_time=60, ope_portion=0, position=0),
Row(pressor=0, dept=9, operoom=1, register=4, anesthesia=0, ASA=0, age=71, sex=1, height=149.0, time_cat=1, ope_time=60, ope_portion=0, position=0),
Row(pressor=0, dept=22, operoom=14, register=3, anesthesia=0, ASA=0, age=38, sex=0, height=167.9, time_cat=4, ope_time=210, ope_portion=0, position=0)]

[Row(pressor=0, dept=23, operoom=10, register=4, anesthesia=0, ASA=0, age=2, sex=0, height=87.5, weight=12.8, age_cat=1, time_cat=0, ane_start=0, ope_time=90, ope_portion=0, position=0),
Row(pressor=0, dept=24, operoom=13, register=1, anesthesia=0, ASA=0, age=80, sex=0, height=157.0, weight=57.2, age_cat=9, time_cat=11, ane_start=0, ope_time=180, ope_portion=0, position=0),
Row(pressor=0, dept=18, operoom=1, register=1, anesthesia=0, ASA=0, age=64, sex=0, height=157.5, weight=52.2, age_cat=7, time_cat=12, ane_start=0, ope_time=60, ope_portion=0, position=0),
Row(pressor=0, dept=9, operoom=1, register=4, anesthesia=0, ASA=0, age=71, sex=1, height=149.0, weight=48.0, age_cat=8, time_cat=1, ane_start=2, ope_time=60, ope_portion=0, position=0),
Row(pressor=0, dept=22, operoom=14, register=3, anesthesia=0, ASA=0, age=38, sex=0, height=167.9, weight=62.8, age_cat=4, time_cat=4, ane_start=4, ope_time=210, ope_portion=0, position=0)]
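
Comparing the two listings above, the 13-column rows lack weight, age_cat, and ane_start. Each of these is strongly correlated with a column that is kept (height, age, and time_cat respectively). The snippet below is a plain-Python sketch of that column removal on a single row; on a Spark DataFrame the `drop` method would play this role. The `drop_columns` helper is a hypothetical name.

```python
# First sample row with all 16 columns, copied from the output above.
full_row = {
    "pressor": 0, "dept": 23, "operoom": 10, "register": 4,
    "anesthesia": 0, "ASA": 0, "age": 2, "sex": 0,
    "height": 87.5, "weight": 12.8, "age_cat": 1, "time_cat": 0,
    "ane_start": 0, "ope_time": 90, "ope_portion": 0, "position": 0,
}

# Columns missing from the 13-column listing.
DROPPED = ("weight", "age_cat", "ane_start")


def drop_columns(row, dropped=DROPPED):
    """Return a copy of the row without the dropped columns (hypothetical helper)."""
    return {name: value for name, value in row.items() if name not in dropped}


reduced = drop_columns(full_row)
print(list(reduced))
```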

[LabeledPoint(0.0, [23.0,10.0,4.0,0.0,0.0,2.0,0.0,87.5,0.0,90.0,0.0,0.0]),
LabeledPoint(0.0, [24.0,13.0,1.0,0.0,0.0,80.0,0.0,157.0,11.0,180.0,0.0,0.0]),
LabeledPoint(0.0, [18.0,1.0,1.0,0.0,0.0,64.0,0.0,157.5,12.0,60.0,0.0,0.0]),
LabeledPoint(0.0, [9.0,1.0,4.0,0.0,0.0,71.0,1.0,149.0,1.0,60.0,0.0,0.0]),
LabeledPoint(0.0, [22.0,14.0,3.0,0.0,0.0,38.0,0.0,167.9,4.0,210.0,0.0,0.0])]
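
Each LabeledPoint above pairs the pressor value, as the label, with the remaining 12 columns in their original order, as the feature vector. A minimal sketch of that conversion follows. To keep it runnable without Spark, it uses a namedtuple stand-in for `pyspark.mllib.regression.LabeledPoint`; the `to_labeled_point` helper is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.regression.LabeledPoint so the sketch runs without Spark.
LabeledPoint = namedtuple("LabeledPoint", ["label", "features"])

# Feature columns in the order they appear in the 13-column rows (label excluded).
FEATURE_COLUMNS = [
    "dept", "operoom", "register", "anesthesia", "ASA", "age",
    "sex", "height", "time_cat", "ope_time", "ope_portion", "position",
]


def to_labeled_point(row):
    """Use 'pressor' as the label and all other columns as float features."""
    label = float(row["pressor"])
    features = [float(row[name]) for name in FEATURE_COLUMNS]
    return LabeledPoint(label, features)


# First 13-column sample row from the output above.
row = {
    "pressor": 0, "dept": 23, "operoom": 10, "register": 4,
    "anesthesia": 0, "ASA": 0, "age": 2, "sex": 0, "height": 87.5,
    "time_cat": 0, "ope_time": 90, "ope_portion": 0, "position": 0,
}

point = to_labeled_point(row)
print(point)
```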

[LabeledPoint(0.0, [24.0,13.0,1.0,0.0,0.0,80.0,0.0,157.0,11.0,180.0,0.0,0.0]),
LabeledPoint(0.0, [18.0,1.0,1.0,0.0,0.0,64.0,0.0,157.5,12.0,60.0,0.0,0.0]),
LabeledPoint(0.0, [13.0,3.0,1.0,0.0,0.0,7.0,1.0,130.5,8.0,60.0,0.0,0.0]),
LabeledPoint(0.0, [3.0,13.0,1.0,0.0,0.0,5.0,1.0,106.4,8.0,180.0,0.0,0.0]),
LabeledPoint(0.0, [9.0,5.0,1.0,0.0,0.0,9.0,0.0,132.0,8.0,150.0,0.0,0.0])]
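
The second LabeledPoint listing starts from a different record than the first, which is consistent with a random split into training and test sets. The source does not give the ratio or the seed. The sketch below shows the idea in plain Python; on an RDD, `randomSplit` serves this purpose. The 0.7 training fraction, the seed, and the `random_split` helper are all assumptions for illustration.

```python
import random


def random_split(records, train_fraction=0.7, seed=42):
    """Assign each record to train or test at random (assumed 70/30 split)."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    train, test = [], []
    for record in records:
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test


# Placeholder records; in the post these would be LabeledPoint objects.
records = list(range(100))
train, test = random_split(records)
print(len(train), len(test))
```

As with any per-record random assignment, the resulting sizes only approximate the requested fraction.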

(weights=[-0.0192025792795,-0.0044390822126,-0.052370072204,-0.401262031975,0.11966806915,0.0474597210393,0.316300662306,-0.00267376233066,-0.0613497401323,0.00509484568777,-0.000437665331756,-0.176261999147], intercept=0.0)
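
The output above is a fitted linear model: 12 weights, one per feature, and an intercept of 0.0. The source does not name the model type. Assuming it is an MLlib logistic regression with the usual 0.5 threshold, a prediction is the sigmoid of the weighted feature sum compared against that threshold. The sketch below applies the printed weights to the first two LabeledPoint feature vectors shown earlier; the `predict` helper is a hypothetical name.

```python
import math

# Weights and intercept copied from the model output above.
WEIGHTS = [
    -0.0192025792795, -0.0044390822126, -0.052370072204, -0.401262031975,
    0.11966806915, 0.0474597210393, 0.316300662306, -0.00267376233066,
    -0.0613497401323, 0.00509484568777, -0.000437665331756, -0.176261999147,
]
INTERCEPT = 0.0


def predict(features, threshold=0.5):
    """Logistic prediction: sigmoid(w . x + b) compared with an assumed 0.5 threshold."""
    margin = sum(w * x for w, x in zip(WEIGHTS, features)) + INTERCEPT
    probability = 1.0 / (1.0 + math.exp(-margin))
    return 1.0 if probability > threshold else 0.0


# Feature vectors of the first two LabeledPoints printed earlier.
first = [23.0, 10.0, 4.0, 0.0, 0.0, 2.0, 0.0, 87.5, 0.0, 90.0, 0.0, 0.0]
second = [24.0, 13.0, 1.0, 0.0, 0.0, 80.0, 0.0, 157.0, 11.0, 180.0, 0.0, 0.0]

print(predict(first), predict(second))
```

Under these assumptions the 2-year-old patient is classified as 0 and the 80-year-old as 1, in line with the positive weight on age (feature index 5).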

[(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 0.0)]

Area under PR: 0.87
Area under ROC: 0.78
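
The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. In MLlib these two figures come from `BinaryClassificationMetrics`. The plain-Python sketch below computes the ROC area directly from that pairwise definition; the example scores are made up for illustration and are not taken from the anesthesia data.

```python
def area_under_roc(scored):
    """scored: list of (score, label) pairs with label 0 or 1."""
    positives = [score for score, label in scored if label == 1]
    negatives = [score for score, label in scored if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up scores, for illustration only.
example = [(0.1, 0), (0.4, 0), (0.35, 1), (0.8, 1)]
print(area_under_roc(example))
```

The pairwise loop is quadratic in the number of records, so it suits small illustrations rather than the full data set.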

accuracy: 0.77
weightedFalsePositiveRate: 0.22
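
Accuracy is the share of records whose prediction equals the label. The weighted false positive rate averages the per-class false positive rate, weighting each class by its share of the true labels; MLlib's `MulticlassMetrics` reports both. The sketch below computes them in plain Python. The (label, prediction) pairs are made up for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def accuracy(pairs):
    """pairs: list of (label, prediction)."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)


def weighted_false_positive_rate(pairs):
    """Per-class false positive rate, weighted by each class's share of true labels."""
    total = len(pairs)
    label_counts = Counter(label for label, _ in pairs)
    result = 0.0
    for cls, count in label_counts.items():
        # Records predicted as cls whose true label is something else.
        false_pos = sum(1 for label, pred in pairs if pred == cls and label != cls)
        negatives = total - count  # records whose true label is not cls
        fpr = false_pos / negatives if negatives else 0.0
        result += fpr * count / total
    return result


# Made-up (label, prediction) pairs, for illustration only.
pairs = [(0, 0), (0, 1), (1, 1), (1, 1), (1, 0)]
print("accuracy: {0:.2f}".format(accuracy(pairs)))
print("weightedFalsePositiveRate: {0:.2f}".format(weighted_false_positive_rate(pairs)))
```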

Selecting only the most predictive features
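
The models in the rest of this post work on four features: the trees refer only to feature 0 through feature 3, and the final weight vector has four entries. The source does not say how those four were chosen. MLlib offers `ChiSqSelector` for this kind of selection, so the sketch below assumes a chi-squared ranking of each feature against the label. It is written in plain Python on made-up toy data, and the helper names are hypothetical.

```python
from collections import Counter


def chi_squared(values, labels):
    """Chi-squared statistic of independence between one feature and the label."""
    n = len(values)
    observed = Counter(zip(values, labels))
    value_totals = Counter(values)
    label_totals = Counter(labels)
    stat = 0.0
    for value, value_total in value_totals.items():
        for label, label_total in label_totals.items():
            expected = value_total * label_total / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    return stat


def top_features(points, k):
    """points: list of (label, features). Return sorted indices of the k best features."""
    labels = [label for label, _ in points]
    n_features = len(points[0][1])
    scores = [
        chi_squared([features[i] for _, features in points], labels)
        for i in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: feature 0 matches the label exactly, feature 1 is unrelated to it.
toy = [(0, [0, 0]), (0, [0, 1]), (1, [1, 0]), (1, [1, 1])]
print(top_features(toy, 1))
```

A chi-squared test treats every distinct value as a category, so continuous columns such as height would need binning first.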

Random Forest in Spark

Area under PR: 0.86
Area under ROC: 0.71

TreeEnsembleModel classifier with 6 trees
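
For classification, a random forest predicts by letting each tree vote and taking the majority class. The sketch below shows that mechanism in plain Python with trees encoded as nested tuples. The three one-split trees are toy examples that merely reuse some feature indices and thresholds seen in the dump; they are not the six trees of the actual model, and the helper names are hypothetical.

```python
from collections import Counter

# A node is either a float (leaf prediction) or a tuple
# (feature_index, threshold, left_subtree, right_subtree).


def predict_tree(node, features):
    """Walk down one tree: go left when feature <= threshold, else right."""
    while isinstance(node, tuple):
        index, threshold, left, right = node
        node = left if features[index] <= threshold else right
    return node


def predict_forest(trees, features):
    """Majority vote over all trees (ties resolved by first vote seen in this sketch)."""
    votes = Counter(predict_tree(tree, features) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy one-split trees, for illustration only.
toy_trees = [
    (3, 1.0, 0.0, 1.0),   # If (feature 3 <= 1.0) Predict 0.0 Else Predict 1.0
    (0, 14.0, 0.0, 1.0),  # If (feature 0 <= 14.0) Predict 0.0 Else Predict 1.0
    (2, 2.0, 1.0, 0.0),   # If (feature 2 <= 2.0) Predict 1.0 Else Predict 0.0
]

print(predict_forest(toy_trees, [16.0, 5.0, 1.0, 0.0]))
```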

Tree 0:
If (feature 3 <= 1.0) If (feature 2 <= 2.0) If (feature 2 <= 0.0) If (feature 1 <= 11.0) Predict: 0.0 Else (feature 1 > 11.0)
Predict: 0.0
Else (feature 2 > 0.0)
If (feature 0 <= 15.0) Predict: 1.0 Else (feature 0 > 15.0)
Predict: 1.0
Else (feature 2 > 2.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) Predict: 0.0 Else (feature 0 > 11.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 1 <= 8.0) Predict: 0.0 Else (feature 1 > 8.0)
Predict: 1.0
Else (feature 3 > 1.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 2 <= 5.0) Predict: 1.0 Else (feature 2 > 5.0)
Predict: 0.0
Tree 1:
If (feature 3 <= 1.0) If (feature 2 <= 2.0) If (feature 2 <= 0.0) If (feature 0 <= 3.0) Predict: 0.0 Else (feature 0 > 3.0)
Predict: 1.0
Else (feature 2 > 0.0)
If (feature 0 <= 15.0) Predict: 1.0 Else (feature 0 > 15.0)
Predict: 1.0
Else (feature 2 > 2.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) Predict: 0.0 Else (feature 0 > 11.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 1 <= 8.0) Predict: 0.0 Else (feature 1 > 8.0)
Predict: 1.0
Else (feature 3 > 1.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 2 <= 5.0) Predict: 1.0 Else (feature 2 > 5.0)
Predict: 0.0
Tree 2:
If (feature 3 <= 1.0) If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 0 <= 3.0) Predict: 0.0 Else (feature 0 > 3.0)
Predict: 0.0
Else (feature 0 > 11.0)
If (feature 2 <= 2.0) Predict: 0.0 Else (feature 2 > 2.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 1 <= 8.0) Predict: 0.0 Else (feature 1 > 8.0)
Predict: 1.0
Else (feature 3 > 1.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 3 <= 4.0) Predict: 1.0 Else (feature 3 > 4.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 2 <= 5.0) Predict: 1.0 Else (feature 2 > 5.0)
Predict: 0.0
Tree 3:
If (feature 3 <= 1.0) If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 0.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 1 <= 8.0) Predict: 0.0 Else (feature 1 > 8.0)
Predict: 1.0
Else (feature 3 > 1.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 3 <= 5.0) Predict: 1.0 Else (feature 3 > 5.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 2 <= 5.0) Predict: 1.0 Else (feature 2 > 5.0)
Predict: 0.0
Tree 4:
If (feature 3 <= 1.0) If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 0.0
Else (feature 0 > 11.0)
If (feature 1 <= 1.0) Predict: 0.0 Else (feature 1 > 1.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 1 <= 8.0) Predict: 0.0 Else (feature 1 > 8.0)
Predict: 1.0
Else (feature 3 > 1.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 2 <= 5.0) Predict: 1.0 Else (feature 2 > 5.0)
Predict: 0.0
Tree 5:
If (feature 3 <= 1.0) If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 0.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 0 <= 17.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 17.0)
If (feature 1 <= 8.0) Predict: 0.0 Else (feature 1 > 8.0)
Predict: 0.0
Else (feature 3 > 1.0)
If (feature 0 <= 14.0) If (feature 0 <= 11.0) If (feature 2 <= 2.0) Predict: 1.0 Else (feature 2 > 2.0)
Predict: 1.0
Else (feature 0 > 11.0)
If (feature 0 <= 13.0) Predict: 0.0 Else (feature 0 > 13.0)
Predict: 0.0
Else (feature 0 > 14.0)
If (feature 2 <= 5.0) If (feature 0 <= 17.0) Predict: 1.0 Else (feature 0 > 17.0)
Predict: 1.0
Else (feature 2 > 5.0)
If (feature 1 <= 8.0) Predict: 0.0 Else (feature 1 > 8.0)
Predict: 0.0

Area under PR: 0.79
Area under ROC: 0.61

(weights=[0.0539572066071,-0.0139919738754,-0.27993351311,0.162644911917], intercept=0.0)