Let's move on to Chapter 4 of Advanced Analytics with Spark.
———————————————————-
The Covtype dataset is available from:
https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/
covtype.data.gz    31-Aug-1998 14:01    11M
covtype.info       18-Apr-2010 00:01    14K
Unpacking it gives covtype.data; a peek inside shows 55 columns and 581,012 rows of data:
2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2804,139,9,268,65,3180,234,238,135,6121,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2
2785,155,18,242,118,3090,238,238,122,6211,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,2
2595,45,2,153,-1,391,220,234,150,6172,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2579,132,6,300,-15,67,230,237,140,6031,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,2
2606,45,7,270,5,633,222,225,138,6256,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2605,49,4,234,7,573,222,230,144,6228,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2617,45,9,240,56,666,223,221,133,6244,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2612,59,10,247,11,636,228,219,124,6230,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5
2612,201,4,180,51,735,218,243,161,6222,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5
..........................
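Those counts are easy to verify with a few lines of plain Scala (a sketch; it assumes covtype.data sits in the current working directory):

// Quick sanity check of the shape: expect 581,012 rows of 55 columns.
val lines = scala.io.Source.fromFile("covtype.data").getLines().toVector
println(lines.size)                  // 581012
println(lines.head.split(',').size)  // 55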
According to covtype.info, the data structure is:
6. Number of Attributes: 12 measures, but 54 columns of data
   (10 quantitative variables, 4 binary wilderness areas and
   40 binary soil type variables)
The Attribute information section gives descriptions of 13 items:
Given is the attribute name, attribute type, the measurement unit and a brief
description. The forest cover type is the classification problem. The order of
this listing corresponds to the order of numerals along the rows of the database.

Name                                  Data Type     Measurement                   Description

Elevation                             quantitative  meters                        Elevation in meters
Aspect                                quantitative  azimuth                       Aspect in degrees azimuth
Slope                                 quantitative  degrees                       Slope in degrees
Horizontal_Distance_To_Hydrology      quantitative  meters                        Horz Dist to nearest surface water features
Vertical_Distance_To_Hydrology        quantitative  meters                        Vert Dist to nearest surface water features
Horizontal_Distance_To_Roadways       quantitative  meters                        Horz Dist to nearest roadway
Hillshade_9am                         quantitative  0 to 255 index                Hillshade index at 9am, summer solstice
Hillshade_Noon                        quantitative  0 to 255 index                Hillshade index at noon, summer solstice
Hillshade_3pm                         quantitative  0 to 255 index                Hillshade index at 3pm, summer solstice
Horizontal_Distance_To_Fire_Points    quantitative  meters                        Horz Dist to nearest wildfire ignition points
Wilderness_Area (4 binary columns)    qualitative   0 (absence) or 1 (presence)   Wilderness area designation
Soil_Type (40 binary columns)         qualitative   0 (absence) or 1 (presence)   Soil Type designation
Cover_Type (7 types)                  integer       1 to 7                        Forest Cover Type designation
The rest of covtype.info explains each of these fields in detail.
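To make that layout concrete, here is a minimal sketch of my own (a hypothetical helper, not something from covtype.info) that splits a single CSV line into its logical parts:

// Hypothetical helper: decode one covtype.data line using the layout above:
// 10 quantitative columns, 4 one-hot wilderness flags, 40 one-hot soil flags,
// and the cover-type label.
def decodeRow(line: String): (Array[Double], Int, Int, Int) = {
  val v = line.split(',').map(_.toDouble)
  val quantitative = v.slice(0, 10)
  val wilderness   = v.slice(10, 14).indexOf(1.0)  // which of the 4 areas is set
  val soil         = v.slice(14, 54).indexOf(1.0)  // which of the 40 soil types is set
  val coverType    = v(54).toInt                   // label, 1 to 7
  (quantitative, wilderness, soil, coverType)
}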
For now, let's move the data file into HDFS.
MacBook-Pro-5:3.1.1 $ cd /usr/local/Cellar/hadoop/3.1.1/sbin/
MacBook-Pro-5:sbin $ ./start-dfs.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [MacBook-Pro-5]
MacBook-Pro-5:sbin $ jps
40337 NameNode
40585 SecondaryNameNode
40699 Jps
40444 DataNode
MacBook-Pro-5:sbin $ ./start-yarn.sh
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
Starting resourcemanager
Starting nodemanagers
localhost: WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
MacBook-Pro-5:sbin $ jps
40337 NameNode
41046 Jps
40585 SecondaryNameNode
40843 ResourceManager
40444 DataNode
40959 NodeManager
MacBook-Pro-5:sbin $ cd ..
MacBook-Pro-5:3.1.1 $ cd bin
MacBook-Pro-5:bin $ ./hadoop fs -ls /
Found 2 items
drwxr-xr-x   - ******* supergroup          0 2018-09-24 00:15 /linkage
drwxr-xr-x   - ******* supergroup          0 2018-09-29 15:04 /user
MacBook-Pro-5:bin $ ./hadoop fs -cd user
MacBook-Pro-5:bin $ ./hadoop fs -ls
Found 2 items
drwxr-xr-x   - ******* supergroup          0 2018-09-29 15:02 ds
drwxr-xr-x   - ******* supergroup          0 2018-09-23 17:46 output
MacBook-Pro-5:bin $ ./hadoop fs -put /Users/*******/Desktop/covtype.data /user/*******/ds
MacBook-Pro-5:bin $ ./hadoop fs -ls ds/
Found 4 items
-rw-r--r--   3 ******* supergroup    2932731 2018-09-29 15:02 ds/artist_alias.txt
-rw-r--r--   3 ******* supergroup   55963575 2018-09-29 15:02 ds/artist_data.txt
-rw-r--r--   3 ******* supergroup   75169317 2018-10-08 08:35 ds/covtype.data
-rw-r--r--   3 ******* supergroup  426761761 2018-09-29 15:02 ds/user_artist_data.txt
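As an aside, the same upload can also be done programmatically rather than through the hadoop CLI. Here is a sketch with the Hadoop FileSystem API, assuming the NameNode address from the session above; <username> is a placeholder:

// Sketch: programmatic equivalent of `hadoop fs -put` via the FileSystem API.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.defaultFS", "hdfs://localhost")  // NameNode as in the session above
val fs = FileSystem.get(conf)
fs.copyFromLocalFile(
  new Path("/Users/<username>/Desktop/covtype.data"),  // <username> is a placeholder
  new Path("/user/<username>/ds/covtype.data"))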
Launch Spark using all CPU cores and 6 GB of driver memory.
MacBook-Pro-5:bin $ ${SPARK_HOME}/bin/spark-shell --master local[*] --driver-memory 6g
2018-10-08 09:17:19 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://macbook-pro-5:4040
Spark context available as 'sc' (master = local[*], app id = local-1538957848581).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)
Type in expressions to have them evaluated.
Type :help for more information.
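For reference, the same context can be built in application code instead of the REPL. A sketch follows; note that in local mode driver memory still has to be supplied at JVM startup (e.g. spark-submit --driver-memory 6g), so it is not set in the builder:

// Sketch: standalone-app equivalent of the spark-shell session above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("covtype")   // arbitrary app name
  .master("local[*]")   // all cores, as with --master local[*]
  .getOrCreate()
val sc = spark.sparkContext  // plays the same role as the shell's 'sc'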
Next, on to implementing the decision tree:
scala> import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg._

scala> import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.regression._

scala> val rawData = sc.textFile("hdfs://localhost/user/*******/ds/covtype.data")
rawData: org.apache.spark.rdd.RDD[String] = hdfs://localhost/user/*******/ds/covtype.data MapPartitionsRDD[1] at textFile at <console>:30

scala> rawData.first
res1: String = 2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5

scala> val data = rawData.map { line =>
     |   val values = line.split(',').map(_.toDouble)
     |   val featureVector = Vectors.dense(values.init)
     |   val label = values.last - 1
     |   LabeledPoint(label, featureVector)
     | }
data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[2] at map at <console>:31

scala> data.first
res2: org.apache.spark.mllib.regression.LabeledPoint = (4.0,[2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
Vectors is a class in org.apache.spark.mllib.linalg, and dense() is:
dense(double[] values)
Creates a dense vector from a double array.
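For example, with arbitrary values:

// Build a dense vector from an array (a varargs overload also exists).
val v = Vectors.dense(Array(2596.0, 51.0, 3.0))  // [2596.0,51.0,3.0]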
Scala's init returns the list with its last element removed.
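For example, on a small array (values arbitrary):

val values = Array(2596.0, 51.0, 3.0, 5.0)
values.init  // Array(2596.0, 51.0, 3.0): everything except the last element
values.last  // 5.0: the last element, which the code above uses as the label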
Finally, the map returns a LabeledPoint, a class from org.apache.spark.mllib.regression. LabeledPoint is:
LabeledPoint(double label, Vector features) |
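So a hand-built example (arbitrary values, reusing the imports above) looks like:

val lp = LabeledPoint(4.0, Vectors.dense(2596.0, 51.0, 3.0))
lp.label     // 4.0
lp.features  // [2596.0,51.0,3.0]

With the data in this shape, the natural next step is to split off training, cross-validation and test sets and train a first tree with MLlib's DecisionTree. A sketch of that step; the split ratios and the hyperparameters (gini impurity, depth 4, 100 bins) are illustrative choices, not prescriptions:

// Sketch: hold out CV and test sets, then train a first decision tree.
import org.apache.spark.mllib.tree.DecisionTree

val Array(trainData, cvData, testData) = data.randomSplit(Array(0.8, 0.1, 0.1))
trainData.cache()

// 7 classes (labels 0 to 6), no categorical features declared, gini impurity,
// max depth 4, 100 bins -- all illustrative values.
val model = DecisionTree.trainClassifier(
  trainData, 7, Map[Int, Int](), "gini", 4, 100)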