
DS/CMPSC 410 MiniProject #4

Spring 2020

Instructor: John Yen

TA: Rupesh Prajapati

Learning Objectives

  • Be able to apply k-means clustering to the Darknet dataset
  • Be able to interpret the features of the cluster centers generated
  • Be able to compare the results of k-means clustering for different values of k

Total points: 35

  • Exercise 1: 10 points
  • Exercise 2: 10 points
  • Exercise 3: 15 points

Due: 5 pm, April 24, 2020.

[-]import pyspark
import csv

[-]from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans

[-]ss = SparkSession.builder.master("local").appName("FrequentPortSet").getOrCreate()

[-]Scanners_df = ss.read.csv("/storage/home/juy1/work/Darknet/scanners-dataset1-anon.csv", header=True, inferSchema=True)

We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

[-]Scanners_df.printSchema()

Exercise 1 (10 points)

Use k-means to cluster the scanners who scan the top 3-port-sets. Complete the code below for k-means clustering (k=10) on the following input features:

  • num_ports_scanned
  • avg_lifetime
  • total_lifetime
  • avg_pkt_size

Specify Parameters for k-means Clustering

[-]km = KMeans(featuresCol="features", predictionCol="prediction").setK(???).setSeed(???)
km.explainParams()

[-]va = VectorAssembler().setInputCols([???]).setOutputCol("features")

[-]Scanners_df.printSchema()

[-]data = va.transform(???)

[-]data.persist()

[-]kmModel = km.fit(???)

[-]kmModel

[-]predictions = ???.transform(???)

[-]predictions.persist().show(3)

[-]Cluster1_df = predictions.where(col("prediction") == 0)

[-]Cluster1_df.persist().count()

[-]summary = kmModel.summary

[-]summary.clusterSizes

[-]kmModel.computeCost(data)

[-]centers = kmModel.clusterCenters()

[-]print("Cluster Centers:")
i = 0
for center in centers:
    print("Cluster ", str(i+1), center)
    i = i + 1
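For intuition about what the prediction column holds: a fitted k-means model labels each feature vector with the index of its nearest cluster center (Euclidean distance by default). The sketch below reproduces that rule in plain Python, without Spark, on made-up numbers. The names nearest_center, toy_centers, and toy_points are hypothetical and the values are not taken from the Darknet data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Return the zero-based index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


# Hypothetical two-feature centers and points, for illustration only.
toy_centers = [[1.0, 10.0], [50.0, 500.0]]
toy_points = [[2.0, 12.0], [48.0, 510.0], [3.0, 9.0]]

toy_predictions = [nearest_center(p, toy_centers) for p in toy_points]
print(toy_predictions)  # [0, 1, 0]
```

Note that the indices are zero-based, so rows with prediction 0 belong to the center printed as "Cluster 1" by the loop above.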

Exercise 2 Analyze the result of k-means clustering (k=10) (10 points)

  • (a) Describe what the cluster center of the largest cluster indicates about the scanners in this group. (5 points)
  • (b) Describe what the cluster center of the second-largest cluster indicates about the scanners in this group. (5 points)
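One possible way to locate the largest and second-largest clusters is to sort the cluster indices by size and look up the matching centers. The sketch below uses made-up sizes and centers lists as stand-ins for summary.clusterSizes and kmModel.clusterCenters(). All numbers are illustrative only, and the feature order is assumed to match the order passed to setInputCols.

```python
# Assumed feature order; it must match whatever was given to setInputCols.
feature_names = ["num_ports_scanned", "avg_lifetime", "total_lifetime", "avg_pkt_size"]

# Hypothetical stand-ins for summary.clusterSizes and kmModel.clusterCenters().
sizes = [120, 4500, 37, 980]
centers = [
    [3.0, 40.0, 120.0, 60.0],
    [1.0, 5.0, 5.0, 44.0],
    [25.0, 300.0, 7500.0, 52.0],
    [2.0, 15.0, 30.0, 48.0],
]

# Cluster indices ordered from largest to smallest cluster.
ranked = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)

for rank, idx in enumerate(ranked[:2], start=1):
    described = dict(zip(feature_names, centers[idx]))
    print("Rank", rank, "cluster index", idx, "size", sizes[idx], described)
```

Pairing each center value with its feature name makes it easier to read a center as a profile of the scanners in that group.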

Exercise 3 Perform k-means clustering for a different choice of the value of k (15 points)

  • a) Change the value of k to 30. (5 points)
  • b) Compare the "cost" of the k-means result with that of k=10. (5 points)
  • c) Compare the top two clusters generated with k=30 with the top two clusters generated with k=10. (5 points)
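The "cost" returned by computeCost is the sum of squared distances from each point to its nearest cluster center. As a minimal sketch of that definition, the plain-Python example below computes the cost for a hypothetical four-point dataset under one center and under two centers. The function kmeans_cost and all data values are assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Hypothetical data: two tight pairs of points far apart from each other.
toy_points = [[1.0, 1.0], [1.0, 3.0], [9.0, 1.0], [9.0, 3.0]]
one_center = [[5.0, 2.0]]                # mean of all four points
two_centers = [[1.0, 2.0], [9.0, 2.0]]   # mean of each pair

cost_k1 = kmeans_cost(toy_points, one_center)
cost_k2 = kmeans_cost(toy_points, two_centers)
print(cost_k1, cost_k2)  # 68.0 4.0
```

Adding centers tends to lower this cost because every point can sit closer to some center, which is worth keeping in mind when comparing k=10 with k=30.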

[-]# Code for Exercise 3 (a)

Answer for Exercise 3 (b):

Answer for Exercise 3 (c):
