# 0.1 (50 pts) Part 1: Missing Data Analysis and Dimensionality Reduction


January 27, 2020

To answer the first part of Lab 1, use the data as described in the Jupyter tutorial presented in class and derived from the following paper:

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. "Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights." In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019. (https://arxiv.org/abs/1909.09638)

The data can also be found at https://www.kaggle.com/sobhanmoosavi/us-accidents.

0.1.1 Objective of this Lab:

- Inform you about best practices pertaining to missing data.
- Reduce the number of attributes.

0.1.2 (25 pts) Question 1: Missing Data

Missing data typically occurs when no value is observed for a given set of attributes. This can have a significant impact on the inference that can be derived from the data, and has to be addressed carefully to minimize said impact. The "accident risk" dataset has a certain amount of missing data, and there is a need to address this problem. Fill in missing values for every column with continuous data using the following three strategies:

In [ ]: import pandas as pd
        # be sure to make use of the full dataset.

Out[ ]:           ID    Source    TMC  Severity           Start_Time  \
        0   A-603309  MapQuest  241.0         2  2019-02-15 16:56:38
        1   A-676455  MapQuest  201.0         2  2019-01-26 15:54:00
        2  A-2170526      Bing    NaN         4  2018-01-06 10:51:40
        3  A-1162086  MapQuest  201.0         2  2018-05-25 16:12:02
        4   A-309058  MapQuest  201.0         2  2016-10-26 19:42:11

                      End_Time  Start_Lat    Start_Lng   End_Lat    End_Lng  \
        0  2019-02-15 17:26:16  34.725163   -86.596359       NaN        NaN
        1  2019-01-26 16:25:00  32.883579   -80.019722       NaN        NaN
        2  2018-01-06 16:51:40  39.979172   -82.983870  39.99384  -82.98502
        3  2018-05-25 17:11:49  34.857208   -82.256157       NaN        NaN
        4  2016-10-26 20:26:58  47.662689  -117.357658       NaN        NaN

           ...  Roundabout  Station   Stop  Traffic_Calming  \
        0  ...       False    False  False            False
        1  ...       False    False  False            False
        2  ...       False    False  False            False
        3  ...       False    False  False            False
        4  ...       False    False  False            False

           Traffic_Signal  Turning_Loop  Sunrise_Sunset  Civil_Twilight  Nautical_Twilight  \
        0           False         False             Day             Day                Day
        1           False         False             Day             Day                Day
        2           False         False             Day             Day                Day
        3            True         False             Day             Day                Day
        4            True         False           Night           Night              Night

           Astronomical_Twilight
        0                    Day
        1                    Day
        2                    Day
        3                    Day
        4                  Night

        [5 rows x 49 columns]

Fill in missing values for every continuous data column using the following three strategies:

(10 pts) Q 1.1: Use the mean to fill the missing values for any given attribute. First collect the mean value of all the observations of each attribute with missing data. Then use that mean value (calculated per attribute) to fill in the missing data.

In [ ]: # code here
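A minimal sketch of mean imputation on a toy frame (the column names and values below are invented stand-ins, not the dataset's; in the lab, the same two lines apply to the loaded accident data):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the accident data; in the lab, use the
# full dataframe loaded above.
df = pd.DataFrame({"Temperature": [30.0, np.nan, 34.0, 32.0],
                   "Humidity":    [80.0, 75.0, np.nan, np.nan]})

# Compute the per-column mean over the observed values only
# (pandas skips NaN by default), then fill the gaps with it.
means = df.mean()
df_filled = df.fillna(means)
```

Note that `fillna` with a Series fills each column by its matching label, so every attribute gets its own mean.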

(15 pts) Q 1.2: Randomly sample based on the probability distribution of the given column data. Generate and fill missing values using random samples from the probability distribution estimated from the observed values of each column. Verify your estimated pdf by comparing the histogram of observed values vs. estimated values. Remember to fill all the columns containing missing data.

Steps to generate random samples by estimating the probability distribution:

* First generate the probability distribution of the attribute using the existing observations.
  - Draw a histogram of the observed values.
  - Estimate the distribution using KDE.
* Using the pdf, find the cdf (make sure the cdf is normalized to values 0 to 1).
* Generate the inverse cdf function, and pass a random value in [0, 1] drawn from a uniform distribution to the inverse function to get random samples from the estimated pdf.

i.e.,

r ~ Uniform(0, 1)
x = cdf^(-1)(r)

In [ ]: # code here
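The steps above can be sketched with `scipy.stats.gaussian_kde` (the synthetic "observed" column below is an assumption for illustration; in the lab it would be the non-missing values of a real column):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
observed = rng.normal(10.0, 2.0, size=500)  # stand-in for an observed column

# 1) estimate the pdf with a Gaussian KDE
kde = gaussian_kde(observed)

# 2) evaluate the pdf on a grid and accumulate it into a cdf,
#    normalized so it runs from 0 to 1
grid = np.linspace(observed.min() - 3, observed.max() + 3, 1000)
pdf = kde(grid)
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

# 3) invert the cdf by interpolation: for r ~ Uniform(0, 1),
#    x = cdf^-1(r) is a sample from the estimated pdf
r = rng.uniform(0, 1, size=2000)  # one draw per missing value
samples = np.interp(r, cdf, grid)
```

`np.interp(r, cdf, grid)` works as the inverse because the cdf is monotonically increasing, so swapping its axes gives the quantile function.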


(Bonus – 15 pts) Q 1.3: Use linear regression on other continuous columns to fill in the missing data. Make sure to use multiple attributes (columns) that do not contain too many missing values (< 5%) as training features. Let us say that you select four training attributes, each with some possible missing values. Now, select only those rows which are fully populated for all four attributes. Next, use multivariate regression to predict the missing values.

In [ ]: # code here
        import numpy as np
        from sklearn.linear_model import LinearRegression

        # create 2d array X here
        # X = all the other predictors as a 2d array,
        # where rows are the samples and columns are the training features
        # (attributes that have < 5% missing values)

        # create 1d array y here
        # y = the missing-data column that you wish to fill; it must
        # contain only the observed values for training

        # train the model
        reg = LinearRegression().fit(X, y)

        # predict the values using the same set of training features;
        # here the rows are samples that contain missing values in the y column,
        # and the columns are the same training features for those samples
        y_miss = reg.predict(X_miss)

        # now repeat this for all columns containing missing data
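As a sanity check, the same idea can be exercised end-to-end on a tiny hand-made frame (the column names and values are invented, not the dataset's):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame: "Pressure" has gaps, the other two columns are complete.
df = pd.DataFrame({
    "Temperature": [30.0, 31.0, 32.0, 33.0, 34.0, 35.0],
    "Humidity":    [80.0, 78.0, 76.0, 74.0, 72.0, 70.0],
    "Pressure":    [29.0, np.nan, 29.4, np.nan, 29.8, 30.0],
})

features = ["Temperature", "Humidity"]
observed = df["Pressure"].notna()

# fit on the fully observed rows, then predict the rows where
# "Pressure" is missing and write the predictions back in place
reg = LinearRegression().fit(df.loc[observed, features],
                             df.loc[observed, "Pressure"])
df.loc[~observed, "Pressure"] = reg.predict(df.loc[~observed, features])
```

The boolean mask `observed` plays the role of "select only fully populated rows" for training and its complement selects the rows to predict.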

0.1.3 (25 points) Question 2: Dimensionality Reduction

Often data is collected with a certain design and purpose. After the collection is over, it is often determined

that certain attributes do not have any impact on the sought inferences and/or predictions. It is therefore

of value (storage, computing time, etc.) to reduce the amount of time. Think of compression schemes. We

ask you to do the same for the accident risk data. There are many methods to reduce data size. For this

laboratory, we ask you to use a dimensionality reduction called Principal Component Analysis or PCA.

Conduct dimensionality reduction using PCA on the two datasets generated after answering Question 1.1 and Question 1.2, respectively. Follow these steps:

1. Conduct exploratory analysis in the following manner

1.1. Consider only the columns with continuous data (there should be 10 weather-related columns, nine of which should be continuous).

In [ ]: # code here
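One way to keep only the continuous columns is pandas' `select_dtypes`; the frame and column names below are invented stand-ins for the real weather columns:

```python
import pandas as pd
import numpy as np

# Toy frame mixing continuous and non-continuous columns.
df = pd.DataFrame({
    "Temperature":       [30.0, 31.5],
    "Humidity":          [80.0, 75.0],
    "Weather_Condition": ["Rain", "Clear"],
    "Traffic_Signal":    [True, False],
})

# select_dtypes keeps only floating-point columns, dropping
# the string and boolean ones
continuous = df.select_dtypes(include=[np.floating])
```

Restricting to floating-point dtypes (rather than all numerics) avoids pulling in boolean or integer-coded categorical columns.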

1.2. Choose five attributes you think are important to predict severity. Please feel free to consult the paper if necessary.

In [ ]: # code here


1.3. Plot a scatter-plot matrix of all five attributes for 1000 randomly selected points.

In [ ]: # code here

1.4. Comment on what you observe.

1.5. Plot histograms for each of the five data columns (all in one figure).

In [ ]: # code here

2. Now, conduct a PCA on all 9 weather attributes and select the first two principal components.

In [ ]: import numpy as np
        from sklearn.decomposition import PCA

        def pca_model(X):
            # code here
            pca = PCA(n_components=2)
            pca.fit(X)
            return pca

        # build X
        # X is a 2d matrix, where rows are the samples and the columns
        # are the 9 weather-related continuous features.
        pca = pca_model(X)

3. Retain only two columns of the transformed data.

In [ ]: # now transform the data
        X_2col = pca.transform(X)

        # code to display or visualize the two columns

4. Next, visualize the transformed data in two dimensions.

4.1. Draw a scatter plot using the two dimensions. Color each datapoint by its severity category, using four different colors.

In [ ]: def pca_plotter(X, severity, s_colors):
            """
            X is the data to be transformed.
            severity is the data column containing severity information.
            s_colors is a list containing the colors to assign each severity value.
            """
            # code here


4.2 What differences do you observe for the two or three datasets (from answers to Q1.1, Q1.2 and/or Q1.3) after dimensionality reduction?

In [ ]: # code here
        # X_q1 = PCA-transformed data after Q1.1
        # X_q2 = PCA-transformed data after Q1.2
        # X_q3 = PCA-transformed data after Q1.3, if you answered the bonus question
        for x in [X_q1, X_q2, X_q3]:
            pca_plotter(x, severity, s_colors)

0.2 (50 pts) Part 2: Visualizations and Explorations of Unstructured Text Data

Text data is hard to work with and requires special tools to analyze and visualize. We will analyze text corpora using parsers and tokenizers. Then, using word clouds, we will visualize the distributions of word tokens in a perceptually meaningful way. We will then be able to extract and glean patterns from unstructured text data. We will use this data collection for this part of the laboratory.

0.2.1 (25 pts) Question 1: Create a word cloud for tags and answer the following questions.

Please visit the following data source: https://www.kaggle.com/stackoverflow/pythonquestions

This data consists of the full text of all questions and answers from Stack Overflow that are tagged with the python tag. It is useful for developing natural language processing methods pertaining to "Q&A" and community response.

It is normally of interest to determine the words (or tags) that either occur very frequently or very rarely. Our goal is to find those words and also accord those tags/words a measure of strength.

To weigh each word’s strength:

1. Use frequency counts for each tag to highlight the most frequent tags.

In [ ]: df = pd.read_csv("./data/Tags.csv")
        tags = []  # code here to extract the tags as a list

        def freq_counter(tags):
            # code here
            pass

        d = freq_counter(tags)
        # d = dict where keys are the tags,
        # and values are their corresponding frequency counts
        print(d)
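One possible `freq_counter`, sketched with the standard library's `collections.Counter` (the sample tag list is invented for illustration):

```python
from collections import Counter

def freq_counter(tags):
    # Counter tallies the occurrences of each tag in a single pass;
    # convert to a plain dict for downstream use.
    return dict(Counter(tags))

d = freq_counter(["python", "pandas", "python", "numpy", "python"])
```

On the sample list this yields a count of 3 for "python" and 1 each for "pandas" and "numpy".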

2. Collect the results in a table manifest as an alltags.csv file.

In [ ]: import pandas as pd
        # code here to convert to a dataframe and save a csv file

3. Plot the histogram of the tags. Choose an appropriate number to plot.

In [ ]: def plot_histogram(freq_dict):
            # code here
            pass

        plot_histogram(freq_dict)

4. What distribution would best describe the occurrence of tags? What do you observe?

5. Now use the inverse of the frequency counts (1/f) to highlight rarely occurring tags.

In [ ]: def inverse_freq_counter(tags):
            # code here
            pass

        invf = inverse_freq_counter(tags)
        plot_histogram(invf)
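A possible `inverse_freq_counter`, again sketched on an invented tag list: inverting the raw counts gives a tag seen once a weight of 1.0 and a tag seen n times a weight of 1/n, so rare tags stand out.

```python
from collections import Counter

def inverse_freq_counter(tags):
    # weight each tag by the reciprocal of its frequency count
    return {tag: 1.0 / count for tag, count in Counter(tags).items()}

invf = inverse_freq_counter(["python", "pandas", "python", "numpy", "python"])
```

Here the frequently seen "python" gets weight 1/3 while the singletons get weight 1.0.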

6. Collect the results in a table manifest as a mostfrequent.csv file.

In [ ]: # code here

0.2.2 (25 pts) Question 2: Use word clouds to answer the following questions.

Find the two other associated datasets necessary for this part of the laboratory: https://www.kaggle.com/stackoverflow/pythonquestions#Answers.csv and https://www.kaggle.com/stackoverflow/pythonquestions#Questions.csv. Here you will tokenize and extract frequency counts. Make sure to also remove stop words. These occur frequently in English and include articles of speech, tense markers, etc. (e.g., "an", "a", "the", "is", "at", etc.). In text pre-processing, stop words are typically filtered out before the actual processing of any natural language text.

A typical word cloud shows all the words in a collection placed in such a way that the frequently occurring words appear prominent (large, colorful, etc.). Word clouds can help identify trends and patterns that would otherwise be unclear or difficult to see in a tabular format. Visit https://en.wikipedia.org/wiki/Tag_cloud for some explanation.

Use nltk (PyPI)(GitHub)(docs) for tokenization and general text-processing needs.

Make use of the Python package wordcloud (PyPI)(GitHub)(docs) to do all the heavy lifting with the actual visualization aspects.

1. Build a word cloud for the python tags and compare the visualization with the histogram. Do you observe any significant differences between the two types of visualizations?

In [ ]: from wordcloud import WordCloud

        def plot_word_cloud(freq_dict):
            # code here
            wc = WordCloud(background_color="black")
            wc.generate_from_frequencies(freq_dict)

        # run this to plot them both
        plot_histogram(freq_dict)
        plot_word_cloud(freq_dict)


2. What are the most frequent words in Questions.csv? What are the most frequent words in Answers.csv?

In [ ]: import nltk
        import pandas as pd

        def text_freq_counter(texts):
            # code here
            pass

        # be sure to remove all stop words in the questions and answers data
        from nltk.corpus import stopwords
        print(set(stopwords.words('english')))

        # generate word cloud here
        plot_word_cloud(freq_dict)
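A minimal sketch of the tokenize-filter-count pipeline. Note that `nltk.word_tokenize` and the full stopword corpus require one-time `nltk.download` calls, so this sketch substitutes a simple regex tokenizer and a tiny hand-picked stopword set (both are stand-ins, not the real lists) just to show the shape of the computation:

```python
import re
from collections import Counter

# Minimal stand-in for nltk's English stopword list (the real one is larger).
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "to", "of", "and"}

def text_freq_counter(texts):
    counts = Counter()
    for text in texts:
        # lowercase and split into word-like tokens; in the actual lab,
        # nltk.word_tokenize would replace this regex
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return dict(counts)

freq_dict = text_freq_counter(["The list is empty",
                               "How to sort a list in Python?"])
```

The resulting dict maps each non-stopword token to its count and can be fed directly to `plot_word_cloud` above.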

3. What are the most frequent words combined (Stack Overflow Questions + Answers)? Comment on the differences you observe between the three!

In [ ]: # code here

4. Assign a different color to words depending on where they belong: shades of red for words that only exist in Stack Overflow Questions, blue for words only in Stack Overflow Answers, and purple for the combined. What do you observe?

In [ ]: # code here

5. Do you find the same set of words appearing as highly frequent in all three collections? If yes, highlight them. Describe how you picked and decided which set of words would be deemed highly frequent. What criteria did you use? How many did you choose to display? Why?

In [ ]: # code here

6. Can you find words that are frequent in Stack Overflow Answers but rare in Stack Overflow Questions? What about vice versa? What can you say about the differences?

In [ ]: # code here

0.3 Appendix A

To decide how to handle missing data, we must first better understand why exactly the data are missing and what the causes could be. Then we can deploy some strategies to handle it. For more information, please peruse this manuscript. Some introduction can also be found on the Wikipedia page (https://en.wikipedia.org/wiki/Missing_data).

### Mechanics of Missing Data:

(Rubin, Donald B. "Inference and missing data." Biometrika 63.3 (1976): 581-592.)

There are some well-understood mechanisms or assumptions you can make about your missing data to make an informed decision on how to handle it:

1. Missing Completely At Random (MCAR): If an attribute has data points missing completely at random, it can be assumed that the probability of an attribute missing its value is uniformly the same for every sample. For example, every time a die rolls 6, do not enter the value. In such scenarios, if the missing data is < 5%, removing the samples containing missing data will not bias your inference, since the observed distribution is the same as the unobserved. If the missing data is more substantial (5-40%), one can impute using linear (numerical data) or logistic (categorical data) regression.

2. Missing At Random (MAR): In reality nothing is truly random; a variable is usually dependent or conditional on existing attributes. This dependency is often hidden and cannot be easily discerned. For example, if blood pressure data are missing at random conditional on age, gender and ethnicity, then the distributions of missing and observed blood pressures will be similar among people of the same age, gender and ethnicity. Thus, the probability of data being missing is somewhat dependent on other attributes and is not truly random. In such cases, if the missing data is proportionally small, we can simply remove those rows. If, however, it is substantial, we can interpolate using linear or logistic regression against the conditional attributes.

3. Missing Not At Random (MNAR): When the data is neither MCAR nor MAR, and the missingness depends on the value itself, it is harder to fix. For example, individuals rarely provide their weight or their age, given personal apprehensions and cultural stigmas. In these cases missing data drastically affects the distribution, and hence the final inference drawn from the data. Bias and inaccuracies are common, and the popular press is rife with such mishaps. The best technical solution is to model the missing data along with your responses or targets.

0.3.1 Practical Handling of Missing Data:

The following are the various ways to deal with missing data:

1. You can eliminate rows that contain missing data points. If the missing data occurs in < 5% of rows, and you know the missingness is due to complete randomness (MCAR) or is conditionally random (MAR), then you can choose to simply eliminate those rows.

2. You can also choose to eliminate columns if you think the attribute will not provide any useful information in your analysis. This approach can also be taken if a substantial amount of data is missing (50 to 60% or greater) and including the attribute would heavily bias your inference.

3. Estimate missing values.

4. In the case of categorical data, you can choose to use logistic regression to predict the missing classes/labels. However, this will only work if the reason for missingness is MCAR or MAR. In the case of MNAR you cannot use existing predictors, and many of the missing classes will not have sufficient ground truth. In such cases, you may have to jointly learn the missing value along with the missingness.

5. In the case of continuous data, you can choose to use simple aggregate measures like the mean, median or mode, as those typically do not affect the overall distribution of the attribute. However, that only works when the number of missing values is not substantial (< 5%). If a larger number of values is missing, you can apply linear regression using existing predictors (use all predictors in the case of MCAR; use feature subset selection in addition to linear regression for MAR).

6. Another, simpler strategy is to use random sampling based on the probability distribution of the given data. This works really well when you know the missingness is due to MCAR.
