
COMP3115 Exploratory Data Analysis and Visualization

Lecture 3: Python libraries for data analytics: NumPy, Matplotlib and Pandas

Introduction to Python Programming (Cont’d) – refer to week2’s slides

Exploratory Data Analysis by Simple Summary Statistics (Cont’d) – refer to week1’s slides


A Brief Introduction to Python Libraries for Data Analytics: NumPy, Matplotlib and Pandas

An Example of using Pandas to Explore Data by Summary Statistics

NumPy stands for Numerical Python and it is the fundamental package for scientific computing with Python. NumPy is a Python library for handling multi-dimensional arrays.

It contains both the data structures needed for storing and accessing arrays, and the operations and functions for computation using these arrays.

Unlike lists, an array must have the same data type for all its elements.

The homogeneity of arrays allows highly optimized functions that use arrays as their inputs and outputs.

The major features of NumPy

Easily generate and store data in memory in the form of multidimensional array

Easily load and store data on disk in binary, text, or CSV format

Support efficient operations on data arrays, including basic arithmetic and logical operations, shape manipulation, data sorting, data slicing, linear algebra, statistical operation, discrete Fourier transform, etc.

Vectorised computation: simple syntax for elementwise operations without using loops (e.g., a = b + c where a, b, and c are three multidimensional arrays with same shape).

How to use NumPy?

In order to use NumPy, we need to import the module numpy first. A widely used convention is to use np as a short name of numpy.
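A minimal sketch of the convention:

```python
import numpy as np  # np is the standard abbreviation

a = np.array([1, 2, 3])
print(a * 2)  # vectorised elementwise operation
```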

Usages of high-dimensional arrays in data analysis: store matrices, solve systems of linear equations, compute eigenvalues/eigenvectors, matrix decompositions, …

Images and videos can be represented as NumPy arrays

A 2-dimensional table might store an input data matrix in data analysis, where each row represents a sample and each column represents a feature (commonly used in Scikit-learn).

Representation of 2-dimensional table

We obtain information about cases (records) in a dataset, and generally record the information for each case in a row of a data table.

A variable is any characteristic that is recorded for each case. The variables generally correspond to the columns in a data table.

A Variable

List vs. Array

Arrays need extra declaration while lists don’t.

Of the two, lists are used more often, and they work fine most of the time.

If you’re going to perform arithmetic operations on your data, you should really be using arrays instead.

Arrays will store your data more compactly and efficiently.

Creation of arrays

Four different approaches to create ndarray objects

1. Use the numpy.array() function to generate an ndarray object from any sequence-like object (e.g., list and tuple)

2. Use the built-in functions (e.g., np.zeros(), np.ones()) to generate some special ndarray objects. Use help( ) to find out the details of each function.

3. Generate ndarray with random numbers (random sampling). The numpy.random module provides functions to generate arrays of sample values from popular probability distributions.

4. Save an ndarray to a disk file, and read an ndarray from a disk file (e.g., np.load())

1. Creation of arrays (numpy.array())

Import the NumPy library

– Suggested to use the standard abbreviation np

Give a (nested) list as a parameter to the array constructor

– One dimensional array: list

– Two dimensional array: list of lists

– Three dimensional array: list of lists of lists

One dimensional array, Two dimensional array, Three dimensional array

In a two-dimensional array, you have rows and columns. The rows are indicated as “axis 0” while the columns are “axis 1”.

The number of axes goes up accordingly with the number of dimensions.
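The three cases can be sketched as follows (the values are illustrative):

```python
import numpy as np

a1 = np.array([1, 2, 3])                 # one-dimensional: from a list
a2 = np.array([[1, 2, 3], [4, 5, 6]])    # two-dimensional: from a list of lists
a3 = np.array([[[1], [2]], [[3], [4]]])  # three-dimensional: from a list of lists of lists

print(a2.shape)  # (2, 3): 2 rows along axis 0, 3 columns along axis 1
print(a3.ndim)   # 3
```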

2. Creation of arrays (built-in functions)

Useful function to create common types of arrays

– np.zeros(): all elements are 0s

– np.ones(): all elements are 1s

– np.full(): all elements set to a specified value

– np.empty(): all elements are uninitialized

– np.eye(): identity matrix: a matrix with elements on the diagonal are 1s, others are 0s

– np.arange(): generate evenly spaced values within a given interval.
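A quick sketch of these creators:

```python
import numpy as np

print(np.zeros((2, 3)))     # 2x3 array of 0s
print(np.ones(4))           # four 1s
print(np.full((2, 2), 7))   # all elements set to 7
print(np.empty(3))          # uninitialized (arbitrary) values
print(np.eye(3))            # 3x3 identity matrix
print(np.arange(0, 10, 2))  # evenly spaced values in [0, 10)
```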

3. Creation of arrays (random sampling)

The numpy.random module provides functions to generate arrays of sample values from popular probability distributions.
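For instance, using the modern Generator interface (a seed is passed for reproducibility; older functions such as np.random.rand() work similarly):

```python
import numpy as np

rng = np.random.default_rng(seed=0)         # seeded generator for reproducibility
u = rng.random((2, 3))                      # uniform samples in [0, 1)
n = rng.normal(loc=0.0, scale=1.0, size=5)  # standard normal samples
print(u.shape, n.shape)
```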

4. Creation of arrays (from disk file)

Binary format (which is not suitable for humans to read)

Text format (which is suitable for humans to read)
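A sketch of both formats, written to a temporary directory:

```python
import os
import tempfile
import numpy as np

a = np.arange(6).reshape(2, 3)
tmp = tempfile.mkdtemp()

np.save(os.path.join(tmp, 'a.npy'), a)     # binary .npy format
b = np.load(os.path.join(tmp, 'a.npy'))

np.savetxt(os.path.join(tmp, 'a.txt'), a)  # human-readable text format
c = np.loadtxt(os.path.join(tmp, 'a.txt'))

print(np.array_equal(a, b), np.array_equal(a, c))
```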

Array types and attributes

An array has several attributes:

– ndim: the number of dimensions

– shape: size in each dimension

– size: the number of elements

– dtype: the type of elements
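For example:

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a.ndim)   # number of dimensions
print(a.shape)  # size in each dimension
print(a.size)   # number of elements
print(a.dtype)  # element type
```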

A one-dimensional array works like a list.

For a multi-dimensional array, the index is a comma-separated tuple instead of a single integer.

Note that if you give only a single index to a multi-dimensional array, it indexes the first dimension of the array.

Slicing works similarly to lists, but now we can have slices in different dimensions.

We can even assign to a slice

Extract rows or columns from an array
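The indexing, slicing and assignment described above can be sketched as:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(a[1, 2])    # comma-separated tuple index
print(a[0])       # a single index picks along the first dimension: row 0
print(a[:2, 1:])  # slices in both dimensions
print(a[1, :])    # extract the second row
print(a[:, 0])    # extract the first column
a[0, :2] = 0      # assign to a slice
print(a[0])
```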

Arithmetic Operations in NumPy

The basic arithmetic operations in NumPy are defined in vector form. The name vector operation comes from linear algebra.

– addition of two vectors 𝐚 = [𝑎1, 𝑎2], 𝐛 = [𝑏1, 𝑏2] is element-wise addition 𝐚 + 𝐛 = [𝑎1 + 𝑏1, 𝑎2 + 𝑏2]

Arithmetic Operations

– +: addition

– -: subtraction

– *: multiplication

– /: division

– //: floor division

– **: power

– %: remainder
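All of these operators work element-wise:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])
print(a + b)   # element-wise addition
print(a - b)
print(a * b)
print(b / a)
print(b // a)  # floor division
print(b ** a)  # power
print(b % a)   # remainder
```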

Aggregations: max, min, sum, mean, standard deviations…

Aggregations allow us to describe the information in an array using a few numbers.

Aggregation over certain axes

Instead of aggregating over the whole array, we can aggregate over certain axes only as well.
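For example:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.sum())        # aggregate over the whole array
print(a.sum(axis=0))  # sum along axis 0, i.e. column sums
print(a.max(axis=1))  # maximum of each row
print(a.mean())
```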

Comparisons

Just like NumPy allows element-wise arithmetic operations between arrays, it is also possible to compare two arrays element-wise.

We can also count the number of comparisons that were True. This solution relies on the interpretation that True corresponds to 1 and False corresponds to 0.

Another use of boolean arrays is that they can be used to select a subset of elements. It is called masking.

It can also be used to assign a new value. For example the following zeroes out the negative numbers.
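The comparison, counting, masking and assignment steps above can be sketched as:

```python
import numpy as np

a = np.array([-2, -1, 0, 1, 2])
mask = a > 0
print(mask)        # element-wise comparison gives a boolean array
print(mask.sum())  # True counts as 1, False as 0
print(a[mask])     # masking selects a subset of elements
a[a < 0] = 0       # zero out the negative numbers
print(a)
```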

Fancy Indexing

Using indexing we can get a single element from an array. If we wanted multiple (not necessarily contiguous) elements, we would have to index several times.

That’s quite verbose. Fancy indexing provides a concise syntax for accessing multiple elements.

We can also assign to multiple elements through fancy indexing.
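For example:

```python
import numpy as np

a = np.array([10, 20, 30, 40, 50])
print(a[[0, 2, 4]])  # fancy indexing: multiple, non-contiguous elements
a[[1, 3]] = -1       # assign to multiple elements at once
print(a)
```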

Matrix operations

NumPy supports a wide variety of matrix operations, such as matrix multiplication, solving systems of linear equations, computing eigenvalues/eigenvectors, matrix decompositions, and other linear algebra operations.

The matrix operations will be discussed in detail when we talk about machine learning algorithms.

Matplotlib (brief introduction)

Matplotlib

Visualization is an important technique to help understand the data.

Matplotlib is the most common low-level visualization library for Python.

It can create line graphs, scatter plots, density plots, histograms, heatmaps, and so on.

Simple Figure

Simple line plot

Two line plots in the same figure

SubFigures

One can create a figure with several subfigures using the command plt.subplots.

It creates a grid of subfigures, where the number of rows and columns in the grid are given as parameters.

It returns a pair of a figure object and an array containing the subfigures. In matplotlib the subfigures are called axes.
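A minimal sketch (the Agg backend is selected so the script also runs without a display; in a notebook the default backend is fine):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for headless use
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)  # a figure with a 2x2 grid of axes (subfigures)
axes[0, 0].plot([1, 2, 3], [1, 4, 9])
axes[1, 1].plot([1, 2, 3], [3, 2, 1])
fig.tight_layout()
plt.close(fig)
```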

Scatter plots

The scatterplot is a visualization technique that enjoys widespread use in data analysis and is a powerful way to convey information about the relationship between two variables.
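A minimal sketch with synthetic data (the linear relationship is illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for headless use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(50)
y = 2 * x + rng.normal(scale=0.1, size=50)  # a noisy linear relationship

fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel('x')
ax.set_ylabel('y')
plt.close(fig)
```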

More examples will be shown in lab session.

A 2D array in NumPy is not easily interpreted without external information. One solution is to give a descriptive name to each column. These column names stay fixed and attached to their corresponding columns, even if we remove some of the columns.

In addition, the rows can be given names as well; in Pandas these are called indices.

Pandas is a high-performance open-source Python library for data manipulation and analysis

– We usually use an alias pd for pandas, like np for numpy

Pandas Data Structures

1-dimensional: Series

– 1D labeled homogeneously-typed array

2-dimensional: DataFrame

– General 2D labeled, size-mutable tabular structure with potentially heterogeneously-typed columns

Series is a 1-dimensional labeled array, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

All data items in a Series must be the same data type

Each item in the array has an associated label, also called index. Why do we need labels?

– Like a dict, a data item can be quickly located by its label.

– Remark 1: Unlike a dict, labels in a Series don’t need to be unique

– Remark 2: Unlike a dict, the size of a Series is fixed after its creation

Creation and indexing of series

Series is the one-dimensional version of a DataFrame. Each column in a DataFrame is a Series.

One can turn any one-dimensional iterable into a Series, which is a one-dimensional data structure.

We can attach a name to this series

Row indices of Series

Series can be created by pd.Series()

In addition to the values of the series, the row indices are printed as well. All the accessing methods from NumPy arrays also work for the Series: indexing, slicing, masking and fancy indexing.
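For example (values and labels are illustrative):

```python
import pandas as pd

s = pd.Series([4, 8, 15], index=['a', 'b', 'c'], name='scores')
print(s['b'])      # access by label
print(s[s > 5])    # masking, as with NumPy arrays
print(s['a':'b'])  # slicing by explicit labels includes both endpoints
```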

Note that the indices stick to the corresponding values; they are not renumbered!

The values of Series as a NumPy array are accessible via the values attribute.

The indices are available through the index attribute.

The index is not simply a NumPy array, but a data structure that allows fast access to the elements.

It is still possible to access the series using NumPy style implicit integer indices.

This can be confusing though.

Pandas offers the attributes loc and iloc. The attribute loc always uses the explicit index, while the attribute iloc always uses the implicit integer index.
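The difference matters when the explicit index itself consists of integers:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 0, 1])
print(s.loc[0])   # explicit index label 0
print(s.iloc[0])  # implicit position 0
```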

The Pandas library is built on top of the NumPy library, and it provides a special kind of two dimensional data structure called DataFrame.

The DataFrame allows giving names to the columns, so that one can access a column using its name in place of the column's index.

Creation of Dataframes

The DataFrame is essentially a two dimensional object, and it can be created in four different ways:

– from a two dimensional NumPy array

– from given columns

– from given rows

– from a local file

1. Creating DataFrames from a NumPy array

In the following example a DataFrame with 2 rows and 3 columns is created. The row and column indices are given explicitly.

If either the columns or index argument is left out, an implicit integer index will be used.
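For example (the index and column names are illustrative):

```python
import numpy as np
import pandas as pd

data = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(data, index=['r1', 'r2'], columns=['a', 'b', 'c'])
print(df)
print(pd.DataFrame(data))  # implicit integer row and column indices
```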

2. Creating DataFrames from columns

A column can be specified as a list, a NumPy array, or a Pandas Series.

Input is a dictionary, keys give the column names and values are the actual column content.
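A sketch with made-up column names, mixing the three column types:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Ada', 'Bob'],       # a list
    'score': np.array([90, 85]),  # a NumPy array
    'rank': pd.Series([1, 2]),    # a Pandas Series
})
print(df)
```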

3. Creating DataFrames from rows

We can give a list of rows as a parameter to the DataFrame constructor.
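For example:

```python
import pandas as pd

rows = [['Ada', 90], ['Bob', 85]]  # each inner list is one row
df = pd.DataFrame(rows, columns=['name', 'score'])
print(df)
```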

4. Creating DataFrames from a local file

Import ‘iris.csv’

We see that the DataFrame contains five columns, four of which are numerical variables.
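In practice this is a single call, pd.read_csv('iris.csv'); the sketch below uses a tiny in-memory stand-in for the file so it is self-contained:

```python
import io
import pandas as pd

# A two-row stand-in for 'iris.csv'
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)  # four numeric columns and one object (string) column
```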

Accessing columns and rows of a Dataframe

We can refer to a column by its name

It is recommended to use the attributes loc and iloc for accessing columns and rows in a DataFrame.

loc uses explicit indices and iloc uses implicit integer indices.

Drop a column

We can drop some columns from the DataFrame with the drop method.

We can use the inplace parameter of the drop method to modify the original DataFrame.

Many of the modifying methods of the DataFrame have the inplace parameter.
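Both variants can be sketched as (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
df2 = df.drop(columns=['b'])          # returns a new DataFrame
print(list(df2.columns))
df.drop(columns=['c'], inplace=True)  # modifies df itself
print(list(df.columns))
```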

Add a new column in DataFrame
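A new column can be assigned directly by name, for example a derived column (the column names below follow the iris example):

```python
import pandas as pd

df = pd.DataFrame({'sepal_length': [5.1, 4.9], 'sepal_width': [3.5, 3.0]})
df['ratio'] = df['sepal_length'] / df['sepal_width']  # derived column
print(df)
```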

Summary statistic methods on Pandas columns

There are several summary statistic methods that operate on a column or on all the columns.

Summary statistics of Pandas

The summary statistic methods work in a similar way as their counterparts in NumPy. By default, the aggregation is done over columns.

The describe method of the DataFrame object gives different summary statistics for each (numeric) column. The result is a DataFrame. This method gives a good overview of the data, and is typically used in the exploratory data analysis phase.
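For example, with a small made-up table:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0], 'y': [10, 20, 30, 40]})
print(df.mean())      # aggregation over columns by default
print(df.describe())  # count, mean, std, min, quartiles, max per numeric column
```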

Use Summary Statistics to Explore Data

Use Summary Statistics to Explore Data (in Python)

Load and see the data

Feature Type

Show the feature types

Summary statistics in Python: mean; standard deviation; min; Q2 (median)

Compute the mean, standard deviation and median for the feature ‘sepal_length’

Summary statistics in Python: mode

Get counts of unique values for ‘species’

Compute the mode for ‘species’
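Both steps can be sketched with a small hand-made column (the real iris data has 150 rows):

```python
import pandas as pd

species = pd.Series(['setosa', 'setosa', 'versicolor', 'setosa'])
print(species.value_counts())  # counts of unique values
print(species.mode()[0])       # most frequent value
```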

Summary statistics for different groups

Explore more about the data

The petal_length of ‘setosa’ is always smaller than the petal_length of ‘versicolor’

Summary statistics for different groups

Explore more about the data

The petal_width of ‘setosa’ is always smaller than the petal_width of ‘versicolor’

Summary statistics for different groups

Explore more about the data

• We have our own way to distinguish ‘setosa’ and ‘versicolor’ (i.e., using petal_length or petal_width)

• The knowledge is gained by performing simple summary statistics techniques on data.
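The per-group comparison above can be sketched with groupby (the values below are illustrative, not the real iris data):

```python
import pandas as pd

df = pd.DataFrame({
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor'],
    'petal_length': [1.4, 1.3, 4.7, 4.5],  # illustrative values
})
print(df.groupby('species')['petal_length'].describe())
```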

The pattern is more obvious when we visualize it

The petal_length of ‘setosa’ is always smaller than the petal_length of ‘versicolor’

The petal_width of ‘setosa’ is always smaller than the petal_width of ‘versicolor’

It is very easy to classify ‘setosa’ and ‘versicolor’ just based on petal_length and petal_width.
