# CS代考 Chapter 3: Data Preprocessing – cscodehelp代写

Chapter 3: Data Preprocessing
n Data Preprocessing: An Overview
n Data Quality
n Major Tasks in Data Preprocessing

n Data Cleaning
n Data Integration
n Data Reduction
n Data Transformation and Data Discretization n Summary

Data Reduction Strategies
n Data reduction: Obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
n Why data reduction? — A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run with the complete data set.
n Data reduction strategies
n Dimensionality reduction (e.g., remove unimportant attributes)
n Wavelet transforms
n Principal Components Analysis (PCA) n Attribute subset selection
n Numerosity reduction (some simply call it “Data Reduction”)
n Regression and Log-Linear Models, Histograms, clustering,
n Data cube aggregation
n Data compression

Data Reduction 1: Dimensionality Reduction – Attribute Subset Selection
n Attribute Subset Selection: It reduces the data set size by removing irrelevant or redundant attributes (or dimensions)
n Redundant attributes
n Duplicate much or all of the information contained in one or
more other attributes
n E.g., purchase price of a product and the amount of sales
n Irrelevant attributes
n Contain no information that is useful for the data mining task at hand
n E.g., students’ ID is often irrelevant to the task of predicting students’ GPA

Data Reduction 1: Dimensionality Reduction – Attribute Subset Selection
n Goal: Find a minimum set of attributes so that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes.
n For n attributes, there are 2 n possible attribute combinations.
n Exhaustive search: An exhaustive search for the optimal subset of attributes can be overly expensive, especially as n and the number of data classes increase.
n Heuristic methods: They explore a reduced search space, which is commonly used for attribute subset selection.
n These methods are typically greedy in that, while searching through attribute space, they always make what looks to be the best choice at the time.

Heuristic Search in Attribute Selection
n Typical heuristic attribute selection methods:
n Stepwise forward selection: The procedure starts with an empty set of attributes as the reduced set. The best of the original attributes is determined and added to the reduced set. At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
n Stepwise backward elimination: The procedure starts with the full set of attributes. At each step, it removes the worst attribute remaining in the set.
n Note that how to determine the best/worst attribute is not easy, and will be discussed later.
n Combination of forward and backward elimination: The stepwise forward selection and backward elimination methods can be combined so that the procedure selects the best attribute and removes the worst simultaneously.

Heuristic Search in Attribute Selection
n Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were originally intended for classification.
n Decision tree induction constructs a flowchart-like structure where each internal (non-leaf) node denotes a test on an attribute, each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
n At each node, the algorithm chooses the “best” attribute to partition the data into individual classes.
n When decision tree induction is used for attribute subset selection, a tree is constructed from the given data.
n All attributes that do not appear in the tree are assumed to be irrelevant.
n The set of attributes appearing in the tree form the reduced subset of attributes.

Heuristic Search in Attribute Selection

Data Reduction 2: Numerosity Reduction
n Reduce data volume by choosing alternative, smaller forms of data representation
n Parametric methods
n Assume the data fit some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
n E.g. regression
n Non-parametric methods
n Do not assume models
n Major families: histograms, clustering, sampling …

Parametric Data Reduction: Regression
n Linear regression
n Data modeled to fit a straight line
n Often solved using the least square method
n Once the model is available, the model is kept and the original data are discarded.
n Multiple regression
n Allows a response variable Y to be modeled as a linear function of two or more variables

Mathematically…
n Linearregression:Y=wX+b
n Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
n Often applying the least squares method to the known values of Y1, Y2, …, X1, X2, ….
n Multiple regression: Y = b0 + b1 X1 + b2 X2
n Many nonlinear functions can be transformed into the above

Histogram Analysis
n Histograms can be used to approximate data distributions.
n Histograms were introduced in Section 2.2.3.
n A histogram for an attribute, X, partitions the data distribution of X into disjoint subsets, referred to as buckets or bins.
n Approximation at different levels results in different levels of data reduction.
n For example:
n A list of AllElectronics prices for commonly-sold items (rounded
to the nearest dollar) can be found below.
The prices have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.

n The following figure shows a basic histogram for the data.
n To further reduce the data, it is common to have each bucket denote a value range. In the following figure, each bucket represents a different \$10 range for price.
n In this scenario, other statistical quantities, such as bucket mean, could be used to help approximate data distribution.
Histogram Analysis

Histogram Analysis
How are the buckets determined and the attribute values partitioned?
n There are several partitioning rules, including the following:
n Equal-width: In an equal-width histogram, the width of each bucket range is uniform (e.g., the width of \$10 for the buckets in the previous figure).
n Equal-depth (or equal-frequency): In an equal-frequency histogram, the buckets are created so that, roughly, the frequency of each bucket is constant (i.e., each bucket contains roughly the same number of contiguous data samples).

Clustering
n Partition data set into clusters based on similarity, and store cluster representation (e.g., centroid and diameter) only
n The original data is discarded
n Can be very effective if data is clustered but not if
data is “smeared”
n There are many choices of clustering definitions and clustering algorithms
n Cluster analysis will be studied in detail in Chapter 10

n Sampling: obtaining a small sample s to represent the whole data set N
n It allows the execution of a mining algorithm with a complexity that is potentially sub-linear to the size of the data
n Key Principle: Choose a representative subset of the data
n Simple random sampling may have very poor performance in the presence of skew

Types of Sampling
Suppose that a large data set, D , contains N tuples.
n Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all tuples are equally likely to be sampled. n Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR, except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a tuple is drawn, it is placed back in D so that it may be drawn again. Types of Sampling SRSWOR vs. SRSWR Types of Sampling n Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters”, then an SRS of s clusters can be obtained, where s < M. n For example, tuples in a database are usually retrieved a page at a time, so that each page can be considered a cluster. A reduced data representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster sample of the tuples. Types of Sampling n Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample of D is generated by obtaining an SRS at each stratum. This helps ensure a representative sample, especially when the data are skewed. n For example, a stratified sample may be obtained from customer data, where a stratum is created for each customer age group. In this way, the age group having the smallest number of customers will be sure to be represented. Types of Sampling Data Reduction 3: Data Compression n String compression n There are extensive theories and well-tuned algorithms n Typically lossless, but only limited manipulation is possible n Audio/video compression n Typically lossy compression, with progressive refinement n Sometimes small fragments of signal can be reconstructed without reconstructing the whole Chapter 3: Data Preprocessing n Data Preprocessing: An Overview n Data Quality n Major Tasks in Data Preprocessing n Data Cleaning n Data Integration n Data Reduction n Data Transformation and Data Discretization n Summary Data Transformation n A function that maps the entire set of values of a given attribute to a new set of replacement values so that each old value can be identified with one of the new values n Attribute/feature construction: New attributes constructed from the given ones to help the mining process (“area”, which is based on “width” and “height”, could generated) n Aggregation: Summarization (e.g. daily sales => monthly total)
n Normalization: Scaled to fall within a specified range 23

Data Transformation
n Discretization (for numeric data): Raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0– 10, 11–20, etc.) or conceptual labels (e.g., youth, adult, senior).
n Concept hierarchy generation (for numeric and nominal data): Attributes such as street can be generalized to higher-level concepts, like city.

Normalization
n Min-max normalization: to [new_minA, new_maxA]
v’= v-minA (new_maxA -new_minA)+new_minA
maxA -minA
n Ex. Suppose that the income range \$12,000 to \$98,000 is
normalized to [0.0, 1.0]. Then \$73,600 is mapped to:
73,600 -12,000 (1.0 – 0) + 0 = 0.716 98,000 -12,000
n Min-max normalization preserves the relationships among the original data values.
n It will encounter an “out-of-bounds” error if a future input for normalization falls outside of the original data range.

Normalization
n Z-score normalization (μ: mean, σ: standard deviation):
n Ex. Let μ = 54,000, σ = 16,000, then \$73,600 is mapped to: 73,600-54,000 =1.225
n This method of normalization is useful when the actual minimum and maximum of attribute A are unknown, or when there are outliers that dominate the min-max normalization.

Normalization
n Normalization by decimal scaling: It normalizes by moving the decimal point of values of attribute A.
n The number of decimal points moved depends on the maximum absolute value of A.
n A value, v, is normalized to v’ by computing: